Thursday, June 21, 2012

Mixture of Normal Distributions

In this post I show a simple illustration of a mixture of normal distributions. For the examples, we assume we have metric values that we suppose are generated by a mixture of two different normal distributions, which I'll call clusters. We don't know which cluster each datum came from. Our goal is to estimate the probability that each datum came from each of the two clusters, and the means and SD of the normal distributions that describe the clusters.
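
To make the "probability that a datum came from each cluster" concrete, here is a minimal sketch in R of Bayes' rule for a two-cluster mixture. It assumes the means, the shared SD, and the mixing proportion are known, which is not the case in the model below, where they are all estimated jointly. The function name probClust2 and the test values in yGrid are hypothetical, chosen only for illustration.

```r
# Hypothetical helper (not part of the original post): probability that a
# value y belongs to cluster 2, given KNOWN cluster means m1 and m2, a shared
# SD s, and a prior proportion p2 for cluster 2.
probClust2 = function( y , m1 , m2 , s , p2 = 0.5 ) {
  like1 = (1-p2) * dnorm( y , mean=m1 , sd=s )  # weighted density, cluster 1
  like2 = p2 * dnorm( y , mean=m2 , sd=s )      # weighted density, cluster 2
  like2 / ( like1 + like2 )                     # Bayes' rule
}

# Illustrative values: the two cluster means and their midpoint.
yGrid = c( 100 , 122.5 , 145 )
pGrid = probClust2( yGrid , m1=100 , m2=145 , s=15 )
print( round( pGrid , 3 ) )
```

With equal proportions, a value midway between the two means is equally likely to belong to either cluster, and the probability moves toward 0 or 1 as the value approaches one of the means.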

The model specification (for JAGS): The model assumes that the clusters have the same standard deviation but different means.

model {
    # Likelihood:
    for( i in 1 : N ) {
      y[i] ~ dnorm( mu[i] , tau )
      mu[i] <- muOfClust[ clust[i] ]
      clust[i] ~ dcat( pClust[1:Nclust] )
    }
    # Prior:
    tau ~ dgamma( 0.01 , 0.01 )
    for ( clustIdx in 1: Nclust ) {
      muOfClust[clustIdx] ~ dnorm( 0 , 1.0E-10 )
    }
    pClust[1:Nclust] ~ ddirch( onesRepNclust )
}
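
For readers who prefer conventional notation, the JAGS code above can be written out as follows. JAGS parameterizes dnorm by precision rather than SD, so the shared SD is sigma = 1/sqrt(tau), and dgamma takes a shape and a rate. The symbols c_i, K, and p are shorthand introduced here for clust[i], Nclust, and pClust.

```latex
\begin{align*}
y_i \mid c_i, \mu, \tau &\sim \mathrm{Normal}\!\left(\mu_{c_i},\ \sigma^2 = 1/\tau\right), & i &= 1,\dots,N \\
c_i \mid p &\sim \mathrm{Categorical}(p_1,\dots,p_K) \\
\tau &\sim \mathrm{Gamma}(\text{shape}=0.01,\ \text{rate}=0.01) \\
\mu_k &\sim \mathrm{Normal}\!\left(0,\ \text{precision}=10^{-10}\right), & k &= 1,\dots,K \\
p &\sim \mathrm{Dirichlet}(1,\dots,1)
\end{align*}
```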


The data specification:

# Generate random data from known parameter values:
set.seed(47405)
trueM1 = 100
N1 = 200
trueM2 = 145 # 145 for first example below; 130 for second example
N2 = 200
trueSD = 15
effsz = abs( trueM2 - trueM1 ) / trueSD
y1 = rnorm( N1 )
y1 = (y1-mean(y1))/sd(y1) * trueSD + trueM1
y2 = rnorm( N2 )
y2 = (y2-mean(y2))/sd(y2) * trueSD + trueM2
y = c( y1 , y2 )
N = length(y)

# Must have at least one data point with fixed assignment 
# to each cluster, otherwise some clusters will end up empty:
Nclust = 2
clust = rep(NA,N) 

clust[which.min(y)]=1 # smallest value assigned to cluster 1
clust[which.max(y)]=2 # highest value assigned to cluster 2 
dataList = list(
    y = y ,
    N = N ,
    Nclust = Nclust ,
    clust = clust ,
    onesRepNclust = rep(1,Nclust)
)
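
The data-generation code above rescales each random sample so that its sample mean and SD match the intended values exactly, rather than only on average. The following small check illustrates that step for one cluster. The names yRaw and ySt are introduced here for illustration and do not appear in the original script.

```r
# Illustration of the standardize-then-rescale step for a single cluster.
set.seed(47405)
trueM = 100
trueSD = 15
n = 200
yRaw = rnorm( n )                                        # raw standard-normal draws
ySt = ( yRaw - mean(yRaw) ) / sd(yRaw) * trueSD + trueM  # force exact mean and SD
# The sample mean and SD now equal the targets, up to floating-point error:
print( c( mean(ySt) , sd(ySt) ) )
print( all.equal( mean(ySt) , trueM ) )
print( all.equal( sd(ySt) , trueSD ) )
```

Because every sample has exactly the intended mean and SD, any mismatch between the posterior and the generating values reflects the model and the cluster overlap, not sampling noise in the summary statistics.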

Results when mean of cluster 2 is 3 standard deviations away from mean of cluster 1: The posterior recovers the generating values fairly well.

Upper panel: Data with underlying normal generators.
Lower panel: For each datum, the posterior probability that it is assigned to cluster 2.

Marginal posterior on cluster means and SD.

Pairs plot of cluster means and SD.


Results when mean of cluster 2 is 2 standard deviations away from mean of cluster 1: There is lots of uncertainty. See captions for discussion.

Lower panel: Notice that the lowest and highest data values have fixed cluster assignments, but all the other data values have posterior probabilities of cluster assignment noticeably far from 0 or 1.

Notice the bimodal distribution of sigma (SD).

Notice in the right column that when sigma is small, around 15, the cluster means are near their true generating values. But when sigma is large, the cluster means get close together. Essentially, there is a bimodal posterior: Either there are two clusters, with smaller sigma and distinct means, or there is one cluster, with larger sigma and both cluster means set near the mean of the one cluster.
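
As a rough check on the "one cluster" interpretation, the sketch below regenerates the data for the second example (cluster means 100 and 130) and computes the mean and SD of the pooled data. If both cluster means collapse to the overall mean, one would expect sigma to sit near the pooled SD instead of near 15. This is a back-of-the-envelope calculation on the data, not output from the JAGS run.

```r
# Regenerate the second example's data (trueM2 = 130), as in the data
# specification, and look at the pooled sample.
set.seed(47405)
trueM1 = 100 ; trueM2 = 130 ; trueSD = 15
N1 = 200 ; N2 = 200
y1 = rnorm( N1 )
y1 = (y1-mean(y1))/sd(y1) * trueSD + trueM1
y2 = rnorm( N2 )
y2 = (y2-mean(y2))/sd(y2) * trueSD + trueM2
y = c( y1 , y2 )
pooledMean = mean( y )  # midway between the two cluster means
pooledSD = sd( y )      # what a single normal would need to cover all the data
print( c( pooledMean , pooledSD ) )
# Approximate theoretical value: within-cluster variance plus the squared
# distance of each cluster mean from the overall mean.
print( sqrt( trueSD^2 + ( (trueM2-trueM1)/2 )^2 ) )
```

The pooled SD comes out a little above 21, noticeably larger than the within-cluster SD of 15, which is consistent with a second posterior mode at larger sigma.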

