MCMC
What is MCMC
The biggest problem in probabilistic modeling is solving complex integrals
- MCMC = a method of sampling
- Can be used to estimate moments of a distribution, contours, etc.
- Don't need the functional form of the posterior
- The target distribution doesn't need to be normalized
- Alternatives
- Direct simulation (simple distributions only)
- Simple Bayesian -> use a conjugate prior (if the posterior shape is reasonable; sketched below)
These alternatives are more desirable when available, but generally are not for hierarchical models
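For contrast, a minimal sketch of the conjugate-prior alternative, assuming a Beta(2,2) prior on a Binomial success probability (an illustrative choice): the posterior is available in closed form, so no sampling is needed.

a <- 2; b <- 2                 # Beta(2,2) prior
successes <- 7; trials <- 10   # hypothetical observed data
# conjugacy: Beta prior + Binomial likelihood -> Beta posterior, in closed form
curve(dbeta(x, a + successes, b + trials - successes), from = 0, to = 1,
      xlab = "theta", ylab = "Posterior density")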
Intuition
- It's simpler to implement than you may think.
Monte Carlo
- Method of obtaining samples from a distribution
- Generate a number from the uniform distribution
- Transform the sample (transformation theory; see the sketch below)
- MC alone works only if you know the functional form of the distribution
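A minimal sketch of this, assuming an Exponential(2) target, chosen because its inverse CDF has a simple closed form:

# Inverse-transform sampling: draw u ~ Uniform(0,1) and push it through the
# inverse CDF of the target; for Exponential(rate) that is -log(1 - u)/rate
u <- runif(10000)
x <- -log(1 - u) / 2
hist(x, breaks = 50, freq = FALSE)
curve(dexp(x, rate = 2), add = TRUE)  # overlay the true density as a check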
Markov-Chains
- Directed graph
- Nodes represent states
- Edges represent transitions
- Edge weights are transition probabilities
- Probability of being in a specific state depends only on the last state (see the sketch below)
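A minimal sketch of a two-state Markov chain; the transition matrix here is an arbitrary illustrative choice:

Tm <- matrix(c(0.9, 0.1,    # from state 1: stay w.p. 0.9, move w.p. 0.1
               0.5, 0.5),   # from state 2: 50/50
             nrow = 2, byrow = TRUE)
nSteps <- 10000
s <- rep(1, nSteps)
for (i in 2:nSteps) {
  s[i] <- sample(1:2, 1, prob = Tm[s[i - 1], ])  # next state depends only on the current one
}
table(s) / nSteps  # long-run proportions approach the chain's stationary distribution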
Markov-Chain Monte-Carlo
- The Markov chain is aperiodic and generative
- States are the previously sampled values
- Transition probabilities are generated from the state value
- The next state is either the same or generated from the current state
Acceptance/Rejection algorithm
- Allows samples to be accepted preferentially from areas of higher probability
- MCMC without this = random walk
- Metropolis-Hastings is by far the most popular algorithm
- Not to be confused with 'proposal algorithms'
Metropolis-Hastings
- General case of many different algorithms
- Metropolis
- Gibbs Sampler
- Hamiltonian MCMC
- …
Initialization
- f(x) is a density function proportional to the target distribution (i.e., it doesn't have to be normalized)
- A. choose a starting value x_0
- B. choose an (arbitrary) proposal distribution q
- Poisson/normal/uniform are common -> "Metropolis algorithm"
- q perturbs the current state of the chain
The Algorithm
for i in 1:nSamples
- draw a proposal x_new ∼ q(x | x_{t-1})
- calculate the acceptance ratio A, the probability of accepting x_new:
$$A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}$$
- if the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm
- generate u ∼ Uniform(0,1)
- if u < min(1, A), accept: x_t = x_new; otherwise stay: x_t = x_{t-1}
Choosing a Proposal Distribution
- Different proposal distributions can be more efficient at exploring the target distribution
- Bimodal distribution example - may need way more samples to converge
- Consider model constraints/bounds
- Posterior distribution for a variance parameter: samples can never be less than zero -> use an asymmetric proposal distribution (sketched below)
- Consider shape of target distribution
- Highly kurtotic -> should use a proposal distribution with higher variance
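A minimal sketch of the asymmetric case, assuming a Gamma(2,1) density as a stand-in for a variance posterior (an illustrative target): a lognormal random-walk proposal keeps every draw positive, but because it is asymmetric the Hastings correction must be kept in the acceptance ratio.

nSamp <- 10000
x <- rep(1, nSamp + 1)
for (i in 1:nSamp) {
  P <- rlnorm(1, meanlog = log(x[i]), sdlog = 0.5)  # proposal is always > 0
  A <- ( dgamma(P, shape = 2, rate = 1) / dgamma(x[i], shape = 2, rate = 1) ) *
       ( dlnorm(x[i], meanlog = log(P),    sdlog = 0.5) /
         dlnorm(P,    meanlog = log(x[i]), sdlog = 0.5) )  # Hastings term
  x[i + 1] <- if (runif(1) < A) P else x[i]
}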
Basic Examples
Example Code: Metropolis - Normal Distribution, few samples
nSamp <- 100
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (symmetric proposal, so no Hastings term)
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: true density, sampled density, and trace
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], 0.01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Example: Metropolis - Normal Distribution, many samples
nSamp <- 100000
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (symmetric proposal, so no Hastings term)
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: true density, sampled density, and trace
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], 0.01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Example: Metropolis - Complex Bayes (2 logN observations, exponential prior)
nSamp <- 100000
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (unnormalized target in numerator and denominator)
  A <- ( dexp(P)^2 * dlnorm(P, meanlog = 4, sdlog = 2) ) /
       ( dexp(x[i])^2 * dlnorm(x[i], meanlog = 4, sdlog = 2) )
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: sampled density and trace (no closed-form true density to overlay)
yl <- c(-5, 5)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 2,
                 1, 2, 2,
                 1, 2, 2), nrow = 3, byrow = TRUE)
layout(lmat)
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Issues
Starting in a state of extremely low probability
- Solution - discard the burn-in (sketched below, together with thinning)
- Burn-in - the time the chain takes to stabilize
- can be arbitrary - make sure to check the trace across iterations
- generally 1000-5000 samples
Sparse distribution areas
- Solutions
- Larger sample size
- Repeat with a range of different initial values (sketched below)
- Use a better suited proposal distribution/algorithm
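A minimal sketch of the multiple-starts check, wrapping the normal-target sampler from above in a hypothetical run_chain() helper: if chains started far apart settle into the same region, convergence is plausible.

run_chain <- function(x0, nSamp = 5000) {  # hypothetical helper, not from the notes
  x <- rep(x0, nSamp + 1)
  for (i in 1:nSamp) {
    P <- rnorm(1, mean = x[i], sd = 1)
    A <- dnorm(P) / dnorm(x[i])
    x[i + 1] <- if (runif(1) < A) P else x[i]
  }
  x
}
chains <- lapply(c(-10, 0, 10), run_chain)  # three dispersed starting values
matplot(do.call(cbind, chains), type = "l", lty = 1, ylab = "x")  # overlaid traces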
Autocorrelation of samples from Markov Chain
- Solution - thinning (see the sketch below)
- e.g., keep every 3rd sample, discard the rest
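A minimal sketch combining burn-in removal and thinning, assuming x holds a raw chain from one of the examples above; the 1000-sample burn-in and keep-every-3rd thinning are illustrative choices:

burnIn <- 1000
x.kept <- x[-(1:burnIn)]                          # discard burn-in
x.kept <- x.kept[seq(1, length(x.kept), by = 3)]  # thin: keep every 3rd sample
acf(x.kept)                                       # check the remaining autocorrelation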
Multivariate distributions
Curse of Dimensionality
- Solutions
- Use Gibbs sampler
- Use Hamiltonian MCMC
Proposal algorithms
Metropolis
- Proposal is simply drawn from a symmetric distribution, such as a uniform or normal centered on the current state
- Hastings term = 1
- Discovered before Metropolis-Hastings, so the naming is confusing
Gibbs sampler
When to use
- When you know the posterior distribution (required)
- With multivariate distributions -> breaks the proposal down into smaller proposals, one per dimension
- The conditional distribution of each variable is known and easy to sample from (see the sketch below)
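A minimal sketch, assuming the target is a standard bivariate normal with correlation rho = 0.8 (an illustrative choice): each coordinate is drawn from its full conditional in turn, and there is no accept/reject step.

nSamp <- 10000
rho <- 0.8
x <- y <- rep(0, nSamp)
for (i in 2:nSamp) {
  # full conditionals of a standard bivariate normal are themselves normal
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
plot(x, y, pch = 20, cex = 0.3)  # samples trace out the elliptical target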
Hamiltonian (a variant of Metropolis-Hastings)
- Works extremely well for Bayesian problems
- Only the proposal steps (1 & 2) differ; the accept/reject step is the same
When to use
- Continuous distributions (required)
- Whenever possible…
- Stan
Benefits
- Reduces correlation between sampled states
- Converges much faster - doesn't get stuck the way random-walk proposals can (e.g., on torus-shaped distributions)
Problems
- U-turns - the simulated trajectory can double back on itself, so proposals end up very close to the current state
- Use the "No-U-Turn Sampler" (NUTS) variant in Stan
- Multi-modal distributions
- Can't be used with discrete distributions
Intuition
Use quantum particle physics as an analogy
1. Think of the probability density as an inverted energy well
- Hockey puck analogy
- The particle will spend more time in the bottom of the well (lower energy states)
- Take advantage of this in order to be more efficient (draw more proposals from high-probability areas)
2. Energy states are discrete
- If a particle absorbs energy, it can jump states
- The particle is constantly absorbing and radiating energy
Summary of energy states -> canonical distribution
- $$p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when the Boltzmann constant = 1}$$
- (Gibbs canonical distribution)
Take advantage of the energy landscape in order to better explore it
- Using joint distributions
Every time we accept a new proposal of p, the system is at a different energy level
Relation of the Hamiltonian to a posterior density
- $$H(\theta,p) = U(\theta) + K(p)$$, with potential energy $$U(\theta) = -\log[p(x|\theta)\,p(\theta)]$$ (the negative log posterior, up to a constant) and kinetic energy $$K(p) = \frac{p^2}{2}$$
Proposals
- p ∼ N(0,1)
- θ from the path of the particle through both dimensions of (θ, p) space
Solve for the path of the particle from U & K
$$\frac{dp}{dt}=-\frac{\partial H}{\partial \theta}$$
$$\frac{d\theta}{dt}=\frac{\partial H}{\partial p}$$
- Integrate continuously, approximated by small enough discrete steps (the leapfrog integrator)
- For t amount of time
- Flip the sign of the momentum after solving the last step of the path (this makes the proposal reversible)
- The endpoint gives the proposals θ_new and p_new
Acceptance ratio
$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}$$
- Paths are deterministic
- so the forward proposal term = 1
- and, because of the momentum flip, the reverse term = 1 as well (not intuitive)
$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}$$
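A minimal sketch of the whole procedure in one dimension, assuming a standard normal target (so U(θ) = θ²/2 and dU/dθ = θ); the step size and path length are illustrative, untuned choices:

U  <- function(theta) theta^2 / 2      # potential energy = -log target (up to a constant)
dU <- function(theta) theta            # gradient of U
eps <- 0.1; L <- 20                    # leapfrog step size and number of steps
nSamp <- 10000
x <- rep(0, nSamp + 1)
for (i in 1:nSamp) {
  theta <- x[i]
  p <- rnorm(1)                        # fresh momentum p ~ N(0,1)
  H0 <- U(theta) + p^2 / 2             # energy at the start of the path
  p <- p - eps / 2 * dU(theta)         # leapfrog: opening half step in momentum
  for (s in 1:L) {
    theta <- theta + eps * p           # full step in position
    if (s < L) p <- p - eps * dU(theta)
  }
  p <- p - eps / 2 * dU(theta)         # closing half step in momentum
  p <- -p                              # flip momentum for reversibility
  A <- exp(H0 - (U(theta) + p^2 / 2))  # acceptance ratio = exp(-change in H)
  x[i + 1] <- if (runif(1) < A) theta else x[i]
}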
Stan
- Super powerful once you know the syntax (feels like R)
- Full Bayesian inference and maximum likelihood estimation
- Input your model and parameters into the appropriate blocks
- Everything is compiled to C++ under the hood
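A minimal sketch of the workflow from R, assuming the rstan package is installed, with a toy normal-mean model and made-up data:

library(rstan)
model_code <- "
data { int<lower=0> N; vector[N] y; }
parameters { real mu; }
model {
  mu ~ normal(0, 10);  // prior
  y ~ normal(mu, 1);   // likelihood
}
"
dat <- list(N = 5, y = c(1.2, 0.8, 1.5, 0.9, 1.1))  # made-up observations
fit <- stan(model_code = model_code, data = dat, chains = 2, iter = 2000)  # NUTS by default
print(fit)  # posterior summary for mu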
Summary
- MCMC is (mostly) easy
- Metropolis-Hastings is the general case
- Proposal algorithm can vary depending on your needs
- To ensure robust sampling:
- Check your sampling, update your procedure
- Use large sample sizes
- Apply thinning
- Discard burn-in
- Use the correct proposal algorithm for the job
- Hamiltonian MCMC is fantastic for moderate to complex models
New
Importance sampling
Tempered Gibbs sampling
Tempering
exploration vs exploitation
asymptotic variance
single and double flip update