MCMC
What is MCMC
The biggest problem in probabilistic modeling is solving complex integrals
- MCMC = a method of sampling
- Can be used to estimate moments of a distribution, contours, etc.
- Don't need the functional form of the posterior
- The target distribution doesn't need to be normalized
- Alternatives
- Direct simulation (simple distributions only)
- Simple Bayesian -> use a conjugate prior (if the posterior shape is reasonable; sketched below)
These alternatives are more desirable when available, but generally are not for hierarchical models
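For contrast, a minimal sketch of the conjugate-prior alternative, assuming a Beta(2,2) prior on a Binomial success probability (an illustrative choice): the posterior is available in closed form, so no sampling is needed.

a <- 2; b <- 2                 # Beta(2,2) prior
successes <- 7; trials <- 10   # hypothetical observed data
# conjugacy: Beta prior + Binomial likelihood -> Beta posterior, in closed form
curve(dbeta(x, a + successes, b + trials - successes), from = 0, to = 1,
      xlab = "theta", ylab = "Posterior density")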
Intuition
- It's simpler to implement than you may think.
Monte Carlo
- Method of obtaining samples from a distribution
- Generate a number from the uniform distribution
- Transform the sample (transformation theory; see the sketch below)
- MC alone works only if you know the functional form of the distribution
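A minimal sketch of this, assuming an Exponential(2) target, chosen because its inverse CDF has a simple closed form:

# Inverse-transform sampling: draw u ~ Uniform(0,1) and push it through the
# inverse CDF of the target; for Exponential(rate) that is -log(1 - u)/rate
u <- runif(10000)
x <- -log(1 - u) / 2
hist(x, breaks = 50, freq = FALSE)
curve(dexp(x, rate = 2), add = TRUE)  # overlay the true density as a check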
Markov-Chains
- Directed graph
- Nodes represent states
- Edges represent transitions
- Edge weights are transition probabilities
- Probability of being in a specific state depends only on the last state (see the sketch below)
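A minimal sketch of a two-state Markov chain; the transition matrix here is an arbitrary illustrative choice:

Tm <- matrix(c(0.9, 0.1,    # from state 1: stay w.p. 0.9, move w.p. 0.1
               0.5, 0.5),   # from state 2: 50/50
             nrow = 2, byrow = TRUE)
nSteps <- 10000
s <- rep(1, nSteps)
for (i in 2:nSteps) {
  s[i] <- sample(1:2, 1, prob = Tm[s[i - 1], ])  # next state depends only on the current one
}
table(s) / nSteps  # long-run proportions approach the chain's stationary distribution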
Markov-Chain Monte-Carlo
- The Markov chain is aperiodic and generative
- States are the previously sampled values
- Transition probabilities are generated from the state value
- The next state is either the same or generated from the current state
Acceptance/Rejection algorithm
- Allows samples to be accepted preferentially from areas of higher probability
- MCMC without this = random walk
- Metropolis-Hastings is by far the most popular algorithm
- Not to be confused with 'proposal algorithms'
Metropolis-Hastings
- General case of many different algorithms
- Metropolis
- Gibbs Sampler
- Hamiltonian MCMC
- …
Initialization
- f(x) is a density function proportional to the target distribution (i.e., it doesn't have to be normalized)
- A. choose a starting value x_0
- B. choose an (arbitrary) proposal distribution q
- Poisson/normal/uniform are common -> "Metropolis algorithm"
- q perturbs the current state of the chain
The Algorithm
for i in 1:nSamples
- draw a proposal x_new ∼ q(x | x_{t-1})
- calculate the acceptance ratio A, the probability of accepting x_new:
$$A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}$$
- if the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm
- generate u ∼ Uniform(0,1)
- if u < min(1, A), accept: x_t = x_new; otherwise stay: x_t = x_{t-1}
Choosing a Proposal Distribution
- Different proposal distributions can be more efficient at exploring the target distribution
- Bimodal distribution example - may need way more samples to converge
- Consider model constraints/bounds
- Posterior distribution for a variance parameter: samples can never be less than zero -> use an asymmetric proposal distribution (sketched below)
- Consider shape of target distribution
- Highly kurtotic -> should use a proposal distribution with higher variance
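A minimal sketch of the asymmetric case, assuming a Gamma(2,1) density as a stand-in for a variance posterior (an illustrative target): a lognormal random-walk proposal keeps every draw positive, but because it is asymmetric the Hastings correction must be kept in the acceptance ratio.

nSamp <- 10000
x <- rep(1, nSamp + 1)
for (i in 1:nSamp) {
  P <- rlnorm(1, meanlog = log(x[i]), sdlog = 0.5)  # proposal is always > 0
  A <- ( dgamma(P, shape = 2, rate = 1) / dgamma(x[i], shape = 2, rate = 1) ) *
       ( dlnorm(x[i], meanlog = log(P),    sdlog = 0.5) /
         dlnorm(P,    meanlog = log(x[i]), sdlog = 0.5) )  # Hastings term
  x[i + 1] <- if (runif(1) < A) P else x[i]
}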
Basic Examples
Example Code: Metropolis - Normal Distribution, few samples
nSamp <- 100
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (symmetric proposal, so no Hastings term)
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: true density, sampled density, and trace
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], 0.01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Example: Metropolis - Normal Distribution, many samples
nSamp <- 100000
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (symmetric proposal, so no Hastings term)
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: true density, sampled density, and trace
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], 0.01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Example: Metropolis - Complex Bayes (2 logN observations, exponential prior)
nSamp <- 100000
#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3
#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, mean = x[i], sd = 1)
  #ACCEPTANCE RATIO (unnormalized target in numerator and denominator)
  A <- ( dexp(P)^2 * dlnorm(P, meanlog = 4, sdlog = 2) ) /
       ( dexp(x[i])^2 * dlnorm(x[i], meanlog = 4, sdlog = 2) )
  if (runif(1) > A) {  #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]   #REJECT NEW, KEEP OLD
  } else {
    x[i + 1] <- P      #ACCEPT NEW
  }
}
#PLOTTING: sampled density and trace (no closed-form true density to overlay)
yl <- c(-5, 5)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 2,
                 1, 2, 2,
                 1, 2, 2), nrow = 3, byrow = TRUE)
layout(lmat)
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
Example Output
Issues
Starting in a state of extremely low probability
- Solution - discard the burn-in (sketched below, together with thinning)
- Burn-in - the time the chain takes to stabilize
- can be arbitrary - make sure to check the trace across iterations
- generally 1000-5000 samples
Sparse distribution areas
- Solutions
- Larger sample size
- Repeat with a range of different initial values (sketched below)
- Use a better suited proposal distribution/algorithm
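A minimal sketch of the multiple-starts check, wrapping the normal-target sampler from above in a hypothetical run_chain() helper: if chains started far apart settle into the same region, convergence is plausible.

run_chain <- function(x0, nSamp = 5000) {  # hypothetical helper, not from the notes
  x <- rep(x0, nSamp + 1)
  for (i in 1:nSamp) {
    P <- rnorm(1, mean = x[i], sd = 1)
    A <- dnorm(P) / dnorm(x[i])
    x[i + 1] <- if (runif(1) < A) P else x[i]
  }
  x
}
chains <- lapply(c(-10, 0, 10), run_chain)  # three dispersed starting values
matplot(do.call(cbind, chains), type = "l", lty = 1, ylab = "x")  # overlaid traces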
Autocorrelation of samples from Markov Chain
- Solution - thinning (see the sketch below)
- e.g., keep every 3rd sample, discard the rest
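A minimal sketch combining burn-in removal and thinning, assuming x holds a raw chain from one of the examples above; the 1000-sample burn-in and keep-every-3rd thinning are illustrative choices:

burnIn <- 1000
x.kept <- x[-(1:burnIn)]                          # discard burn-in
x.kept <- x.kept[seq(1, length(x.kept), by = 3)]  # thin: keep every 3rd sample
acf(x.kept)                                       # check the remaining autocorrelation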
Multivariate distributions
Curse of Dimensionality
- Solutions
- Use Gibbs sampler
- Use Hamiltonian MCMC
Proposal algorithms
Metropolis
- Proposal is simply drawn from a symmetric distribution, such as a uniform or normal centered on the current state
- Hastings term = 1
- Discovered before Metropolis-Hastings, so the naming is confusing
Gibbs sampler
When to use
- When you know the posterior distribution (required)
- With multivariate distributions -> breaks the proposal down into smaller proposals, one per dimension
- The conditional distribution of each variable is known and easy to sample from (see the sketch below)
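A minimal sketch, assuming the target is a standard bivariate normal with correlation rho = 0.8 (an illustrative choice): each coordinate is drawn from its full conditional in turn, and there is no accept/reject step.

nSamp <- 10000
rho <- 0.8
x <- y <- rep(0, nSamp)
for (i in 2:nSamp) {
  # full conditionals of a standard bivariate normal are themselves normal
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
plot(x, y, pch = 20, cex = 0.3)  # samples trace out the elliptical target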
Hamiltonian (a variant of Metropolis-Hastings)
- Works extremely well for Bayesian problems
- Only the proposal steps (1 & 2) differ; the accept/reject step is the same
When to use
- Continuous distributions (required)
- Whenever possible…
- Stan
Benefits
- Reduces correlation between sampled states
- Converges much faster - doesn't get stuck the way random-walk proposals can (e.g., on torus-shaped distributions)
Problems
- U-turns - the simulated trajectory can double back on itself, so proposals end up very close to the current state
- Use the "No-U-Turn Sampler" (NUTS) variant in Stan
- Multi-modal distributions
- Can't be used with discrete distributions
Intuition
Use quantum particle physics as an analogy
1. Think of the probability density as an inverted energy well
- Hockey puck analogy
- The particle will spend more time in the bottom of the well (lower energy states)
- Take advantage of this in order to be more efficient (draw more proposals from high-probability areas)
2. Energy states are discrete
- If a particle absorbs energy, it can jump states
- The particle is constantly absorbing and radiating energy
Summary of energy states -> canonical distribution
- $$p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when the Boltzmann constant = 1}$$
- (Gibbs canonical distribution)
Take advantage of the energy landscape in order to better explore it
- Using joint distributions
Every time we accept a new proposal of p, the system is at a different energy level
Relation of the Hamiltonian to a posterior density
- $$H(\theta,p) = U(\theta) + K(p)$$, with potential energy $$U(\theta) = -\log[p(x|\theta)\,p(\theta)]$$ (the negative log posterior, up to a constant) and kinetic energy $$K(p) = \frac{p^2}{2}$$
Proposals
- p ∼ N(0,1)
- θ from the path of the particle through both dimensions of (θ, p) space
Solve for the path of the particle from U & K
$$\frac{dp}{dt}=-\frac{\partial H}{\partial \theta}$$
$$\frac{d\theta}{dt}=\frac{\partial H}{\partial p}$$
- Integrate continuously, approximated by small enough discrete steps (the leapfrog integrator)
- For t amount of time
- Flip the sign of the momentum after solving the last step of the path (this makes the proposal reversible)
- The endpoint gives the proposals θ_new and p_new
Acceptance ratio
$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}$$
- Paths are deterministic
- so the forward proposal term = 1
- and, because of the momentum flip, the reverse term = 1 as well (not intuitive)
$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}$$
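A minimal sketch of the whole procedure in one dimension, assuming a standard normal target (so U(θ) = θ²/2 and dU/dθ = θ); the step size and path length are illustrative, untuned choices:

U  <- function(theta) theta^2 / 2      # potential energy = -log target (up to a constant)
dU <- function(theta) theta            # gradient of U
eps <- 0.1; L <- 20                    # leapfrog step size and number of steps
nSamp <- 10000
x <- rep(0, nSamp + 1)
for (i in 1:nSamp) {
  theta <- x[i]
  p <- rnorm(1)                        # fresh momentum p ~ N(0,1)
  H0 <- U(theta) + p^2 / 2             # energy at the start of the path
  p <- p - eps / 2 * dU(theta)         # leapfrog: opening half step in momentum
  for (s in 1:L) {
    theta <- theta + eps * p           # full step in position
    if (s < L) p <- p - eps * dU(theta)
  }
  p <- p - eps / 2 * dU(theta)         # closing half step in momentum
  p <- -p                              # flip momentum for reversibility
  A <- exp(H0 - (U(theta) + p^2 / 2))  # acceptance ratio = exp(-change in H)
  x[i + 1] <- if (runif(1) < A) theta else x[i]
}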
Stan
- Super powerful once you know the syntax (feels like R)
- Full Bayesian inference and maximum likelihood estimation
- Input your model and parameters into the appropriate blocks
- Everything is compiled to C++ under the hood
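A minimal sketch of the workflow from R, assuming the rstan package is installed, with a toy normal-mean model and made-up data:

library(rstan)
model_code <- "
data { int<lower=0> N; vector[N] y; }
parameters { real mu; }
model {
  mu ~ normal(0, 10);  // prior
  y ~ normal(mu, 1);   // likelihood
}
"
dat <- list(N = 5, y = c(1.2, 0.8, 1.5, 0.9, 1.1))  # made-up observations
fit <- stan(model_code = model_code, data = dat, chains = 2, iter = 2000)  # NUTS by default
print(fit)  # posterior summary for mu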
Summary
- MCMC is (mostly) easy
- Metropolis-Hastings is the general case
- Proposal algorithm can vary depending on your needs
- To ensure robust sampling:
- Check your sampling, update your procedure
- Use large sample sizes
- Apply thinning
- Discard burn-in
- Use the correct proposal algorithm for the job
- Hamiltonian MCMC is fantastic for moderate to complex models
New
Importance sampling
Tempered Gibbs sampling
Tempering
exploration vs exploitation
asymptotic variance
single and double flip update