2018-08-14

# MCMC

## What is MCMC

The biggest problem in probabilistic modeling is solving complex integrals

• MCMC = a method of sampling
• Can be used to estimate moments of a distribution, contours, etc.
• Doesn't need the functional form of the posterior
• Target distribution doesn't need to be normalized
• Alternatives
• Direct simulation (simple distributions only)
• Simple Bayesian -> use a conjugate prior (if the posterior shape is reasonable)
These alternatives are more desirable, but generally not available for hierarchical models
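As a concrete instance of the conjugate-prior alternative, a Beta prior on a binomial success probability gives a closed-form posterior with no sampling required (the hyperparameters and data below are illustrative):

```r
# Conjugate Beta-Binomial update: a Beta(a,b) prior plus k successes in
# n trials gives a Beta(a+k, b+n-k) posterior in closed form.
a <- 2; b <- 2              # prior hyperparameters (illustrative)
k <- 7; n <- 10             # observed data: 7 successes in 10 trials
postA <- a + k
postB <- b + n - k
postMean <- postA / (postA + postB)   # analytic posterior mean
```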

## Intuition

• It's simpler to implement than you may think.

### Monte Carlo

• Method of obtaining samples from a distribution
• Generate a number from the uniform distribution
• Transform sample (transformation theory)
• MC alone works only if you know functional form of the distribution
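A minimal sketch of the transformation idea, using R's built-in quantile function for an exponential target (the rate is arbitrary):

```r
# Inverse-transform sampling: push uniform draws through the target's
# quantile function. This requires knowing the functional form.
set.seed(1)
u <- runif(10000)            # step 1: draws from the uniform distribution
x <- qexp(u, rate = 2)       # step 2: transform to Exponential(rate = 2)
mean(x)                      # close to the true mean 1/rate = 0.5
```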

### Markov-Chains

• Directed graph
• Nodes represent states
• Edges represent transitions
• Edge weights are transition probabilities
• The probability of being in a specific state depends only upon the last state
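The bullets above can be sketched as a small simulation; the two-state transition matrix here is made up for illustration:

```r
# A two-state Markov chain: each row of Tm gives the transition
# probabilities out of one state; the next state depends only on the
# current state.
set.seed(1)
Tm <- matrix(c(0.9, 0.1,     # from state 1
               0.5, 0.5),    # from state 2
             nrow = 2, byrow = TRUE)
nStep <- 10000
s <- integer(nStep)
s[1] <- 1
for (i in 2:nStep){
  s[i] <- sample(1:2, size = 1, prob = Tm[s[i-1], ])
}
mean(s == 1)   # long-run fraction in state 1, near the stationary value 5/6
```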

### Markov-Chain Monte-Carlo

• The Markov chain is aperiodic and generative
• states are previous proposals
• transition probabilities are generated from the state value
• the next possible state is the same or generated from the current state

## Acceptance/Rejection algorithm

• Allows samples to be accepted from areas of higher probability
• MCMC without this = random walk
• Metropolis-Hastings is by far the most popular algorithm
• Not to be confused with 'proposal algorithms'

## Metropolis-Hastings

• General case of many different algorithms
• Metropolis
• Gibbs Sampler
• Hamiltonian MCMC

### Initialization

• f(x) is a density function proportional to the target distribution (i.e. it doesn't have to be normalized)
• e.g. a posterior density
• A. choose a starting value x0
• B. choose an (arbitrary) proposal distribution q
• Poisson/normal/uniform are common -> "Metropolis algorithm"
• perturbs the current state of the chain

### The Algorithm

• for i in 1:nSamples

1. $x_{new} \sim q(x|x_{t-1})$

2. Calculate the acceptance ratio A - the probability of accepting xnew

$$A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}$$
if the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm

3. Generate u ∼ Uniform(0,1)

• if u ≤ A,
• accept xnew
• else,
• reuse xt − 1

### Choosing a Proposal Distribution

• Different proposal distributions can be more efficient at exploring the target distribution
• Bimodal distribution example - may need way more samples to converge
• Consider model constraints/bounds
• Posterior distribution for a variance parameter: samples can never be less than zero -> use an asymmetric proposal distribution
• Consider the shape of the target distribution
• Highly kurtotic -> should use a proposal distribution with higher variance
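A sketch of the bounded-support case: a strictly positive target (here Exponential(1), standing in for a variance-like parameter) sampled with a log-normal proposal that never goes below zero. The asymmetric proposal requires the full Hastings correction; all tuning values are illustrative:

```r
# Metropolis-Hastings with an asymmetric (log-normal) proposal for a
# positive-only target f(x) = dexp(x). The q(old|new)/q(new|old) term
# must be kept because the proposal is not symmetric.
set.seed(1)
nSamp <- 50000
s <- 0.5                                           # proposal sdlog (tuning choice)
x <- numeric(nSamp + 1)
x[1] <- 1
for (i in 1:nSamp){
  P <- rlnorm(1, meanlog = log(x[i]), sdlog = s)   # proposal is always > 0
  A <- ( dexp(P) / dexp(x[i]) ) *
       ( dlnorm(x[i], meanlog = log(P), sdlog = s) /
         dlnorm(P, meanlog = log(x[i]), sdlog = s) )   # Hastings term
  x[i+1] <- if (runif(1) <= A) P else x[i]
}
c(min(x), mean(x))   # all samples positive; mean near E[Exp(1)] = 1
```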

### Basic Examples

#### Example Code: Metropolis - Normal Distribution, few samples

nSamp <- 100

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){

  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)

  #ACCEPTANCE RATIO
  A = ( dnorm(P,mean=0,sd=1) )/( dnorm(x[i],mean=0,sd=1) )

  if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i]              #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P                 #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
xseq <- seq(yl[1],yl[2],.01)
dtrue <- dnorm(xseq,mean=0,sd=1)

t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,3,3,
          1,2,3,3,
          1,2,3,3)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)

plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
##### Example Output

#### Example Code: Metropolis - Normal Distribution, many samples

nSamp <- 100000

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){

  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)

  #ACCEPTANCE RATIO
  A = ( dnorm(P,mean=0,sd=1) )/( dnorm(x[i],mean=0,sd=1) )

  if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i]              #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P                 #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
xseq <- seq(yl[1],yl[2],.01)
dtrue <- dnorm(xseq,mean=0,sd=1)

t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,3,3,
          1,2,3,3,
          1,2,3,3)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)

plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
##### Example Output

#### Example Code: Metropolis - Complex Bayes (2 logN observations, exponential prior)

nSamp <- 100000

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){

  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)

  #ACCEPTANCE RATIO
  A = ( dexp(P)^2 * dlnorm(P,meanlog=4,sdlog=2) )/( dexp(x[i])^2 * dlnorm(x[i],meanlog=4,sdlog=2) )

  if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i]              #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P                 #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
#xseq <- seq(yl[1],yl[2],.01)
#dtrue <- dnorm(xseq,mean=0,sd=1)

t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,2,
          1,2,2,
          1,2,2)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)

#plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
##### Example Output

### Issues

#### Starting in a state of extremely low probability

• Solution - discard burn-in
• Burn-in - the time the chain takes to stabilize
• length can be arbitrary - check the trace plot of iterations
• generally 1000-5000
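A minimal sketch of discarding burn-in, using a fabricated chain whose first stretch drifts in from a bad starting value:

```r
# Discard the first `burnIn` iterations before summarizing the chain so
# the drift from a low-probability start doesn't bias the estimates.
set.seed(1)
x <- c(seq(50, 0, length.out = 1000),   # fake burn-in: drifting toward the mode
       rnorm(9000))                     # fake stationary samples
burnIn <- 1000
xKept <- x[-(1:burnIn)]                 # drop the first `burnIn` iterations
c(mean(x), mean(xKept))   # raw mean is biased upward; trimmed mean is near 0
```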

#### Sparse distribution areas

• Solutions
• Larger sample size
• Repeat with a range of different initial values
• Use a better suited proposal distribution/algorithm

#### Autocorrelation of samples from Markov Chain

• Solution - thinning
• keep every 3rd sample, discard others
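The thinning step is just subsetting the chain:

```r
# Thinning: keep every 3rd draw to weaken the autocorrelation between
# the samples that are retained.
x <- 1:9999                            # stand-in for a chain of draws
xThin <- x[seq(1, length(x), by = 3)]  # keep draws 1, 4, 7, ...
length(xThin)                          # chain is one third as long
```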

#### Multivariate distributions

Curse of Dimensionality

• Solutions
• Use Gibbs sampler
• Use Hamiltonian MCMC

## Proposal algorithms

### Metropolis

• Proposal is simply drawn from a uniform or normal distribution
• Hastings term = 1
• Discovered before Metropolis-Hastings, so the naming is confusing

### Gibbs sampler

#### When to use

• When the full conditional distributions are known (required)
• With multivariate distributions -> breaks the proposal down into smaller proposals according to dimension
• The conditional distribution of each variable must be known and easy to sample from
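A sketch of the idea for a bivariate normal with correlation rho, where both full conditionals are univariate normals (rho and the sample count are illustrative):

```r
# Gibbs sampling a bivariate standard normal with correlation rho:
# alternate exact draws from each full conditional,
#   x1 | x2 ~ N(rho*x2, 1-rho^2)  and  x2 | x1 ~ N(rho*x1, 1-rho^2),
# so every "proposal" is accepted.
set.seed(1)
rho <- 0.8
nSamp <- 20000
x1 <- numeric(nSamp)
x2 <- numeric(nSamp)
for (i in 2:nSamp){
  x1[i] <- rnorm(1, mean = rho * x2[i-1], sd = sqrt(1 - rho^2))
  x2[i] <- rnorm(1, mean = rho * x1[i],   sd = sqrt(1 - rho^2))
}
cor(x1, x2)   # approaches rho
```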

### Hamiltonian (variation of Metropolis-Hastings)

• Super great for Bayesian problems
• Only steps 1 & 2 are different
• … but very different

#### When to use

• Continuous distributions (required)
• Whenever possible…
• Stan

#### Benefits

• Reduces correlation between sampled states
• Converges much faster - immune to stuck states (torus distribution)
• "less random" proposals

#### Problems

• U-turns - trajectories can double back, so successive samples begin to look very similar
• Use the "No U-Turn Sampler" (NUTS) variant in Stan
• easier to use anyway
• Multi-modal distributions
• Still the best solution
• Can't be used with discrete distributions

#### Intuition

Use quantum particle physics as an analogy

##### 1. Think of probability density as an inverted energy well
• Hockey puck analogy
• A particle spends more time at the bottom of the well (lower energy states)
• Take advantage of this to be more efficient (draw more proposals from high-probability areas)
##### 2. Energy states
• …are discrete

• If a particle absorbs energy, it can jump states
• A particle is constantly absorbing and radiating
• Summary of energy states -> canonical distribution

• $$p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when Boltzmann constant = 1}$$
• (Gibbs canonical distribution)
• Take advantage of the energy landscape in order to explore it more efficiently

• Using joint distributions
• Every time we accept a new proposal of P, the system is at a different energy level

##### Relation of Hamiltonian to a posterior density
• Some (perhaps) familiar equations

• $$p(E) \propto e^{-\frac{E(\theta,P)}{T}}\\$$
• $$E(\theta,P)= H \equiv U(\theta) + K(P) = \text{const.}\\$$
• $$K(P)=\sum^d_{i=1}\frac{P_i^2}{2m}\\$$
• $$U(\theta) = -\log (p(x|\theta) p(\theta)) = \text{NLP}\\$$
• when mass and T = 1,

• $$p(\theta,P) \propto p(x|\theta)p(\theta)e^{-\frac{P^2}{2}}\\$$
• $$\equiv p(x|\theta)p(\theta)N(P|0,1)\\$$
• Note:

• $$\int p(\theta,P)\ dP = \frac{1}{z} p(x|\theta)p(\theta) \int N(P|0,1)\ dP\\$$
• $$p(\theta|x)=\frac{1}{z}p(x|\theta)p(\theta)\\$$
• So, θ and P can be sampled independently

#### Proposals

• p ∼ N(0,1)
• θ from path of the particle in both dimensions
##### Solve for path of particle from U & K
• $$\frac{dP}{dt}=-\frac{\partial H}{\partial \theta}$$

• $$\frac{d\theta}{dt}=\frac{\partial H}{\partial P}$$

• Integrate continuously, in small enough steps

• For t amount of time

• Flip the sign of the momentum after solving for the last step of the path

• Generate proposals θnew and Pnew

###### Leapfrog integrator
• symplectic -> oscillates around the exact solution with bounded error; built exactly for Hamiltonian systems

• for each step

• $P_{n+0.5} = P_n - \frac{e}{2} \frac{\partial U}{\partial \theta}(\theta_n)$

• $\theta_{n+1} = \theta_n + e P_{n+0.5}$

• $P_{n+1} = P_{n+0.5} - \frac{e}{2} \frac{\partial U}{\partial \theta}(\theta_{n+1})$

#### Acceptance ratio

• $$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}$$

• Paths are deterministic
• so the forward term = 1
• and, because the momentum flip makes each path reversible, the reverse term = 1 as well (not intuitive)
• $$\\ A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}$$
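Putting the pieces together, a sketch of one-dimensional Hamiltonian MC for a standard normal target, where U(θ) = θ²/2 so ∂U/∂θ = θ; the step size e and path length L are arbitrary tuning choices:

```r
# Hamiltonian Monte Carlo sketch for a N(0,1) target with unit mass:
# draw a fresh momentum, run the leapfrog integrator, then apply the
# Metropolis acceptance test with H = U + K.
set.seed(1)
U     <- function(th) th^2 / 2     # negative log density of N(0,1), up to a constant
gradU <- function(th) th
e <- 0.3                           # leapfrog step size (tuning choice)
L <- 15                            # leapfrog steps per proposal (tuning choice)
nSamp <- 20000
theta <- numeric(nSamp)
theta[1] <- 3                      # deliberately bad start
for (i in 2:nSamp){
  th <- theta[i-1]
  p  <- rnorm(1)                   # fresh momentum ~ N(0,1)
  thOld <- th; pOld <- p
  p <- p - (e/2) * gradU(th)       # initial half-step for momentum
  for (s in 1:L){
    th <- th + e * p                      # full position step
    if (s < L) p <- p - e * gradU(th)     # full momentum step (except at path end)
  }
  p <- p - (e/2) * gradU(th)       # final half-step for momentum
  A <- exp( U(thOld) - U(th) + pOld^2/2 - p^2/2 )   # exp(H_old - H_new)
  theta[i] <- if (runif(1) <= A) th else theta[i-1]
}
c(mean(theta), sd(theta))   # approaches 0 and 1
```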

#### Stan

• Super powerful once you know the syntax (feels like R)
• Full Bayesian inference and maximum likelihood estimation
• Input your model and parameters into the correct sections
• Everything is compiled to C++ under the hood

## Summary

• MCMC is (mostly) easy
• MCMC Metropolis Hastings is the general case
• Proposal algorithm can vary depending on your needs
• To ensure robust sampling:
• Use large sample sizes
• Apply thinning