# MCMC

## MCMC

### What is MCMC

The biggest problem in probabilistic modeling is solving complex integrals

• MCMC = a method of sampling
• Can be used to estimate moments of a distribution, contours, etc.
• Doesn't need the functional form of the posterior
• The target distribution doesn't need to be normalized
• Alternatives

• Direct simulation (simple distributions only)
• Simple Bayesian problems -> use a conjugate prior (if the posterior shape is reasonable)

These alternatives are more desirable, but generally not available for hierarchical models
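As a concrete illustration of the conjugate-prior route, a sketch in R: with a Beta prior on a Binomial success probability the posterior is available in closed form, no sampling needed (the Beta-Binomial pair and the numbers are my illustrative choices, not from these notes):

```r
# Conjugate-prior alternative: Beta prior + Binomial likelihood
# gives a Beta posterior in closed form -- no MCMC required.
prior_a <- 2; prior_b <- 2        # Beta(2, 2) prior on the success probability
successes <- 7; trials <- 10      # (made-up) observed data

post_a <- prior_a + successes             # Beta posterior parameters
post_b <- prior_b + trials - successes

post_mean <- post_a / (post_a + post_b)   # exact posterior mean
post_mean                                  # 9/14, about 0.643
```

Hierarchical models rarely admit such closed forms, which is why MCMC is needed.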

### Intuition

• It's simpler to implement than you may think.

#### Monte Carlo

• A method of obtaining samples from a distribution
• Generate a number from the uniform distribution
• Transform the sample (transformation theory)
• MC alone works only if you know the functional form of the distribution
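The uniform-then-transform recipe above can be sketched in R with the inverse-CDF method (the Exponential target is my illustrative choice):

```r
# Inverse-CDF transform sampling: push Uniform(0,1) draws through
# the quantile function of a known distribution.
set.seed(1)
u <- runif(10000)            # step 1: uniform samples
x <- qexp(u, rate = 2)       # step 2: transform to Exponential(rate = 2)
mean(x)                      # should be close to the true mean 1/rate = 0.5
```

This is exactly where plain Monte Carlo stops working for Bayesian posteriors: without a known quantile (or density) function there is nothing to transform through.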

#### Markov-Chains

• Directed graph
• Nodes represent states
• Edges represent transitions
• Edge weights are transition probabilities
• Probability of being in a specific state depends only upon the last state
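The bullets above can be illustrated with a tiny simulated chain in R (the two-state transition matrix is a made-up example):

```r
# A two-state Markov chain: rows of P hold the transition
# probabilities out of each state (each row sums to 1).
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

set.seed(42)
state <- 1
visits <- c(0, 0)
for (i in 1:10000) {
  # the next state depends only on the current state (Markov property)
  state <- sample(1:2, 1, prob = P[state, ])
  visits[state] <- visits[state] + 1
}
visits / sum(visits)   # approaches the stationary distribution (5/6, 1/6)
```

The long-run fractions match the stationary distribution solving $$\pi = \pi P$$, which here is (5/6, 1/6); MCMC works by building a chain whose stationary distribution is the target.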

#### Markov-Chain Monte-Carlo

• The Markov chain is aperiodic and generative
• States are previous proposals
• Transition probabilities are generated from the state value
• The next possible state is either the same state or generated from the current state

### Acceptance/Rejection algorithm

• Allows samples to be accepted from areas of higher probability
• MCMC without this = a random walk
• Metropolis-Hastings is by far the most popular algorithm
• Not to be confused with 'proposal algorithms'

### Metropolis-Hastings

• General case of many different algorithms
• Metropolis
• Gibbs Sampler
• Hamiltonian MCMC

#### Initialization

• $$f(x)$$ is a density function proportional to the target distribution (i.e., it doesn't have to be normalized)
• e.g., a posterior density
• A. choose a starting value $$x_0$$
• B. choose an (arbitrary) proposal distribution $$q$$
• normal/uniform (symmetric) proposals are common -> "Metropolis Algorithm"
• perturbs the current state of the chain

#### The Algorithm

• for i in 1:nSamples
1. Draw $$x_{new} \sim q(x_{t-1})$$
2. Calculate the acceptance ratio $$A$$, the probability of accepting $$x_{new}$$:

$A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}$
If the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm

3. Generate $$u \sim \text{Uniform}(0,1)$$
• if $$u \le A$$,
• accept $$x_{new}$$
• else,
• reuse $$x_{t-1}$$

#### Choosing a Proposal Distribution

• Different proposal distributions can be more efficient at exploring the target distribution
• Bimodal distribution example - may need far more samples to converge
• Consider model constraints/bounds
• Posterior of a variance parameter: samples can never be less than zero -> use an asymmetric proposal distribution
• Consider the shape of the target distribution
• Highly kurtotic -> use a proposal distribution with higher variance
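To make the proposal-width trade-off concrete, here is a small R experiment (a sketch; the N(0,1) target, the three widths, and the iteration budget are arbitrary choices of mine):

```r
# How proposal width affects the Metropolis acceptance rate
# for a standard normal target with a symmetric normal proposal.
accept_rate <- function(prop_sd, n = 20000) {
  x <- 0
  accepted <- 0
  for (i in 1:n) {
    prop <- rnorm(1, x, prop_sd)
    # symmetric proposal, so A is just the target-density ratio
    if (runif(1) <= dnorm(prop) / dnorm(x)) {
      x <- prop
      accepted <- accepted + 1
    }
  }
  accepted / n
}

set.seed(7)
accept_rate(0.1)  # near 1: tiny steps, slow exploration
accept_rate(2.5)  # moderate acceptance, better mixing
accept_rate(50)   # near 0: most proposals land far in the tails
```

Neither extreme explores efficiently: tiny steps accept almost everything but crawl, huge steps are almost always rejected.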

#### Basic Examples

##### Example Code: Metropolis - Normal Distribution, few samples

    nSamp <- 100

    # INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    # ALGORITHM
    for (i in 1:nSamp) {

      # PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      # ACCEPTANCE RATIO
      A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)

      if (runif(1) > A) {    # GENERATE UNIFORM PROBABILITY
        x[i + 1] <- x[i]     # REJECT NEW, KEEP OLD
      } else {
        x[i + 1] <- P        # ACCEPT NEW
      }
    }

    # PLOTTING
    yl <- c(-5, 5)
    xseq <- seq(yl[1], yl[2], .01)
    dtrue <- dnorm(xseq, mean = 0, sd = 1)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- c(1, 2, 3, 3,
              1, 2, 3, 3,
              1, 2, 3, 3)
    lmat <- matrix(lmat, nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')

###### Example Output

##### Example: Metropolis - Normal Distribution, many samples

    nSamp <- 100000

    # INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    # ALGORITHM
    for (i in 1:nSamp) {

      # PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      # ACCEPTANCE RATIO
      A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)

      if (runif(1) > A) {    # GENERATE UNIFORM PROBABILITY
        x[i + 1] <- x[i]     # REJECT NEW, KEEP OLD
      } else {
        x[i + 1] <- P        # ACCEPT NEW
      }
    }

    # PLOTTING
    yl <- c(-5, 5)
    xseq <- seq(yl[1], yl[2], .01)
    dtrue <- dnorm(xseq, mean = 0, sd = 1)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- c(1, 2, 3, 3,
              1, 2, 3, 3,
              1, 2, 3, 3)
    lmat <- matrix(lmat, nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')

###### Example Output

##### Example: Metropolis - Complex Bayes (2 logN observations, exponential prior)

    nSamp <- 100000

    # INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    # ALGORITHM
    for (i in 1:nSamp) {

      # PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      # ACCEPTANCE RATIO: two log-normal likelihood terms times an exponential prior
      A <- (dlnorm(P, meanlog = 4, sdlog = 2)^2 * dexp(P)) /
           (dlnorm(x[i], meanlog = 4, sdlog = 2)^2 * dexp(x[i]))

      if (runif(1) > A) {    # GENERATE UNIFORM PROBABILITY
        x[i + 1] <- x[i]     # REJECT NEW, KEEP OLD
      } else {
        x[i + 1] <- P        # ACCEPT NEW
      }
    }

    # PLOTTING (no closed-form 'true' density to compare against here)
    yl <- c(-5, 5)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- c(1, 2, 2,
              1, 2, 2,
              1, 2, 2)
    lmat <- matrix(lmat, nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')

###### Example Output

#### Issues

##### Starting in a state of extremely low probability
• Solution - discard the burn-in
• Burn-in - the time the chain takes to stabilize
• can be chosen somewhat arbitrarily - check the trace of the iterations
• generally 1000-5000 samples
##### Sparse distribution areas
• Solutions
• Larger sample size
• Repeat with a range of different initial values
• Use a better suited proposal distribution/algorithm
##### Autocorrelation of samples from the Markov chain
• Solution - thinning
• e.g., keep every 3rd sample, discard the others
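A minimal R sketch of both post-processing fixes, discarding burn-in and thinning (the fake chain, cutoff, and thinning factor are illustrative choices):

```r
# Post-processing a raw chain: drop the burn-in, then thin by
# keeping every 3rd sample to reduce autocorrelation.
set.seed(1)
chain <- cumsum(rnorm(10000))   # stand-in for raw MCMC output

burn_in <- 1000
kept <- chain[(burn_in + 1):length(chain)]       # discard burn-in
thinned <- kept[seq(1, length(kept), by = 3)]    # keep every 3rd sample

length(thinned)   # 3000 samples remain
```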
##### Multivariate distributions

Curse of Dimensionality

• Solutions
• Use Gibbs sampler
• Use Hamiltonian MCMC

### Proposal algorithms

#### Metropolis

• The proposal is simply drawn from a uniform or normal distribution
• Hastings term = 1
• Developed before Metropolis-Hastings, so the naming is confusing

#### Gibbs sampler

##### When to use
• When you can derive the conditional posterior distributions (required)
• With multivariate distributions -> breaks the proposal down into smaller proposals, one per dimension
• The conditional distribution of each variable is known and easy to sample from
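A minimal Gibbs sampler sketch for a case where the full conditionals are known: a bivariate normal with correlation rho, where each conditional is a univariate normal (the target and rho = 0.8 are my illustrative choices):

```r
# Gibbs sampler for a bivariate normal with correlation rho.
# Full conditionals: x | y ~ N(rho * y, 1 - rho^2) and symmetrically for y,
# so each "proposal" is a direct draw from a known distribution.
rho <- 0.8
n <- 20000
x <- y <- numeric(n)

set.seed(3)
for (i in 2:n) {
  x[i] <- rnorm(1, rho * y[i - 1], sqrt(1 - rho^2))  # sample x | y
  y[i] <- rnorm(1, rho * x[i],     sqrt(1 - rho^2))  # sample y | x
}

cor(x, y)   # close to the true correlation 0.8
```

Note there is no accept/reject step: every draw is accepted, because each sub-proposal comes from the exact conditional distribution.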

#### Hamiltonian (variation of metropolis Hastings)

• Super great for Bayesian problems
• Only steps 1 & 2 are different
• … but very different
##### When to use
• Continuous distributions (required)
• Whenever possible…
• Stan
##### Benefits
• Reduces correlation between sampled states
• Converges much faster - far less prone to stuck states (e.g., the torus/donut distribution)
• “less random” proposals
##### Problems
• U-turns - trajectories can double back, so samples begin to look very similar after enough steps
• Use the “No U-Turn sampler” (NUTS) variant in Stan
• easier to use anyways
• Multi-modal distributions
• Still the best solution
• Can't be used with discrete distributions
##### Intuition

Use quantum particle physics as an analogy

###### 1. Think of probability density as an inverse energy well
• Hockey puck analogy
• A particle will spend more time at the bottom of the well (lower energy states)
• Take advantage of this to be more efficient (draw more proposals from high-probability areas)
###### 2. Energy states
• …are discrete
• If a particle absorbs energy, it can jump states
• The particle is constantly absorbing and radiating
• Summary of energy states -> canonical distribution
• $p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when the Boltzmann constant} = 1$
• (Gibbs canonical distribution)
• Take advantage of the energy landscape in order to explore it more efficiently
• Using joint distributions
• Every time we accept a new proposal of the momentum $$P$$, the system is at a different energy level
###### Relation of the Hamiltonian to a posterior density
• Some (perhaps) familiar equations
• $p(E) \propto e^{-\frac{E(\theta,P)}{T}}\\$
• $E(\theta,P)= H \equiv U(\theta) + K(P) = \text{const.}\\$
• $K(P)=\sum^d_{i=1}\frac{P_i^2}{2m}\\$
• $U(\theta) = -\log (p(x|\theta) p(\theta)) = \text{NLP (negative log posterior)}\\$
• when mass and $$T$$ = 1,
• $p(\theta,P) \propto p(x|\theta)p(\theta)e^{-\frac{P^2}{2}}\\$
• $\equiv p(x|\theta)p(\theta)N(P|0,1)\\$
• Note:
• $\int p(\theta,P)\ dP = \frac{1}{Z} p(x|\theta)p(\theta) \int N(P|0,1)\ dP\\$
• $p(\theta|x)=\frac{1}{Z}p(x|\theta)p(\theta)\\$
• So, $$\theta$$ and $$P$$ can be sampled independently
##### Proposals
• $$P \sim N(0,1)$$
• $$\theta$$ from the path of the particle in both dimensions
###### Solve for the path of the particle from U & K
• $\frac{dP}{dt}=-\frac{\partial H}{\partial \theta}$
• $\frac{d\theta}{dt}=+\frac{\partial H}{\partial P}$
• Integrate continuously in small enough steps
• for $$t$$ units of time
• Flip the sign of the momentum after solving for the last step of the path
• This generates the proposals $$\theta_{new}$$ and $$P_{new}$$
Leapfrog integrator
• symplectic -> oscillates around the exact solution, keeping the error bounded; built exactly for Hamiltonian systems
• for each step of size $$\epsilon$$
• $$P_{n+0.5} = P_n - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_n)$$
• $$\theta_{n+1} = \theta_n + \epsilon P_{n+0.5}$$
• $$P_{n+1} = P_{n+0.5} - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_{n+1})$$
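The three leapfrog updates above can be sketched in R for a toy target, $$U(\theta) = \theta^2/2$$ (the standard normal negative log density); the step size and step count are arbitrary illustrative values:

```r
# Leapfrog integration of Hamiltonian dynamics for U(theta) = theta^2 / 2,
# so dU/dtheta = theta, with kinetic energy K(p) = p^2 / 2.
grad_U <- function(theta) theta
H <- function(theta, p) theta^2 / 2 + p^2 / 2

leapfrog <- function(theta, p, eps, steps) {
  p <- p - (eps / 2) * grad_U(theta)       # initial half step for momentum
  for (s in 1:steps) {
    theta <- theta + eps * p               # full step for position
    if (s < steps) {
      p <- p - eps * grad_U(theta)         # full momentum steps in between
    }
  }
  p <- p - (eps / 2) * grad_U(theta)       # final half step for momentum
  list(theta = theta, p = -p)              # flip the momentum sign at the end
}

out <- leapfrog(theta = 1, p = 0.5, eps = 0.01, steps = 100)
# symplectic: the Hamiltonian is (almost) conserved along the path
abs(H(out$theta, out$p) - H(1, 0.5))
```

The near-conservation of H is what makes HMC acceptance rates so high: the proposal lands at (almost) the same energy level it started from.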
##### Acceptance ratio
• $A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}$
• Paths are deterministic and, after the momentum flip, reversible
• so the forward and reverse proposal terms are equal
• and the proposal ratio = 1 (not intuitive)
• $\\ A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}$
##### Stan
• Super powerful once you know the syntax (feels like R)
• Full Bayesian inference and maximum likelihood estimation
• Input your model and parameters into the appropriate program blocks
• Everything is compiled to C++

### Summary

• MCMC is (mostly) easy
• MCMC Metropolis Hastings is the general case
• Proposal algorithm can vary depending on your needs
• To ensure robust sampling:
• Use large sample sizes
• Apply thinning
• Use the correct proposal algorithm for the job
• Hamiltonian MCMC is fantastic for moderate to complex models
• Worth checking out Stan

## New

Importance sampling
Tempered Gibbs sampling
Tempering
exploration vs exploitation
asymptotic variance

single and double flip update

Date: 2018-08-14

Created: 2020-07-25 Sat 14:03