# MCMC

## MCMC

### What is MCMC

The biggest problem in probabilistic modeling is solving complex integrals.

- MCMC = a method of sampling

- Can be used to estimate moments of a distribution, contours, etc.
- Doesn’t need the functional form of the posterior
- The target distribution doesn’t need to be normalized
Alternatives

- Direct simulation (simple distributions only)
- Simple Bayesian models -> use a conjugate prior (if the posterior shape is reasonable)

These alternatives are preferable when available, but generally are not for hierarchical models.
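
As a quick sketch of the conjugate-prior route (a hypothetical Beta-Binomial example, not from the notes above): the posterior is available in closed form, so it can be simulated directly with no MCMC.

```r
# Hypothetical Beta-Binomial example: a Beta prior is conjugate to a
# binomial likelihood, so the posterior is Beta in closed form -- no MCMC needed
set.seed(1)
a <- 2; b <- 2            # Beta(2, 2) prior
k <- 7; n <- 10           # data: 7 successes in 10 trials
post <- rbeta(10000, a + k, b + n - k)   # direct simulation from the Beta posterior
mean(post)                # should be close to (a + k) / (a + b + n) = 9/14
```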

### Intuition

- It is simpler to implement than you may think.

#### Monte Carlo

- Method of obtaining samples from a distribution

- Generate a number from the uniform distribution
- Transform the sample (transformation theory)
- MC alone works only if you know the functional form of the distribution
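
The two steps above (uniform draw, then transform) can be sketched with inverse-transform sampling; the Exponential(2) target here is an assumed example.

```r
# Inverse-transform sampling: uniform draws pushed through the inverse CDF
# of an Exponential(rate = 2) target (assumed example)
set.seed(1)
u <- runif(10000)          # step 1: generate uniform numbers on (0, 1)
x <- -log(1 - u) / 2       # step 2: transform via the inverse CDF of Exp(2)
mean(x)                    # should be close to 1/rate = 0.5
```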

#### Markov-Chains

- Directed graph

- Nodes represent states
- Edges represent transitions
- Edge weights are transition probabilities

- The probability of being in a specific state depends only on the last state
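
A two-state chain with an assumed transition matrix makes the graph picture concrete; the long-run state frequencies converge to the chain's stationary distribution.

```r
# Two-state Markov chain: rows of P give the transition probabilities out of
# each state (the matrix values are an assumed example)
set.seed(42)
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)
n <- 100000
state <- integer(n)
state[1] <- 1
for (i in 2:n) {
  # the next state depends only on the previous state (Markov property)
  state[i] <- sample(1:2, 1, prob = P[state[i - 1], ])
}
table(state) / n   # long-run fractions approach the stationary distribution (5/6, 1/6)
```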

#### Markov-Chain Monte-Carlo

- The Markov chain is aperiodic and generative

- States are previous proposals
- Transition probabilities are generated from the current state's value
- The next state is either the same state or one generated from the current state

### Acceptance/Rejection algorithm

- Allows samples to be accepted preferentially from areas of higher probability

- MCMC without this step = a random walk

- Metropolis-Hastings is by far the most popular acceptance algorithm

- Not to be confused with ’proposal algorithms’

### Metropolis-Hastings

- General case of many different algorithms

- Metropolis
- Gibbs Sampler
- Hamiltonian MCMC
- …


#### Initialization

- \(f(x)\) is a density function proportional to the target distribution (i.e. it doesn’t have to be normalized)

- e.g. a posterior density

- A. Choose a starting value \(x_0\)
- B. Choose an (arbitrary) proposal distribution \(q\)

- Poisson/normal/uniform proposals are common; a symmetric proposal gives the “Metropolis algorithm”
- \(q\) perturbs the current state of the chain

#### The Algorithm

- For i in 1:nSamples

- Draw \(x_{new} \sim q(x_{t-1})\)
- Calculate the acceptance ratio, i.e. the probability of accepting \(x_{new}\):

\[A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}\]

If the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm

- Generate \(u \sim \text{Uniform}(0,1)\)
- If \(u \le A\), accept \(x_{new}\)
- Else, reuse \(x_{t-1}\)

#### Choosing a proposal Distributions

- Different proposal distributions can be more efficient at exploring the target distribution

- Bimodal target example - may need far more samples to converge

- Consider model constraints/bounds

- e.g. the posterior of a variance parameter - samples are **never** less than zero -> use an asymmetric proposal distribution

- Consider the shape of the target distribution

- Highly kurtotic -> use a proposal distribution with higher variance

#### Basic Examples

##### Example Code: Metropolis - Normal Distribution, few samples

```r
nSamp <- 100

#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3

#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, x[i], sd = 1)
  #ACCEPTANCE RATIO
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) { #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]  #REJECT NEW, ACCEPT OLD
  } else {
    x[i + 1] <- P     #REJECT OLD, ACCEPT NEW
  }
}

#PLOTTING: true density, sampled density, and trace of the chain
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], .01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
```

###### Example Output

##### Example: Metropolis - Normal Distribution, many samples

```r
nSamp <- 100000

#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3

#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, x[i], sd = 1)
  #ACCEPTANCE RATIO
  A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)
  if (runif(1) > A) { #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]  #REJECT NEW, ACCEPT OLD
  } else {
    x[i + 1] <- P     #REJECT OLD, ACCEPT NEW
  }
}

#PLOTTING: true density, sampled density, and trace of the chain
yl <- c(-5, 5)
xseq <- seq(yl[1], yl[2], .01)
dtrue <- dnorm(xseq, mean = 0, sd = 1)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 3, 3,
                 1, 2, 3, 3,
                 1, 2, 3, 3), nrow = 3, byrow = TRUE)
layout(lmat)
plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
```

###### Example Output

##### Example: Metropolis - Complex Bayes (2 logN observation, exponential prior)

```r
nSamp <- 100000

#INITIALIZATION
x <- rep(0, nSamp + 1)
x[1] <- 3

#ALGORITHM
for (i in 1:nSamp) {
  #PROPOSAL BASED UPON PREVIOUS X
  P <- rnorm(1, x[i], sd = 1)
  #ACCEPTANCE RATIO (unnormalized target: exponential and lognormal factors)
  A <- (dexp(P)^2 * dlnorm(P, meanlog = 4, sdlog = 2)) /
       (dexp(x[i])^2 * dlnorm(x[i], meanlog = 4, sdlog = 2))
  if (runif(1) > A) { #GENERATE UNIFORM PROBABILITY
    x[i + 1] <- x[i]  #REJECT NEW, ACCEPT OLD
  } else {
    x[i + 1] <- P     #REJECT OLD, ACCEPT NEW
  }
}

#PLOTTING: sampled density and trace of the chain
yl <- c(-5, 5)
t <- 1:(nSamp + 1)
d.y <- density(x)
lmat <- matrix(c(1, 2, 2,
                 1, 2, 2,
                 1, 2, 2), nrow = 3, byrow = TRUE)
layout(lmat)
plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
```

###### Example Output

#### Issues

##### Starting in state of extremely low probability

- Solution - discard the burn-in

- Burn-in - the time the chain takes to stabilize
- The cutoff can be somewhat arbitrary - inspect the trace by iteration
- Generally 1000-5000 iterations

##### Sparse distribution areas

- Solutions

- Larger sample size
- Repeat with a range of different initial values
- Use a better-suited proposal distribution/algorithm

##### Autocorrelation of samples from Markov Chain

- Solution - thinning

- e.g. keep every 3rd sample, discard the others
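
Both fixes are one-liners in R; the stand-in chain `x`, burn-in length, and thinning interval below are assumed values for illustration.

```r
# Discarding burn-in and thinning a chain (assumed stand-in chain `x`)
set.seed(1)
x <- cumsum(rnorm(10000))                      # stand-in for MCMC draws
burnin <- 1000
kept <- x[-(1:burnin)]                         # drop the first 1000 iterations
thinned <- kept[seq(1, length(kept), by = 3)]  # keep every 3rd sample
length(thinned)   # 3000 samples remain
```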

##### Multivariate distributions

Curse of Dimensionality

- Solutions

- Use a Gibbs sampler
- Use Hamiltonian MCMC

### Proposal algorithms

#### Metropolis

- Proposal is simply drawn from a uniform or normal distribution
- Hastings term = 1
- Discovered before Metropolis-Hastings, so the naming is confusing

#### Gibbs sampler

##### When to use

- When you know the posterior distribution (required)
- With multivariate distributions -> breaks the proposal down into smaller proposals, one per dimension
- When the conditional distribution of each variable is known and easy to sample from
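
A standard illustration (not from the notes above) is a bivariate normal with correlation \(\rho\): each full conditional is itself normal, so Gibbs sampling alternates two easy univariate draws.

```r
# Gibbs sampler for a standard bivariate normal with correlation rho:
# the full conditionals are x|y ~ N(rho*y, 1 - rho^2), and symmetrically for y
set.seed(1)
rho <- 0.8
n <- 50000
x <- y <- numeric(n)
for (i in 2:n) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i],     sd = sqrt(1 - rho^2))
}
cor(x, y)   # should approach rho = 0.8
```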

#### Hamiltonian (variation of metropolis Hastings)

- Works extremely well for Bayesian problems
- Only the proposal steps differ from Metropolis-Hastings

- … but they are very different

##### When to use

- Continuous distributions (required)
- Whenever possible…
- It is the sampler behind Stan

##### Benefits

- Reduces correlation between sampled states
- Converges much faster - resistant to stuck states (e.g. a torus-shaped distribution)

- “less random” proposals

##### Problems

- U-turns - the trajectory can double back on itself, so samples begin to look very similar

- Use the “No-U-Turn Sampler” (NUTS) variant in Stan

- easier to use anyway

- Multi-modal distributions

- HMC is still often the best option available

- Can’t be used with discrete distributions

##### Intuition

Use quantum particle physics as an analogy

###### 1. Think of probability density as inverse energy well

- Hockey puck analogy

- The particle spends more time at the bottom of the well (lower energy states)
- Take advantage of this to be more efficient (draw more proposals from high-probability areas)

###### 2. Energy states

- …are discrete

- If the particle absorbs energy, it can jump states
- The particle is constantly absorbing and radiating energy

- The distribution over energy states -> the canonical distribution

- \[p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when Boltzmann constant = 1}\]
- (Gibbs canonical distribution)

- Take advantage of the energy landscape in order to explore it more efficiently

- Using joint distributions

- Every time we accept a new proposal of \(P\), the system is at a different energy level

###### Relation of Hamiltonian to a posterior density

- Some (perhaps) familiar equations

- \[p(E) \propto e^{-\frac{E(\theta,P)}{T}}\\\]
- \[E(\theta,P)= H \equiv U(\theta) + K(P) = \text{const.}\\\]
- \[K(P)=\sum^d_{i=1}\frac{P_i^2}{2m}\\\]
- \[U(\theta) = -\log (p(x|\theta) p(\theta)) = \text{NLP}\\\]

- when mass and \(T\) = 1,

- \[p(\theta,P) \propto p(x|\theta)p(\theta)e^{-\frac{P^2}{2}}\\\]

- \[\equiv p(x|\theta)p(\theta)\,N(P|0,1)\\\]

- Note:

- \[\int p(\theta,P)\ dP = \frac{1}{Z} p(x|\theta)p(\theta) \int N(P|0,1)\ dP\\\]
- \[p(\theta)=\frac{1}{Z}p(x|\theta)p(\theta)\\\]
- So, \(\theta\) and \(P\) can be sampled independently

##### Proposals

- \(P \sim N(0,1)\)
- \(\theta\) from the path of the particle through phase space

###### Solve for path of particle from U & K

- \[\frac{dP}{dt}=-\frac{\partial H}{\partial \theta}\]
- \[\frac{d\theta}{dt}=+\frac{\partial H}{\partial P}\]
- Integrate continuously, using small enough steps
- For \(t\) amount of time
- Flip the sign of the momentum after solving for the last step of the path
- This generates the proposals \(\theta_{new}\) and \(P_{new}\)

- Use a symplectic integrator (leapfrog) - it oscillates around the exact solution, keeps the error under control, and is built exactly for Hamiltonian systems
- For each step of size \(\epsilon\):

- \(P_{n+0.5} = P_n - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_n)\)
- \(\theta_{n+1} = \theta_n + \epsilon P_{n+0.5}\)
- \(P_{n+1} = P_{n+0.5} - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_{n+1})\)

##### Acceptance ratio

- \[A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}\]

- The paths are deterministic

- so the forward proposal term = 1, and
- the reverse proposal term = 1 as well (not intuitive)
- \[\\ A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}\]
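
Putting the leapfrog steps and this acceptance ratio together, a minimal HMC sketch for a standard-normal target (an assumed toy example, where \(U(\theta)=\theta^2/2\)) looks like:

```r
# Minimal Hamiltonian MC for a standard-normal target (assumed toy example):
# U(theta) = theta^2/2 is the negative log density, so dU/dtheta = theta
set.seed(1)
U      <- function(theta) theta^2 / 2
grad_U <- function(theta) theta
n <- 20000; eps <- 0.1; L <- 20
x <- numeric(n); x[1] <- 3
for (i in 2:n) {
  theta <- x[i - 1]
  P <- rnorm(1)                          # draw fresh momentum P ~ N(0, 1)
  H_old <- U(theta) + P^2 / 2            # Hamiltonian at the current state
  P <- P - eps / 2 * grad_U(theta)       # leapfrog: first half step
  for (s in 1:L) {
    theta <- theta + eps * P             # full position step
    if (s < L) P <- P - eps * grad_U(theta)
  }
  P <- P - eps / 2 * grad_U(theta)       # final half step
  P <- -P                                # momentum flip for reversibility
  H_new <- U(theta) + P^2 / 2
  # accept with probability exp(H_old - H_new)
  x[i] <- if (runif(1) < exp(H_old - H_new)) theta else x[i - 1]
}
c(mean(x), sd(x))   # should be near 0 and 1
```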

##### Stan

- Super powerful once you know the syntax (feels like R)
- Full Bayesian inference and maximum likelihood estimation
- Write your model and parameters into the appropriate program blocks

- Everything is compiled down to C++

### Summary

- MCMC is (mostly) easy
- Metropolis-Hastings is the general case

- The proposal algorithm can vary depending on your needs

- To ensure robust sampling:

- Check your samples, and update your procedure accordingly
- Use large sample sizes
- Apply thinning
- Discard the burn-in
- Use the right proposal algorithm for the job

- Hamiltonian MCMC is fantastic for moderate to complex models

- Worth checking out Stan

## New

Importance sampling

Tempered Gibbs sampling

Tempering

exploration vs exploitation

asymptotic variance

single and double flip update