MCMC

What is MCMC

The biggest problem in probabilistic modeling is solving complex integrals

  • MCMC = a method of sampling
    • Can be used to estimate moments of a distribution, contours, etc.
    • Doesn't need the functional form of the posterior
    • Target distribution doesn't need to be normalized
  • Alternatives

    • Direct simulation (simple distributions only)
    • Simple Bayesian models -> use a conjugate prior (when the conjugate family fits the posterior's shape)

    These alternatives are preferable, but generally not available for hierarchical models (a conjugate-prior sketch follows below)
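
As an illustration of the conjugate-prior route, here is a minimal sketch for a Beta-Binomial model; the hyperparameters and data are made up for illustration:

    # Conjugate Beta-Binomial: Beta(a, b) prior, k successes in n trials.
    # The posterior is Beta(a + k, b + n - k) -- closed form, no sampling needed.
    a <- 2; b <- 2     # prior hyperparameters (illustrative)
    k <- 7; n <- 10    # observed data (illustrative)
    curve(dbeta(x, a + k, b + n - k), from = 0, to = 1,
          xlab = expression(theta), ylab = 'Posterior density')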

Intuition

  • It's simpler to implement than you might think.

Monte Carlo

  • A method of obtaining samples from a distribution
    • Generate a number from the uniform distribution
    • Transform the sample (transformation theory); see the sketch below
  • MC alone works only if you know the functional form of the distribution
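
A minimal sketch of the transform step, using inverse-transform sampling; the exponential target is chosen purely for illustration:

    # Inverse-transform sampling: if U ~ Uniform(0,1) and F is the target CDF,
    # then F^{-1}(U) is a draw from the target distribution
    u <- runif(10000)
    rate <- 2                   # illustrative rate parameter
    x <- -log(1 - u) / rate     # quantile function of Exponential(rate)
    hist(x, breaks = 50, freq = FALSE)
    curve(dexp(x, rate = rate), add = TRUE)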

Markov-Chains

  • Directed graph
    • Nodes represent states
    • Edges represent transitions
    • Edge weights are transition probabilities
  • The probability of being in a specific state depends only on the previous state (see the sketch below)
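
A minimal sketch of simulating such a chain, with a made-up two-state transition matrix:

    # Two-state Markov chain; row i of P holds the transition
    # probabilities out of state i
    P <- matrix(c(0.9, 0.1,
                  0.5, 0.5), nrow = 2, byrow = TRUE)
    nStep <- 10000
    s <- integer(nStep); s[1] <- 1
    for (i in 2:nStep){
      s[i] <- sample(1:2, 1, prob = P[s[i-1], ])  # next state depends only on the current one
    }
    table(s) / nStep   # long-run frequencies approach the stationary distribution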

Markov-Chain Monte-Carlo

  • The Markov chain is aperiodic and generative
    • States are previous proposals
    • Transition probabilities are generated from the state value
    • The next state is either the same (on rejection) or generated from the current state

Acceptance/Rejection algorithm

  • Allows samples to be accepted from areas of higher probability
    • MCMC without this = random walk
  • Metropolis-Hastings is by far the most popular algorithm
    • Not to be confused with 'proposal algorithms'

Metropolis-Hastings

  • General case of many different algorithms
    • Metropolis
    • Gibbs Sampler
    • Hamiltonian MCMC

Initialization

  • \(f(x)\) is a density function proportional to the target distribution (i.e. it doesn't have to be normalized)
    • e.g. posterior density
  • A. choose a starting value \(x_0\)
  • B. choose an (arbitrary) proposal distribution \(q\)
    • Poisson/normal/uniform are common -> "Metropolis algorithm"
    • perturbs the current state of the chain

The Algorithm

  • for i in 1:nSamples
    1. \(x_{new} \sim q(x | x_{t-1})\)
    2. calculate the acceptance ratio \(A\), the probability of accepting \(x_{new}\)

      \[A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}\]
      if the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm

    3. Generate \(u \sim \text{Uniform}(0,1)\)
      • if \(u \le A\),
        • accept \(x_{new}\)
      • else,
        • reuse \(x_{t-1}\)

Choosing a Proposal Distribution

  • Different proposal distributions can be more efficient at exploring the target distribution
    • Bimodal distribution example - may need far more samples to converge
  • Consider model constraints/bounds
    • Posterior distribution for a variance parameter: samples can never be less than zero -> use an asymmetric proposal distribution (see the sketch below)
  • Consider the shape of the target distribution
    • Highly kurtotic -> use a proposal distribution with higher variance
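
A minimal sketch of the variance-parameter case, using a log-normal proposal centered on the current state; the Exponential(1) target and the sdlog value are made up for illustration. Because the proposal is asymmetric, the Hastings correction must be kept:

    # Metropolis-Hastings with an asymmetric log-normal proposal,
    # targeting an Exponential(1) density (support x > 0)
    nSamp <- 10000
    x <- rep(1, nSamp + 1)
    for (i in 1:nSamp){
      P <- rlnorm(1, meanlog = log(x[i]), sdlog = 0.5)  # proposals always positive
      # acceptance ratio = target ratio * Hastings (proposal) ratio
      A <- (dexp(P) / dexp(x[i])) *
           (dlnorm(x[i], meanlog = log(P), sdlog = 0.5) /
            dlnorm(P, meanlog = log(x[i]), sdlog = 0.5))
      x[i+1] <- if (runif(1) <= A) P else x[i]
    }
    mean(x)   # should be near 1, the mean of Exponential(1)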

Basic Examples

Example Code: Metropolis - Normal Distribution, few samples
    nSamp <- 100

    #INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    #ALGORITHM
    for (i in 1:nSamp){

      #PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      #ACCEPTANCE RATIO
      A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)

      if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
        x[i+1] <- x[i]             #REJECT NEW, ACCEPT OLD
      } else {
        x[i+1] <- P                #REJECT OLD, ACCEPT NEW
      }
    }

    #PLOT: true density, sampled density, and trace
    yl <- c(-5, 5)
    xseq <- seq(yl[1], yl[2], .01)
    dtrue <- dnorm(xseq, mean = 0, sd = 1)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- matrix(c(1, 2, 3, 3,
                     1, 2, 3, 3,
                     1, 2, 3, 3), nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
                  
Example Output

[Figure: MCMCnrmFew.png (true density, sampled density, and trace)]

Example: Metropolis - Normal Distribution, many samples
    nSamp <- 100000

    #INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    #ALGORITHM
    for (i in 1:nSamp){

      #PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      #ACCEPTANCE RATIO
      A <- dnorm(P, mean = 0, sd = 1) / dnorm(x[i], mean = 0, sd = 1)

      if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
        x[i+1] <- x[i]             #REJECT NEW, ACCEPT OLD
      } else {
        x[i+1] <- P                #REJECT OLD, ACCEPT NEW
      }
    }

    #PLOT: true density, sampled density, and trace
    yl <- c(-5, 5)
    xseq <- seq(yl[1], yl[2], .01)
    dtrue <- dnorm(xseq, mean = 0, sd = 1)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- matrix(c(1, 2, 3, 3,
                     1, 2, 3, 3,
                     1, 2, 3, 3), nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(dtrue, xseq, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', ylab = '', xlab = 'True')
    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
                  
Example Output

[Figure: MCMCnrmMany.png (true density, sampled density, and trace)]

Example: Metropolis - Complex Bayes (2 log-normal observations, exponential prior)
    nSamp <- 100000

    #INITIALIZATION
    x <- rep(0, nSamp + 1)
    x[1] <- 3

    #ALGORITHM
    for (i in 1:nSamp){

      #PROPOSAL BASED UPON PREVIOUS X
      P <- rnorm(1, x[i], sd = 1)

      #ACCEPTANCE RATIO (unnormalized target, so no normalizing constant needed)
      A <- (dexp(P)^2 * dlnorm(P, meanlog = 4, sdlog = 2)) /
           (dexp(x[i])^2 * dlnorm(x[i], meanlog = 4, sdlog = 2))

      if (runif(1) > A){           #GENERATE UNIFORM PROBABILITY
        x[i+1] <- x[i]             #REJECT NEW, ACCEPT OLD
      } else {
        x[i+1] <- P                #REJECT OLD, ACCEPT NEW
      }
    }

    #PLOT: sampled density and trace (no closed-form true density here)
    yl <- c(-5, 5)

    t <- c(1:(nSamp + 1))
    d.y <- density(x)

    lmat <- matrix(c(1, 2, 2,
                     1, 2, 2,
                     1, 2, 2), nrow = 3, byrow = TRUE)
    layout(lmat)

    plot(d.y$y, d.y$x, ylim = yl, xlim = rev(range(d.y$y)), type = 'l', yaxt = 'n', ylab = '', xlab = 'Generated')
    plot(t, x, type = "p", ylim = yl, pch = 20, yaxt = 'n', ylab = '')
                  
Example Output

[Figure: MCMCnrmComplex.png (sampled density and trace)]

Issues

Starting in a state of extremely low probability
  • Solution - discard burn-in
    • Burn-in - the time the chain takes to stabilize
    • The cutoff can be arbitrary - check the trace plot across iterations
    • Generally 1000-5000 samples
Sparse distribution areas
  • Solutions
    • Larger sample size
    • Repeat with a range of different initial values
    • Use a better suited proposal distribution/algorithm
Autocorrelation of samples from the Markov chain
  • Solution - thinning (see the sketch below)
    • e.g. keep every 3rd sample, discard the rest
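
A minimal sketch of both fixes applied to a chain x from the examples above; the burn-in length and thinning interval are arbitrary choices:

    burnIn <- 1000                                  # arbitrary; verify on the trace plot
    xKeep  <- x[-(1:burnIn)]                        # discard burn-in
    xThin  <- xKeep[seq(1, length(xKeep), by = 3)]  # keep every 3rd sample
    acf(xThin)                                      # autocorrelation should now decay quickly
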
Multivariate distributions

Curse of Dimensionality

  • Solutions
    • Use Gibbs sampler
    • Use Hamiltonian MCMC

Proposal algorithms

Metropolis

  • Proposal is simply drawn from a uniform or normal distribution
  • Hastings term = 1
  • Discovered before Metropolis-Hastings, so the naming is confusing

Gibbs sampler

When to use
  • When you know the posterior distribution (required)
  • With multivariate distributions -> breaks the proposal down into smaller proposals, one per dimension
  • The conditional distribution of each variable is known and easy to sample from (see the sketch below)
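
A minimal sketch for a bivariate normal target, where both full conditionals are normal; the correlation value is made up for illustration:

    # Gibbs sampler for a standard bivariate normal with correlation rho:
    # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
    rho <- 0.8
    nSamp <- 10000
    x1 <- x2 <- rep(0, nSamp)
    for (i in 2:nSamp){
      x1[i] <- rnorm(1, mean = rho * x2[i-1], sd = sqrt(1 - rho^2))
      x2[i] <- rnorm(1, mean = rho * x1[i],   sd = sqrt(1 - rho^2))
    }
    cor(x1, x2)   # should be close to rho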

Hamiltonian (variation of Metropolis-Hastings)

  • Super great for Bayesian problems
  • Only steps 1 & 2 are different
    • … but very different
When to use
  • Continuous distributions (required)
  • Whenever possible…
  • Stan
Benefits
  • Reduces correlation between sampled states
  • Converges much faster - immune to stuck states (torus distribution)
    • "less random" proposals
Problems
  • U-turns - samples will begin to look very similar after enough samples
    • Use the "No U-Turn sampler" (NUTS) variant in Stan
      • easier to use anyways
  • Multi-modal distributions
    • Still the best solution
  • Can't be used with discrete distributions
Intuition

Use quantum particle physics as an analogy

1. Think of the probability density as an inverted energy well
  • Hockey puck analogy
    • The particle spends more time at the bottom of the well (lower energy states)
    • Take advantage of this to be more efficient (draw more proposals from high-probability areas)
2. Energy states
  • …are discrete
    • If the particle absorbs energy, it can jump states
    • The particle is constantly absorbing and radiating
  • Summary of energy states -> canonical distribution
    • \[p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when Boltzmann constant = 1}\]
    • (Gibbs canonical distribution)
  • Take advantage of the energy landscape in order to explore it more efficiently
    • Using joint distributions
  • Every time we accept a new proposal of \(P\), the system is at a different energy level
Relation of Hamiltonian to a posterior density
  • Some (perhaps) familiar equations
    • \[p(E) \propto e^{-\frac{E(\theta,P)}{T}}\\\]
    • \[E(\theta,P)= H \equiv U(\theta) + K(P) = \text{const.}\\\]
    • \[K(P)=\sum^d_{i=1}\frac{P_i^2}{2m}\\\]
    • \[U(\theta) = -\log (p(x|\theta) p(\theta)) = \text{NLP (negative log posterior)}\\\]
  • when mass and \(T\) = 1,
    • \[p(\theta,P) \propto p(x|\theta)p(\theta)e^{-\frac{P^2}{2}}\\\]
      • \[\equiv p(x|\theta)p(\theta)N(P|0,1)\\\]
  • Note:
    • \[\int p(\theta,P)\ dP = \frac{1}{z} p(x|\theta)p(\theta) \int N(P|0,1)\ dP\\\]
    • \[p(\theta|x)=\frac{1}{z}p(x|\theta)p(\theta)\\\]
    • So, \(\theta\) and \(P\) can be sampled independently
Proposals
  • \(p \sim N(0,1)\)
  • \(\theta\) from the path of the particle through \((\theta,P)\) space
Solve for the path of the particle from U and K
  • \[\frac{dP}{dt}=-\frac{\partial H}{\partial \theta}\]
  • \[\frac{d\theta}{dt}=\frac{\partial H}{\partial P}\]
  • Integrate continuously, in small enough steps
  • For \(t\) amount of time
  • Flip the sign of the momentum after the last step of the path
  • This yields the proposals \(\theta_{new}\) and \(P_{new}\)
Leapfrog integrator
  • symplectic -> error oscillates around the exact solution rather than accumulating; built exactly for Hamiltonian systems
  • for each step
    • \(P_{n+0.5} = P_n - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_n)\)
    • \(\theta_{n+1} = \theta_n + \epsilon P_{n+0.5}\)
    • \(P_{n+1} = P_{n+0.5} - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta}(\theta_{n+1})\)
Acceptance ratio
  • \[A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}\]
    • Paths are deterministic
      • so the forward term = 1, and
      • the reverse term = 1 as well, because the dynamics are reversible (not intuitive)
  • \[\\ A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}\]
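
Putting the pieces together, a minimal sketch for a standard normal target, so \(U(\theta)=\theta^2/2\) and \(\partial U/\partial \theta = \theta\); the step size and number of leapfrog steps are arbitrary tuning choices:

    # Minimal Hamiltonian MC targeting N(0,1): U(theta) = theta^2/2
    U     <- function(theta) theta^2 / 2
    gradU <- function(theta) theta
    eps <- 0.1; L <- 20             # leapfrog step size and number of steps
    nSamp <- 10000
    theta <- rep(0, nSamp + 1)
    for (i in 1:nSamp){
      p  <- rnorm(1)                # fresh momentum, p ~ N(0,1)
      th <- theta[i]; pNew <- p
      # leapfrog integration of Hamilton's equations
      pNew <- pNew - (eps / 2) * gradU(th)          # initial half step for momentum
      for (s in 1:L){
        th <- th + eps * pNew                       # full step for position
        if (s < L) pNew <- pNew - eps * gradU(th)   # full step for momentum
      }
      pNew <- pNew - (eps / 2) * gradU(th)          # final half step for momentum
      pNew <- -pNew                                 # flip momentum for reversibility
      # accept with probability min(1, exp(H_old - H_new))
      A <- exp(U(theta[i]) + p^2 / 2 - U(th) - pNew^2 / 2)
      theta[i+1] <- if (runif(1) <= A) th else theta[i]
    }
    c(mean(theta), sd(theta))       # should be near 0 and 1
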
Stan
  • Super powerful once you know the syntax (feels like R)
  • Full Bayesian inference and maximum likelihood estimation
  • Input your model and parameters into the correct sections (see the sketch below)
    • Everything is compiled to C++
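
A minimal sketch of calling Stan from R via the rstan package; the normal model and simulated data are made up for illustration:

    library(rstan)

    # Stan program as a string: data, parameters, and model sections
    mc <- "
    data { int<lower=0> N; vector[N] y; }
    parameters { real mu; real<lower=0> sigma; }
    model { y ~ normal(mu, sigma); }
    "
    y <- rnorm(50, mean = 2, sd = 1)   # simulated data (illustrative)
    fit <- stan(model_code = mc, data = list(N = length(y), y = y),
                iter = 2000, chains = 4)
    print(fit)                         # posterior summaries for mu and sigma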

Summary

  • MCMC is (mostly) easy
  • Metropolis-Hastings is the general case
    • The proposal algorithm can vary depending on your needs
  • To ensure robust sampling:
    • Check your samples, update your procedure
    • Use large sample sizes
    • Apply thinning
    • Discard burn-in
    • Use the correct proposal algorithm for the job
  • Hamiltonian MCMC is fantastic for moderate to complex models
    • Worth checking out Stan

Date: 2018-08-14

Author: Dave White

Created: 2018-10-07 Sun 13:24
