2018-08-14

__The biggest problem in probabilistic modeling is solving complex integrals__

- MCMC = Method of Sampling
- Can be used to estimate moments of distribution, contours etc..

- Don't need functional form of posterior

- Target distribution doesn't need to be normalized

- Alternatives
- Direct simulation (simple distributions only)

- Simple Bayesian -> Use conjugate prior (if posterior shape is reasonable)


- It's simpler to implement than you may think.

__Method of obtaining samples from a distribution__

- Generate a number from the uniform distribution

- Transform sample (transformation theory)

- MC alone works only if you know the functional form of the distribution
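The two-step recipe above (generate a uniform draw, then transform it) can be sketched in Python for an Exponential(λ) target, whose inverse CDF is known in closed form; the target choice here is illustrative, not from the notes:

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-transform sampling: draw u ~ Uniform(0,1), then map it
    through the inverse CDF x = -ln(1 - u) / lam of Exponential(lam)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
mean = sum(samples) / len(samples)  # should land near 1/lam
```

When the inverse CDF has no closed form, this transformation approach breaks down, which is exactly where MCMC becomes useful.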

__Directed graph__

- Nodes represent states

- Edges represent transitions

- Edge weights are transition probabilities

- Probability of being in a specific state depends only upon the last state

- The Markov chain is aperiodic and generative
- states are previous proposals

- transition probabilities are generated from the state value

- next possible state is the same or generated from the current state


- Allows samples to be accepted from areas of higher probability
- MCMC without this = random walk

- Metropolis-Hastings is by far the most popular algorithm
- Not to be confused with 'proposal algorithms'


- General case of many different algorithms
- Metropolis

- Gibbs Sampler

- Hamiltonian MCMC

- …


*f*(*x*) is a density function proportional to the target distribution (i.e. it doesn't have to be normalized)

- e.g. posterior density

- A. choose a starting value
*x*_{0}

- B. choose an (arbitrary) proposal distribution
*q*

- poisson/normal/uniform are common -> "Metropolis Algorithm"

- perturbs the current state of the chain

for i in 1:nSamples

*x*_{new}∼*q*(*x*_{t − 1})

calculate the acceptance ratio, the probability of accepting *x*_{new}:

$$A=\frac{f(x_{new})}{f(x_{t-1})}\frac{q(x_{t-1}|x_{new})}{q(x_{new}|x_{t-1})}$$

if the proposal distribution is symmetric, the proposal ratio = 1 -> Metropolis algorithm

Generate *u* ∼ Uniform(0,1)

- if *u* ≤ *A*, accept *x*_{new}
- else, reuse *x*_{t − 1}

- Different proposal distributions can be more efficient at exploring the target distribution
- Bimodal distribution example - may need way more samples to converge

- Consider model constraints/bounds
- Posterior distribution for a variance parameter: samples are **never** less than zero -> asymmetric proposal distribution
- Consider shape of target distribution
- Highly kurtotic -> should use a proposal distribution with higher variance

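To sketch the asymmetric-proposal case above in Python: a Metropolis-Hastings loop for a positive-only parameter, using a log-normal random-walk proposal so draws can never go below zero. The Exp(1) target and σ = 0.5 are illustrative assumptions, not values from the notes:

```python
import math
import random

def mh_positive_param(n_samples, sigma=0.5, seed=0):
    """Metropolis-Hastings with an asymmetric (log-normal) proposal.
    Target: unnormalized Exp(1) density; proposals stay strictly positive."""
    rng = random.Random(seed)
    f = lambda x: math.exp(-x)  # unnormalized target density
    x = 1.0
    out = []
    for _ in range(n_samples):
        new = x * math.exp(rng.gauss(0.0, sigma))  # log-normal step: always > 0
        # acceptance ratio = target ratio * Hastings correction;
        # for this proposal q(old|new)/q(new|old) reduces to new/x
        A = (f(new) / f(x)) * (new / x)
        if rng.random() < A:
            x = new
        out.append(x)
    return out

samples = mh_positive_param(50_000)
```

Because the proposal is asymmetric, dropping the Hastings term here would bias the chain; the correction factor is what makes the asymmetric proposal valid.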

```
nSamp <- 100

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){
  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)
  #ACCEPTANCE RATIO
  A = ( dnorm(P,mean=0,sd=1) )/( dnorm(x[i],mean=0,sd=1) )
  if (runif(1) > A){ #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i] #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
xseq <- seq(yl[1],yl[2],.01)
dtrue <- dnorm(xseq,mean=0,sd=1)
t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,3,3,
          1,2,3,3,
          1,2,3,3)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)
plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
```

```
nSamp <- 100000

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){
  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)
  #ACCEPTANCE RATIO
  A = ( dnorm(P,mean=0,sd=1) )/( dnorm(x[i],mean=0,sd=1) )
  if (runif(1) > A){ #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i] #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
xseq <- seq(yl[1],yl[2],.01)
dtrue <- dnorm(xseq,mean=0,sd=1)
t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,3,3,
          1,2,3,3,
          1,2,3,3)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)
plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
```

```
nSamp <- 100000

#INITIALIZATION
x = rep(0,nSamp+1)
x[1] = 3

#ALGORITHM
for (i in 1:nSamp){
  #PROPOSAL BASED UPON PREVIOUS X
  P = rnorm(1,x[i],sd=1)
  #ACCEPTANCE RATIO
  A = ( dexp(P)^2 * dlnorm(P,meanlog=4,sdlog=2) )/( dexp(x[i])^2 * dlnorm(x[i],meanlog=4,sdlog=2) )
  if (runif(1) > A){ #GENERATE UNIFORM PROBABILITY
    x[i+1] = x[i] #REJECT NEW, ACCEPT OLD
  } else {
    x[i+1] = P #REJECT OLD, ACCEPT NEW
  }
}

yl = c(-5,5)
#xseq <- seq(yl[1],yl[2],.01)
#dtrue <- dnorm(xseq,mean=0,sd=1)
t <- c(1:(nSamp+1))
d.y <- density(x)

lmat <- c(1,2,2,
          1,2,2,
          1,2,2)
lmat <- matrix(lmat,nrow=3,byrow=TRUE)
layout(lmat)
#plot(dtrue, xseq, ylim=yl, xlim=rev(range(d.y$y)), type='l',ylab='',xlab='True')
plot(d.y$y, d.y$x, ylim=yl, xlim=rev(range(d.y$y)), type='l',yaxt='n',ylab='',xlab='Generated')
plot(t,x,type="p",ylim=yl,pch=20,yaxt='n',ylab='')
```

- Solution - Discard burn-in
- Burn-in - time it takes the chain to stabilize

- can be arbitrary - make sure to check the iterations

- generally 1000-5000

- Solutions
- Larger sample size

- Repeat with a range of different initial values

- Use a better suited proposal distribution/algorithm


- Solution - thinning
- keep every 3rd sample, discard the others
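Both fixes are just slicing once the chain is stored; a minimal Python sketch, using the burn-in size and every-3rd-sample thinning mentioned above as illustrative defaults:

```python
def postprocess(chain, burn_in=1000, thin=3):
    """Discard the first `burn_in` draws, then keep every `thin`-th sample."""
    return chain[burn_in::thin]

chain = list(range(10_000))   # stand-in for raw MCMC draws
kept = postprocess(chain)     # 3000 samples remain
```

Thinning trades away sample size to reduce autocorrelation between the retained draws.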

Curse of Dimensionality

- Solutions
- Use Gibbs sampler

- Use Hamiltonian MCMC


- Proposal is simply drawn from a uniform or normal distribution

- Hastings term = 1

- Discovered before Metropolis-Hastings, so the naming is confusing

- When you know the posterior distribution (required)

- With multivariate distributions -> breaks down proposal into smaller proposals according to dimension

- Conditional distribution of each variable is known and easy to sample from

- Super great for Bayesian problems

- Only steps 1 & 2 are different
- … but very different

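The Gibbs idea above (break a multivariate proposal into per-dimension draws from known conditionals) can be illustrated in Python with a bivariate normal with correlation ρ, a standard textbook target rather than one from the notes; each full conditional x | y ~ N(ρy, 1 − ρ²) is known and easy to sample from:

```python
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampling for (x, y) ~ N(0, [[1, rho], [rho, 1]]).

    Each step draws one coordinate from its full conditional,
    which is a univariate normal -- no accept/reject step needed."""
    rng = random.Random(seed)
    sd = (1.0 - rho**2) ** 0.5
    x, y = 0.0, 0.0
    out = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)   # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)   # y | x ~ N(rho*x, 1 - rho^2)
        out.append((x, y))
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=50_000)
```

Every proposal is accepted, which is why Gibbs works so well when the conditionals are tractable, as in many conjugate Bayesian models.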

- Continuous distributions (required)

- Whenever possible…

- Stan

- Reduces correlation between sampled states

- Converges much faster - immune to stuck states (torus distribution)
- "less random" proposals


- U-turns - samples will begin to look very similar after enough samples
- Use the "No U-Turn sampler" (NUTS) variant in Stan
- easier to use anyway
- Multi-modal distributions
- Still the best solution

- Can't be used with discrete distributions

Use quantum particle physics as an analogy

- Hockey puck analogy
- Particle will spend more time in the bottom of the well (lower energy states)

- Take advantage of this in order to be more efficient (grab more proposals from high-probability areas)


Energy states are discrete

- If particle absorbs energy, it can jump states

- Particle is constantly absorbing and radiating

Summary of energy states -> canonical distribution

- $$p(E_i) \propto e^{-\frac{E_i}{T}} \text{, when Boltzmann constant = 1}$$

- (Gibbs Canonical distribution)
Take advantage of the energy landscape in order to better explore it

- Using joint distributions

Every time we accept a new proposal of *p*, the system is at a different energy level

Some (perhaps) familiar equations

- $$p(E) \propto e^{-\frac{E(\theta,P)}{T}}\\$$

- $$E(\theta,P)= H \equiv U(\theta) + K(P) = \text{const.}\\$$

- $$K(P)=\sum^d_{i=1}\frac{P_i^2}{2m}\\$$

- $$U(\theta) = -\log (p(x|\theta) p(\theta)) = \text{NLP}\\$$

when mass and *T* = 1,

- $$p(\theta,P) \propto p(x|\theta)p(\theta)e^{-\frac{P^2}{2}}\\$$

- $$\equiv p(x|\theta)p(\theta)N(P|0,1)\\$$
Note:

- $$\int p(\theta,P)\ dp = \frac{1}{z} p(x|\theta)p(\theta) \int N(P|0,1)\ dp\\$$

- $$p(\theta|x)=\frac{1}{z}p(x|\theta)p(\theta)\\$$

- So, *θ* and *p* can be sampled independently


*p*∼*N*(0,1)

*θ* from the path of the particle in both dimensions

$$\frac{dP}{dt}=-\frac{\partial H}{\partial \theta}$$

$$\frac{d\theta}{dt}=\frac{\partial H}{\partial P}$$

Continuously by small enough steps

For

*t*amount of time

Flip sign of momentum after solving for last step of path

Generate proposals *θ*_{new} and *P*_{new}

symplectic -> oscillates around the exact solution, handles error well, built exactly for Hamiltonian systems

for each step

$P_{n+0.5} = P_n - \frac{e}{2} \frac{\partial U}{\partial \theta}(\theta_n)$

$\theta_{n+1} = \theta_n + e P_{n+0.5}$

$P_{n+1} = P_{n+0.5} - \frac{e}{2} \frac{\partial U}{\partial \theta}(\theta_{n+1})$

$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)} \frac{p(\theta_{t-1},p_{t-1}|\theta_{new},p_{new})}{p(\theta_{new},p_{new}|\theta_{t-1},p_{t-1})}$$

- Paths are deterministic
- so the forward term = 1, but

- the reverse term = 1 as well (not intuitive)

$$A = \frac{p(x|\theta_{new})p(\theta_{new})N(p_{new}|0,1)}{p(x|\theta_{t-1})p(\theta_{t-1})N(p_{t-1}|0,1)}$$
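Putting the pieces together (fresh momentum draw, leapfrog steps, momentum flip, accept/reject with the simplified ratio), here is a bare-bones HMC sketch in Python for a standard normal target; the target U(θ) = θ²/2, step size e, and path length L are illustrative assumptions, not values from the notes:

```python
import math
import random

def hmc_standard_normal(n_samples, e=0.2, L=15, seed=0):
    """Hamiltonian Monte Carlo targeting N(0,1), i.e. U(theta) = theta^2/2
    so dU/dtheta = theta. e and L are tuning choices."""
    rng = random.Random(seed)
    U = lambda th: 0.5 * th * th
    grad_U = lambda th: th
    theta = 3.0
    out = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)            # fresh momentum, P ~ N(0,1)
        th_new, p_new = theta, p
        # leapfrog: half momentum step, L position steps, half momentum step
        p_new -= 0.5 * e * grad_U(th_new)
        for step in range(L):
            th_new += e * p_new
            if step < L - 1:
                p_new -= e * grad_U(th_new)
        p_new -= 0.5 * e * grad_U(th_new)
        p_new = -p_new                     # flip momentum sign (reversibility)
        # accept with probability min(1, exp(H_old - H_new))
        dH = (U(theta) + 0.5 * p * p) - (U(th_new) + 0.5 * p_new * p_new)
        if rng.random() < math.exp(min(0.0, dH)):
            theta = th_new
        out.append(theta)
    return out

samples = hmc_standard_normal(5_000)
```

Because the leapfrog integrator nearly conserves H, the acceptance probability stays close to 1 even for long, low-correlation jumps, which is the practical payoff of HMC.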

- Super powerful once you know the syntax (feels like R)

- Full Bayesian inference and maximum likelihood estimation

- Input your model and parameters into the correct sections
- Everything is converted to run in C++

- MCMC is (mostly) easy

- Metropolis-Hastings is the general case
- Proposal algorithm can vary depending on your needs

- To ensure robust sampling:
- Check your sampling, update your procedure

- Use large sample sizes

- Apply Thinning

- Discard burn-in

- Use the correct proposal algorithm for the job

- Hamiltonian MCMC is fantastic for moderate to complex models
- Worth checking out Stan


Importance sampling

Tempered Gibbs sampling

Tempering

exploration vs exploitation

asymptotic variance

single and double flip update