Based on materials from:
"Practical" means that we will move pretty fast through the foundations needed for the tutorial one week from today.
What is a probability?
Frequentist: measures likelihood of different outcomes of an uncertain but infinitely repeatable process.
Bayesian: a probability can be assigned to any statement.
This necessarily requires some subjective input when you cannot simply measure frequencies.
Roughly speaking, the two extremes are:
In practice, any sensible methodology gives consistent answers with sufficiently informative data.
Whatever you consider your valid probabilities, $P(A)$ and $P(B)$, for (outcomes or statements) $A$ and $B$, you can always define the conditional probabilities:
$$ P(A\mid B) \equiv \frac{P(A\,\text{and}\,B)}{P(B)} \quad,\quad P(B\mid A) \equiv \frac{P(A\,\text{and}\,B)}{P(A)} $$($A$ and $B$ are random variables or logical propositions about possible outcomes)
Bayes' rule is then just a consequence of this definition and entirely uncontroversial:
$$ \boxed{ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} } $$Bayesian statistics starts from a joint probability distribution
$$ P(D, \Theta_M, M) $$over data $D$, model parameters $\Theta_M$ and hyperparameters $M$.
The subscript on $\Theta_M$ is to remind us that, in general, the set of parameters being used depends on the hyperparameters.
Bayesian inference consists of evaluating:
$$ P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$Example: infer the rate $\theta$ of a binomial process, e.g.:
Let's fix:
Assume we get $k = 7$ successes out of $n = 10$ trials.
bayes_learn(k=7, n=10, prior='flat')
With a "flat prior", posterior = likelihood. The frequentist and Bayesian approaches give the same results.
What we learn after one failure, given a flat prior:
bayes_learn(k=0, n=1, prior='flat')
What we learn after one success, given a flat prior:
bayes_learn(k=1, n=1, prior='flat')
More realistically, we have prior information about the process. For example, we believe that success is relatively unlikely:
bayes_learn(k=7, n=10, prior='unlikely')
And here are the one-failure and one-success updates with an informative prior:
bayes_learn(k=0, n=1, prior='unlikely')
bayes_learn(k=1, n=1, prior='unlikely')
We are interested in the posterior: $$ P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$
The likelihood and prior are generally computable, but the normalization constant $P(D\mid M)$ requires an integral over all possible parameters: $$ P(D\mid M) = \int d\Theta_M'\, P(D\mid \Theta_M', M)\,P(\Theta_M'\mid M) $$
In rare cases, the normalization constant can be calculated analytically.
In most cases, an approximate method is required. Two good options:
The underlying assumptions and numerical algorithms involved (sampling and optimization) are fundamentally different, leading to different trade-offs between these methods.
Certain combinations of likelihood and prior functions give an evidence integral that can be performed analytically.
In some special cases, the resulting posterior has the same general functional form as the prior. We then say that the prior and likelihood are conjugate.
There are only a few useful cases like this, but they include some important distributions, e.g.
The full list is here.
A stochastic process is basically a black-box generator of random sequences $$ x_0, x_1, x_2, \ldots $$
In general, the value $x_n$ depends on the history of all previous samples, so the black box has long-term memory.
A (first-order) Markov chain is a special case where $x_n$ only depends directly on $x_{n-1}$: a black box with very short-term memory.
A (time) homogeneous Markov chain uses the same 'rule' $P(X_n\mid X_{n-1})$ to generate $x_n$ from $x_{n-1}$ for all $n$.
A (time) reversible Markov chain has the same probabilistic 'rule' going in either direction, $P(X_n\mid X_{n-1}) = P(X_{n-1}\mid X_n)$.
In practice, a Markov chain is fully specified by two probabilistic 'rules':
Under certain conditions, a Markov chain converges to a unique equilibrium (limiting stationary) distribution, which can be thought of as the long-run distribution of the chain that can be reached no matter where we start.
For a reversible Markov chain, it is easier to check that it has an equilibrium distribution, and design the transition 'rule' $P(X_n\mid X_{n-1})$ so that the chain has a specific equilibrium distribution, namely:
Let's define a reversible Markov chain, which is also an example of MCMC (from Kruschke's Doing Bayesian Data Analysis). The story goes as follows.
A politician visits a chain of islands as part of their political campaign.
Assume the politician starts on a random island, so $P(X_0)$ is discrete uniform.
The politician uses the following rule, that is, $P(X_n\mid X_{n-1})$, to determine which island to visit next:
After doing this for many days:
Thus, the chain estimates the distribution of island populations correctly.
Let's generate 10 islands with populations randomly generated between $10$ and $100$:
Assume this is population in thousands.
islands = make_islands(10)
islands
array([ 0, 93, 44, 32, 12, 45, 90, 22, 55, 73, 99, 0])
(islands/islands.sum()).round(2)[1:-1]
array([0.16, 0.08, 0.06, 0.02, 0.08, 0.16, 0.04, 0.1 , 0.13, 0.18])
plot_islands(islands)
Let's randomly choose a starting island:
starting_island = np.random.randint(10)
starting_island
6
Let's simulate 10000 days of our politician's campaign trail:
samples = hop(islands, start=starting_island, niter=10000)
samples[:50]
array([ 6, 5, 5, 5, 6, 5, 5, 5, 6, 6, 6, 5, 5, 6, 7, 6, 6,
5, 5, 5, 5, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6, 6,
5, 5, 6, 7, 8, 9, 10, 9, 10, 9, 10, 9, 10, 9, 8, 7])
Let's get the counts and the normalized counts (empirical probabilities) for each island:
data = np.bincount(samples)[1:]; print(data)
data = data/data.sum(); data.round(2)
[1811 874 605 218 830 1595 397 967 1212 1492]
array([0.18, 0.09, 0.06, 0.02, 0.08, 0.16, 0.04, 0.1 , 0.12, 0.15])
sns.barplot(x=np.arange(1, len(data)+1), y=data);
The influence of the starting point is visible for the first 500 samples or so ...
data_first_500 = np.bincount(samples[:500])[1:]; print(data_first_500)
data_first_500 = data_first_500/data_first_500.sum(); print(data_first_500.round(2))
sns.barplot(x=np.arange(1, len(data)+1), y=data_first_500);
[156 89 58 14 54 82 11 16 13 7] [0.31 0.18 0.12 0.03 0.11 0.16 0.02 0.03 0.03 0.01]
... but wanes fairly quickly after that:
data_next_1000 = np.bincount(samples[500:1500])[1:]; print(data_next_1000)
data_next_1000 = data_next_1000/data_next_1000.sum(); print(data_next_1000.round(2))
sns.barplot(x=np.arange(1, len(data)+1), y=data_next_1000);
[230 108 89 34 113 192 44 67 57 66] [0.23 0.11 0.09 0.03 0.11 0.19 0.04 0.07 0.06 0.07]
For practical applications, there are two issues to deal with:
The second issue requires solving an inverse problem, which is generally challenging.
However, there is a general family of methods to build a Markov chain with a desired probability density:
MHG is the most general, but the simpler MH contains the essential ideas so we will focus on that.
The MH algorithm relies on a proposal distribution $Q$ that is easier to sample than $\tilde{P}$.
(If you knew how to sample $\tilde{P}$ directly, you would not need MCMC!)
We often use a multivariate Gaussian for $Q$ since it is easy (and efficient) to sample from. Any proposal distribution is valid, but choosing a $Q$ "closer" to $\tilde{P}$ generally reaches the desired equilibrium faster.
The proposal distribution can either be used to update relative to the current state ("random walk"): $$ X_{n+1} - X_n \sim Q $$ or else to generate a new independent state each time: $$ X_{n+1} \sim Q \; . $$
During each update we evaluate a proposed move to $x_{n+1}$ by calculating the Hastings ratio, $$ r(x_{n+1}, x_n) \equiv \frac{\tilde{P}(x_{n+1})}{\tilde{P}(x_n)}\, \frac{Q(x_n\mid x_{n+1})}{Q(x_{n+1}\mid x_n)} \; , $$ where $\tilde{P}$ is the desired equilibrium distribution.
Since $\tilde{P}$ only appears in a ratio, it does not need to be normalized: this feature is why MCMC is useful for practical Bayesian inference.
MCMC still requires that you can calculate un-normalized values of $\tilde{P}(x)$.
In general, the Hastings ratio is $\ge 0$ but it can otherwise be arbitrarily large.
We always accept a proposed move when $r(x_{n+1}, x_n) \ge 1$. Otherwise, we accept it with a probability of $0\le r(x_{n+1}, x_n) < 1$.
When a proposed move is rejected, the update returns the original value, so repetitions of the same output will occur.
When the proposal distribution $Q$ is symmetric:
e.g, uniform or Gaussian, the Hastings ratio simplifies to:
This results in the simpler Metropolis algorithm.
Here's a generic implementation of the Metropolis algorithm:
def metropolis(start, target, proposal, niter):
"""start: starting position, target: the target distribution,
proposal: the proposal distribution, niter: number of iterations"""
current = start
samples = [current] # recording starting value
for i in range(niter):
proposed = proposal(current) # propose new value
p = min(target(proposed)/target(current), 1) # acceptance probability
if np.random.random() < p: # accept or reject proposal
current = proposed # if accepted, update current value
samples.append(current) # record current value
return samples
Applied to the politician's campaign trail:
target = lambda x: islands[x] # unnormalized distribution
proposal = lambda x: x + np.random.choice([-1, 1]) # symmetric
samples = metropolis(starting_island, target, proposal, 10000)
data = np.bincount(samples)[1:]; data = data/data.sum()
sns.barplot(x=np.arange(len(data)), y=data);
Consider fitting a straight line $y = m x + b$, with parameters $m$ and $b$, to data with two features $x$ and $y$.
Use the log-likelihood function:
$$ \log{\cal L}(m, b; D) = -\frac{N}{2}\log(2\pi\sigma_y^2) -\frac{1}{2\sigma_y^2} \sum_{i=1}^N\, (y_i - m x_i - b)^2 \; , $$where the error in $y$, $\sigma_y$, is a fixed hyperparameter.
First generate some data on a straight line with measurement errors in $y$ (so our assumed model is correct):
gen = np.random.RandomState(seed=123)
N, m_true, b_true, sigy_true = 10, 0.5, -0.2, 0.1
x_data = gen.uniform(-1, +1, size=N)
y_data = m_true * x_data + b_true + gen.normal(scale=sigy_true, size=N)
plt.errorbar(x_data, y_data, sigy_true, fmt='o', markersize=5)
plt.plot([-1, +1], [-m_true+b_true,+m_true+b_true], 'r:')
plt.xlabel('x'); plt.ylabel('y');
Next, implement the log-likelihood function: $$ \log{\cal L}(m, b; D) = -\frac{N}{2}\log(2\pi\sigma_y^2) -\frac{1}{2\sigma_y^2} \sum_{i=1}^N\, (y_i - m x_i - b)^2 \; , $$
def loglike(x, y, m, b, sigy):
norm = 0.5 * len(x) * np.log(2 * np.pi * sigy**2)
return -0.5 * np.sum((y - m * x - b) ** 2) / sigy**2 - norm
Finally, generate some MCMC samples of the posterior $P(m, b\mid D, M)$ assuming uniform priors $P(b,m\mid \sigma_y) = 1$:
samples = MCMC(loglike, m=[m_true], b=[b_true]);
jointplot('m', 'b', samples)
We always require a starting point to generate MCMC samples. In the last example, we used the true parameter values as starting points:
m=[m_true], b=[b_true]
What happens if you chose different starting points?
samples2 = MCMC(loglike, m=[m_true+0.1], b=[b_true+0.1])
samples.describe(percentiles=[])
| m | b | |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 0.531518 | -0.163364 |
| std | 0.074150 | 0.032178 |
| min | 0.277887 | -0.271737 |
| 50% | 0.533564 | -0.162615 |
| max | 0.799155 | -0.039917 |
samples2.describe(percentiles=[])
| m | b | |
|---|---|---|
| count | 10000.000000 | 10000.000000 |
| mean | 0.526889 | -0.164017 |
| std | 0.071437 | 0.033185 |
| min | 0.287060 | -0.263188 |
| 50% | 0.524377 | -0.164970 |
| max | 0.800168 | -0.051169 |
The changes are small compared with the offsets ($\pm 0.1$) and the parameter uncertainties.
The inference above assumes flat priors for $m$ and $b$, but you can add any log-prior to our log-likelihood to change this.
For example, suppose our prior belief is that $0.4 \le m \le 0.7$:
samples = MCMC(loglike, m=[m_true, TopHat(0.4,0.7)], b=[b_true])
jointplot('m', 'b', samples)
Perhaps we also have a prior measurement that found $b = -0.20 \pm 0.02$ (in which case, the new data is not adding much information about $b$):
samples = MCMC(loglike,
m=[m_true,TopHat(0.4,0.7)],
b=[b_true,Gauss(-0.20,0.02)])
jointplot('m', 'b', samples)
It is tempting to assume that MCMC samples have desirable properties beyond their minimum guarantees, since this is often true, but avoid this temptation.
MCMC samples are only guaranteed to sample your target $\tilde{P}$ after some unknown number of samples:
Burn-in? Should I throw away the first $B$ samples to ensure that my chain is independent of its initial starting point?
There are no useful guarantees about the correlation between $X_{n+k}$ and $X_n$ being small and, in general, you should assume that the consecutive samples in any stretch of the chain are highly correlated.
Thinning? Should I just keep every $T$-th sample so that my chain is uncorrelated?
How long should your chain be?
You should ideally use empirical measurements to determine $k$ such that the autocorrelation $$ \frac{\langle (X_{n+k} - \mu) (X_n - \mu)\rangle}{\sigma^2} \simeq 0 \; , $$ where $\mu$ and $\sigma$ are the long-term mean and standard deviation of $\tilde{P}(X_n)$, then generate a chain whose length is at least 10-100 times this autocorrelation length $k$.
For Gibbs samplers, the Gelman & Rubin ($\hat{R}$) metric is also popular.
Which update rule should you use?
Determine which special cases apply to your target $\tilde{P}$, so you know which algorithms are possible.
There is no "best" algorithm, so you will need to benchmark your problem against the available methods.
Although it is instructive (and fun!) to implement simple update rules yourself, for serious work you should generally let someone else do the hard work for you by using an existing package.
Which package should you use?
The essence of variational inference (VI) is to first define a family of PDFs that balance two competing criteria:
We then select the family member that is "closest" to the target.
In a Bayesian context, our target PDF is a posterior distribution.
(However VI is a more general technique for finding approximate PDFs.)
VI relies on a concept of "closeness" between two PDFs $q(\theta)$ and $p(\theta)$.
VI traditionally uses the Kullback Leibler (KL) divergence to measure the "closeness" of PDFs $q(\theta)$ and $p(\theta)$:
$$ \boxed{ \text{KL}( q \parallel p ) \equiv \int d\theta\, q(\theta)\, \log\frac{q(\theta)}{p(\theta)}} $$KL divergence is not symmetric: $\text{KL}(q\parallel p) \neq \text{KL}(p\parallel q)$.
$\text{KL}(q\parallel p) = 0$ when $p=q$.
KL divergence is not bounded from above:
KL divergence is always $\ge 0$:
Consider the distribution $p$ and the two approximating distributions $q_1$ and $q_2$.
Which one of the 2 distributions minimizes forward KL ($\text{KL}(p\parallel q)$), and which one minimizes reverse KL ($\text{KL}(q\parallel p)$)?
(We focused on reverse KL before.)
KL_best_fit()
Forward KL ($\text{KL}(p\parallel q)$) chooses a $q$ that covers all of $p$:
KL_best_fit(('p', 'q_2'), ('p', 'q_1'))
Reverse KL ($\text{KL}(q\parallel p)$) chooses a $q$ that fits inside of $p$:
KL_best_fit(('q_1', 'p'), ('q_2', 'p'))
Since $q$ is a PDF, KL divergence can also be written as a difference of expectation values over $q$: $$ \text{KL}( q \parallel p ) = \langle \log q(\theta)\rangle_q - \langle \log p(\theta)\rangle_q \; . $$
This turns out to be very useful since it allows KL divergence to be numerically estimated using samples from $q$.
We call this Stochastic Variational Inference (SVI).
We use the reverse KL for SVI because we generally cannot sample from $p$ but can pick $q$ that is easy to sample.
The KL divergence is a generic method to find the parameterized PDF $q(\theta,s)$ that "best" approximates some target PDF $p(\theta)$.
For Bayesian inference, the $p$ we care about is the posterior: $$ p(\theta) = P(\theta\mid D) = \frac{P(D\mid \theta)\, P(\theta)}{P(D)} \; . $$
Since we generally cannot calculate the evidence $P(D)$, a practical inference method should not require that we know its value.
The variational Bayesian inference method has three steps:
The main tradeoff is in picking the approximate PDF family $q$.
A more flexible choice will generally do a better job of approximating the true posterior, but also require more difficult calculations.
($\theta$ are called latent variables and $s$ variational parameters)
Two cases for approximating a distribution $p^{*}$ with a distribution $q$ in a family of distributions $Q$:
Plugging the posterior into the KL definition, we can rewrite: $$ \begin{aligned} \text{KL}(q\parallel p) &= \int d\theta\, q(\theta) \left[\log P(D) + \log\frac{q(\theta)}{P(\theta)} - \log P(D\mid\theta) \right] \\ &= \log P(D) + \text{KL}(q\parallel P(\theta)) - \int d\theta\, q(\theta) \log P(D\mid\theta) \; . \end{aligned} $$
The three terms on the right-hand side are:
Since $q$ is normalized, it can only increase the weight of certain $\theta$ values by decreasing the weight of others, so there is a competition between the last two terms.
Let the negative evidence lower bound (ELBO) be the KL divergence $\text{KL}(q\parallel p)$ up to the constant $\log P(D)$:
$$ \boxed{ -\mathrm{ELBO}(q) \equiv \text{KL}(q\parallel P(\theta)) - \int d\theta\, q(\theta) \log P(D\mid\theta) } $$If we want to minimize $\text{KL}(q\parallel p)$, we only need to minimize $-\text{ELBO}(q)$, or maximize $\text{ELBO}(q)$), which does not depend on the evidence $P(D)$. This property is what makes VI practical for Bayesian inference.
The ELBO can also be evaluated in terms of expectation values,
$$ \begin{aligned} \mathrm{ELBO}(q) &= \int d\theta\, q(\theta) \log P(D\mid\theta) - \text{KL}(q\parallel P(\theta)) \\ &= \int d\theta\, q(\theta) \log P(D\mid\theta) - \int d\theta\, q(\theta) \log\frac{q(\theta)}{P(\theta)}\\ &= \int d\theta\, q(\theta) \log \left[ P(D\mid\theta) P(\theta)\right] - \int d\theta\, q(\theta) \log q(\theta)\\ &= \langle \log \left[ P(D\mid\theta) P(\theta)\right]\rangle_q - \langle \log q\rangle_q \end{aligned} $$The practical significance of this fact is that we can estimate the ELBO using averages of known quantities calculated with (finite) samples drawn from $q$ (Monte Carlo integration with importance sampling).
Suppose we observe 1D data $x$ that we model with a double exponential (aka Laplacian) likelihood and one unknown location parameter $\theta$: $$ P(x\mid \theta) = \frac{1}{2}\, e^{-|x - \theta|} \; . $$
The resulting likelihood for observations $D = \{x_i\}$ is: $$ P(D\mid\theta) = \prod_i P(x_i\mid\theta) \; , $$
Assume a Gaussian prior. The corresponding posterior $$ P(\theta\mid D) = \frac{P(D\mid\theta)\, P(\theta)}{P(D)} $$
is roughly Gaussian, so we use a family $q$ of Gaussians to approximate it, which have $\mu=0$ fixed, and $\sigma = s$ varying.
plot_ELBO(q='norm', q_scale_range=[0.05, 0.15],
likelihood='laplace', prior='norm',
theta_range=[-0.6, +0.6], n_data=100)
MCMC with Metroplis-Hastings updates can be used as a black box for an arbitrary inference problem that only requires that you can calculate your likelihood $P(D\mid \theta)$ and prior $P(\theta)$ for arbitrary parameter values $\theta$.
VI, on the other hand, generally requires more work to setup for a particular problem, but is then often more computationally efficient since it replaces sampling with optimization.
A necessary step in any VI inference is to select an approximating family $q$, and this generally requires knowledge of the particular problem and some judgment on how to tradeoff calculational convenience against approximation error.
Once you selected a family $q(\theta; s)$ that is explored by some $s$, you need to be able to:
The first step either requires an analytic integral over $\theta$ or a sufficiently accurate numerical approximation to the ELBO integral.
For the second step, we can use standard numerical optimization methods (usually requiring us to evaluate derivatives of $q(s)$ with respect to $s$).
Recall that we use the reverse KL for VI since $q$ is (by choice) easy to sample from:
$$ \text{KL}( q \parallel p ) = \langle \log q(\theta)\rangle_q - \langle \log p(\theta)\rangle_q \; . $$Contrast with MCMC where the algorithm provides samples of $p$ that allow us to estimate the forward KL:
$$ \text{KL}( p \parallel q ) = \langle \log p(\theta)\rangle_p - \langle \log q(\theta)\rangle_p \; . $$Therefore:
(but once you have MCMC samples from $p$, we should use them directly instead of summarizing with an approximate $q$)