Practical Bayesian Inference¶

Adrian Brasoveanu, 11/12/2019¶

Based on materials from:

"Practical" means that we will move pretty fast through the foundations needed for the tutorial one week from today.

Bayes Introduction / Review¶

What is a probability?

Frequentist: measures likelihood of different outcomes of an uncertain but infinitely repeatable process.

e.g., given a coin, measure probability of H using a large number of trials.

Bayesian: a probability can be assigned to any statement.

This necessarily requires some subjective input when you cannot simply measure frequencies.

Roughly speaking, the two extremes are:

objective probabilities of uninteresting statements.
subjective probabilities of interesting statements.

In practice, any sensible methodology gives consistent answers with sufficiently informative data.

Whatever you consider your valid probabilities, $P(A)$ and $P(B)$, for (outcomes or statements) $A$ and $B$, you can always define the conditional probabilities:

$$ P(A\mid B) \equiv \frac{P(A\,\text{and}\,B)}{P(B)} \quad,\quad P(B\mid A) \equiv \frac{P(A\,\text{and}\,B)}{P(A)} $$

($A$ and $B$ are random variables or logical propositions about possible outcomes)

Bayes' rule is then just a consequence of this definition and entirely uncontroversial:

$$ \boxed{ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} } $$

Bayesian statistics starts from a joint probability distribution

$$ P(D, \Theta_M, M) $$

over data $D$, model parameters $\Theta_M$ and hyperparameters $M$.

The subscript on $\Theta_M$ is to remind us that, in general, the set of parameters being used depends on the hyperparameters.

Bayesian inference consists of evaluating:

$$ P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$

posterior: $P(\Theta_M\mid D, M)$
likelihood: $P(D\mid \Theta_M, M)$
prior: $P(\Theta_M\mid M)$
evidence (marginal likelihood): $P(D\mid M)$
data $D$ ~ features, parameters $\Theta_M$ ~ latent vars; we always condition on the hyperparameters $M$.

Example: infer the rate $\theta$ of a binomial process, e.g.:

the probability of heads for a coin
the probability of a student correctly answering a True/False question on a test
the probability that a diagnostic test will be positive

Let's fix:

the prior: $ \theta \sim \text{Beta}(1, 1)$ (uniform prior)
the likelihood $k\sim\text{Binomial}(\theta, n)$

Assume we get $k = 7$ successes out of $n = 10$ trials.

In [3]:

bayes_learn(k=7, n=10, prior='flat')

With a "flat prior", posterior = likelihood. The frequentist and Bayesian approaches give the same results.

What we learn after one failure, given a flat prior:

In [4]:

bayes_learn(k=0, n=1, prior='flat')

What we learn after one success, given a flat prior:

In [5]:

bayes_learn(k=1, n=1, prior='flat')

More realistically, we have prior information about the process. For example, we believe that success is relatively unlikely:

In [6]:

bayes_learn(k=7, n=10, prior='unlikely')

the posterior is a weighted average of the prior and the likelihood

And here are the one-failure and one-success updates with an informative prior:

In [7]:

bayes_learn(k=0, n=1, prior='unlikely')

In [8]:

bayes_learn(k=1, n=1, prior='unlikely')

Why is Bayesian inference hard?¶

We are interested in the posterior: $$ P(\Theta_M\mid D, M) = \frac{P(D\mid \Theta_M, M)\,P(\Theta_M\mid M)}{P(D\mid M)} $$

The likelihood and prior are generally computable, but the normalization constant $P(D\mid M)$ requires an integral over all possible parameters: $$ P(D\mid M) = \int d\Theta_M'\, P(D\mid \Theta_M', M)\,P(\Theta_M'\mid M) $$

In rare cases, the normalization constant can be calculated analytically.

In most cases, an approximate method is required. Two good options:

Markov-Chain Monte Carlo (MCMC), dating back to the 1950s; the widely used Hamiltonian Monte Carlo dates back to the late 1980s.
Variational Inference (VI), dating back to (at least) the 1970s.

The underlying assumptions and numerical algorithms involved (sampling and optimization) are fundamentally different, leading to different trade-offs between these methods.

Method 1: Analytical computation¶

Certain combinations of likelihood and prior functions give an evidence integral that can be performed analytically.

In some special cases, the resulting posterior has the same general functional form as the prior. We then say that the prior and likelihood are conjugate.

There are only a few useful cases like this, but they include some important distributions, e.g.

Gaussian + Gaussian
Poisson + Gamma
Bernoulli + Beta
Categorical + Dirichlet

The full list is here.

Method 2: Markov-Chain Monte Carlo¶

Markov-Chain Formalism¶

A stochastic process is basically a black-box generator of random sequences $$ x_0, x_1, x_2, \ldots $$

In general, the value $x_n$ depends on the history of all previous samples, so the black box has long-term memory.

A (first-order) Markov chain is a special case where $x_n$ only depends directly on $x_{n-1}$: a black box with very short-term memory.

A (time) homogeneous Markov chain uses the same 'rule' $P(X_n\mid X_{n-1})$ to generate $x_n$ from $x_{n-1}$ for all $n$.

A (time) reversible Markov chain has the same probabilistic 'rule' going in either direction, $P(X_n\mid X_{n-1}) = P(X_{n-1}\mid X_n)$.

In practice, a Markov chain is fully specified by two probabilistic 'rules':

$P(X_0)$: generate an initial value $x_0$.
$P(X_n\mid X_{n-1})$: Generate the next value $x_n$ from the previous value $x_{n-1}$.

Under certain conditions, a Markov chain converges to a unique equilibrium (limiting stationary) distribution, which can be thought of as the long-run distribution of the chain that can be reached no matter where we start.

For a reversible Markov chain, it is easier to check that it has an equilibrium distribution, and design the transition 'rule' $P(X_n\mid X_{n-1})$ so that the chain has a specific equilibrium distribution, namely:

the posterior distribution we want to estimate.

Let's define a reversible Markov chain, which is also an example of MCMC (from Kruschke's Doing Bayesian Data Analysis). The story goes as follows.

A politician visits a chain of islands as part of their political campaign.

the politician wants to visit each island in proportion to its population, i.e., visit islands with higher populations more often;
the politician does not know the population distribution, i.e., the proportion of the total population that lives on each island, because the politician does not know the total population of the islands (the normalizing constant);
but the politician can find out the (unnormalized) population of the island they are currently on, and the (unnormalized) population of adjacent islands.

Assume the politician starts on a random island, so $P(X_0)$ is discrete uniform.

The politician uses the following rule, that is, $P(X_n\mid X_{n-1})$, to determine which island to visit next:

each day, the politician chooses a neighboring island randomly, and compares the population there with the population of the current island;
if the neighboring island has a larger population, the politician goes there;
if the neighboring island has a smaller population, then the politician visits with probability $p=\frac{population_{neighbor}}{population_{current}}$; otherwise the politician stays on the same island.

After doing this for many days:

the politician ends up spending time on each island proportional to the population of each island
this is the (limiting) stationary distribution of the Markov chain

Thus, the chain estimates the distribution of island populations correctly.

Let's generate 10 islands with populations randomly generated between $10$ and $100$:

integers only, i.e., we sample from a discrete uniform

Assume this is population in thousands.

we bound the chain of islands with $0$s on either side
this ensures we always stay on the bounding islands whenever we try to move outside the chain of islands

In [10]:

islands = make_islands(10)
islands

Out[10]:

array([ 0, 93, 44, 32, 12, 45, 90, 22, 55, 73, 99,  0])

In [11]:

(islands/islands.sum()).round(2)[1:-1]

Out[11]:

array([0.16, 0.08, 0.06, 0.02, 0.08, 0.16, 0.04, 0.1 , 0.13, 0.18])

In [12]:

plot_islands(islands)

Let's randomly choose a starting island:

In [13]:

starting_island = np.random.randint(10)
starting_island

Out[13]:

Let's simulate 10000 days of our politician's campaign trail:

In [14]:

samples = hop(islands, start=starting_island, niter=10000)
samples[:50]

Out[14]:

array([ 6,  5,  5,  5,  6,  5,  5,  5,  6,  6,  6,  5,  5,  6,  7,  6,  6,
        5,  5,  5,  5,  6,  6,  6,  6,  6,  5,  6,  6,  6,  6,  6,  6,  6,
        5,  5,  6,  7,  8,  9, 10,  9, 10,  9, 10,  9, 10,  9,  8,  7])

Let's get the counts and the normalized counts (empirical probabilities) for each island:

In [15]:

data = np.bincount(samples)[1:]; print(data)
data = data/data.sum(); data.round(2)

[1811  874  605  218  830 1595  397  967 1212 1492]

Out[15]:

array([0.18, 0.09, 0.06, 0.02, 0.08, 0.16, 0.04, 0.1 , 0.12, 0.15])

In [16]:

sns.barplot(x=np.arange(1, len(data)+1), y=data);

The influence of the starting point is visible for the first 500 samples or so ...

In [17]:

data_first_500 = np.bincount(samples[:500])[1:]; print(data_first_500)
data_first_500 = data_first_500/data_first_500.sum(); print(data_first_500.round(2))
sns.barplot(x=np.arange(1, len(data)+1), y=data_first_500);

[156  89  58  14  54  82  11  16  13   7]
[0.31 0.18 0.12 0.03 0.11 0.16 0.02 0.03 0.03 0.01]

... but wanes fairly quickly after that:

In [18]:

data_next_1000 = np.bincount(samples[500:1500])[1:]; print(data_next_1000)
data_next_1000 = data_next_1000/data_next_1000.sum(); print(data_next_1000.round(2))
sns.barplot(x=np.arange(1, len(data)+1), y=data_next_1000);

[230 108  89  34 113 192  44  67  57  66]
[0.23 0.11 0.09 0.03 0.11 0.19 0.04 0.07 0.06 0.07]

For practical applications, there are two issues to deal with:

There is no way to know in advance how big $n$ needs to be to achieve equilibrium.
Given a Markov chain with an equilibrium distribution, we can generate samples from it, but how do we build a chain to sample a specific distribution $\tilde{P}(X)$?

The second issue requires solving an inverse problem, which is generally challenging.

However, there is a general family of methods to build a Markov chain with a desired probability density:

Metropolis-Hastings-Green
- Metropolis-Hastings
  - Metropolis
  - Gibbs
  - Hamiltonian

MHG is the most general, but the simpler MH contains the essential ideas so we will focus on that.

Metropolis-Hastings Updates¶

The MH algorithm relies on a proposal distribution $Q$ that is easier to sample than $\tilde{P}$.

(If you knew how to sample $\tilde{P}$ directly, you would not need MCMC!)

We often use a multivariate Gaussian for $Q$ since it is easy (and efficient) to sample from. Any proposal distribution is valid, but choosing a $Q$ "closer" to $\tilde{P}$ generally reaches the desired equilibrium faster.

The proposal distribution can either be used to update relative to the current state ("random walk"): $$ X_{n+1} - X_n \sim Q $$ or else to generate a new independent state each time: $$ X_{n+1} \sim Q \; . $$

During each update we evaluate a proposed move to $x_{n+1}$ by calculating the Hastings ratio, $$ r(x_{n+1}, x_n) \equiv \frac{\tilde{P}(x_{n+1})}{\tilde{P}(x_n)}\, \frac{Q(x_n\mid x_{n+1})}{Q(x_{n+1}\mid x_n)} \; , $$ where $\tilde{P}$ is the desired equilibrium distribution.

Since $\tilde{P}$ only appears in a ratio, it does not need to be normalized: this feature is why MCMC is useful for practical Bayesian inference.

MCMC still requires that you can calculate un-normalized values of $\tilde{P}(x)$.

$$ r(x_{n+1}, x_n) \equiv \frac{\tilde{P}(x_{n+1})}{\tilde{P}(x_n)}\, \frac{Q(x_n\mid x_{n+1})}{Q(x_{n+1}\mid x_n)} $$

In general, the Hastings ratio is $\ge 0$ but it can otherwise be arbitrarily large.

We always accept a proposed move when $r(x_{n+1}, x_n) \ge 1$. Otherwise, we accept it with a probability of $0\le r(x_{n+1}, x_n) < 1$.

When a proposed move is rejected, the update returns the original value, so repetitions of the same output will occur.

When the proposal distribution $Q$ is symmetric:

$Q(x_{n+1}\mid x_n) = Q(x_n\mid x_{n+1})$

e.g, uniform or Gaussian, the Hastings ratio simplifies to:

$r(x_{n+1}, x_n) \equiv \frac{\tilde{P}(x_{n+1})}{\tilde{P}(x_n)}$

This results in the simpler Metropolis algorithm.

Here's a generic implementation of the Metropolis algorithm:

In [19]:

def metropolis(start, target, proposal, niter):
    """start: starting position, target: the target distribution,
    proposal: the proposal distribution, niter: number of iterations"""
    current = start
    samples = [current] # recording starting value
    for i in range(niter):
        proposed = proposal(current) # propose new value
        p = min(target(proposed)/target(current), 1) # acceptance probability
        if np.random.random() < p: # accept or reject proposal
            current = proposed # if accepted, update current value
        samples.append(current) # record current value
    return samples

Applied to the politician's campaign trail:

In [20]:

target = lambda x: islands[x] # unnormalized distribution
proposal = lambda x: x + np.random.choice([-1, 1]) # symmetric
samples = metropolis(starting_island, target, proposal, 10000)
data = np.bincount(samples)[1:]; data = data/data.sum()
sns.barplot(x=np.arange(len(data)), y=data);

A more realistic example: Line Fit with MCMC¶

Consider fitting a straight line $y = m x + b$, with parameters $m$ and $b$, to data with two features $x$ and $y$.

Use the log-likelihood function:

$$ \log{\cal L}(m, b; D) = -\frac{N}{2}\log(2\pi\sigma_y^2) -\frac{1}{2\sigma_y^2} \sum_{i=1}^N\, (y_i - m x_i - b)^2 \; , $$

where the error in $y$, $\sigma_y$, is a fixed hyperparameter.

First generate some data on a straight line with measurement errors in $y$ (so our assumed model is correct):

In [21]:

gen = np.random.RandomState(seed=123)
N, m_true, b_true, sigy_true = 10, 0.5, -0.2, 0.1
x_data = gen.uniform(-1, +1, size=N)
y_data = m_true * x_data + b_true + gen.normal(scale=sigy_true, size=N)

plt.errorbar(x_data, y_data, sigy_true, fmt='o', markersize=5)
plt.plot([-1, +1], [-m_true+b_true,+m_true+b_true], 'r:')
plt.xlabel('x'); plt.ylabel('y');

Next, implement the log-likelihood function: $$ \log{\cal L}(m, b; D) = -\frac{N}{2}\log(2\pi\sigma_y^2) -\frac{1}{2\sigma_y^2} \sum_{i=1}^N\, (y_i - m x_i - b)^2 \; , $$

In [22]:

def loglike(x, y, m, b, sigy):
    norm = 0.5 * len(x) * np.log(2 * np.pi * sigy**2)
    return -0.5 * np.sum((y - m * x - b) ** 2) / sigy**2 - norm

Finally, generate some MCMC samples of the posterior $P(m, b\mid D, M)$ assuming uniform priors $P(b,m\mid \sigma_y) = 1$:

In [26]:

samples = MCMC(loglike, m=[m_true], b=[b_true]);
jointplot('m', 'b', samples)

We always require a starting point to generate MCMC samples. In the last example, we used the true parameter values as starting points:

m=[m_true], b=[b_true]

What happens if you chose different starting points?

In [27]:

samples2 = MCMC(loglike, m=[m_true+0.1], b=[b_true+0.1])

In [28]:

samples.describe(percentiles=[])

Out[28]:

	m	b
count	10000.000000	10000.000000
mean	0.531518	-0.163364
std	0.074150	0.032178
min	0.277887	-0.271737
50%	0.533564	-0.162615
max	0.799155	-0.039917

In [29]:

samples2.describe(percentiles=[])

Out[29]:

	m	b
count	10000.000000	10000.000000
mean	0.526889	-0.164017
std	0.071437	0.033185
min	0.287060	-0.263188
50%	0.524377	-0.164970
max	0.800168	-0.051169

The changes are small compared with the offsets ($\pm 0.1$) and the parameter uncertainties.

The inference above assumes flat priors for $m$ and $b$, but you can add any log-prior to our log-likelihood to change this.

For example, suppose our prior belief is that $0.4 \le m \le 0.7$:

In [31]:

samples = MCMC(loglike, m=[m_true, TopHat(0.4,0.7)], b=[b_true])
jointplot('m', 'b', samples)

Perhaps we also have a prior measurement that found $b = -0.20 \pm 0.02$ (in which case, the new data is not adding much information about $b$):

In [32]:

samples = MCMC(loglike,
               m=[m_true,TopHat(0.4,0.7)],
               b=[b_true,Gauss(-0.20,0.02)])
jointplot('m', 'b', samples)

Practical Advice for MCMC¶

It is tempting to assume that MCMC samples have desirable properties beyond their minimum guarantees, since this is often true, but avoid this temptation.

MCMC samples are only guaranteed to sample your target $\tilde{P}$ after some unknown number of samples:

Burn-in? Should I throw away the first $B$ samples to ensure that my chain is independent of its initial starting point?

No: There is no practical way to know how big $B$ should be. Instead, ensure that your starting point is reasonably probable (according to $\tilde{P}$) and use all samples. If you do not know how to chose a reasonably probably starting point, you need to solve a separate optimization problem before you are ready to use MCMC (which is notoriously inefficient at discovering new regions of high probability).

There are no useful guarantees about the correlation between $X_{n+k}$ and $X_n$ being small and, in general, you should assume that the consecutive samples in any stretch of the chain are highly correlated.

Thinning? Should I just keep every $T$-th sample so that my chain is uncorrelated?

No: There is no practical way to know in advance how big $T$ should be, and you can never get a better answer (for a fixed amount of computation) by throwing away valid information. Just accept that samples are correlated.

How long should your chain be?

You should ideally use empirical measurements to determine $k$ such that the autocorrelation $$ \frac{\langle (X_{n+k} - \mu) (X_n - \mu)\rangle}{\sigma^2} \simeq 0 \; , $$ where $\mu$ and $\sigma$ are the long-term mean and standard deviation of $\tilde{P}(X_n)$, then generate a chain whose length is at least 10-100 times this autocorrelation length $k$.

For Gibbs samplers, the Gelman & Rubin ($\hat{R}$) metric is also popular.

Which update rule should you use?

Determine which special cases apply to your target $\tilde{P}$, so you know which algorithms are possible.

Can you efficiently sample from a complete set of conditional distributions? If so, add Gibbs sampling to your list.
Can you compute all partial derivatives? If so, add Hamiltonian MC to your list.

There is no "best" algorithm, so you will need to benchmark your problem against the available methods.

Although it is instructive (and fun!) to implement simple update rules yourself, for serious work you should generally let someone else do the hard work for you by using an existing package.

Which package should you use?

Stan and PyMC3 (based on Theano) are the most widely used nowadays; they both implement HMC etc.; if you want to learn how to do Bayesian inference well for realistically complex models, you'll probably end up using both of them;
Edward (based on tensorflow) or Pyro (based on PyTorch) are more recent additions; possibly the next gen b/c of ease of GPU usage.

Method 3: Variational Inference¶

The essence of variational inference (VI) is to first define a family of PDFs that balance two competing criteria:

convenient for calculations, and
flexible enough to approximately match some unknown target PDF.

We then select the family member that is "closest" to the target.

In a Bayesian context, our target PDF is a posterior distribution.

(However VI is a more general technique for finding approximate PDFs.)

Kullback-Leibler Divergence¶

VI relies on a concept of "closeness" between two PDFs $q(\theta)$ and $p(\theta)$.

VI traditionally uses the Kullback Leibler (KL) divergence to measure the "closeness" of PDFs $q(\theta)$ and $p(\theta)$:

$$ \boxed{ \text{KL}( q \parallel p ) \equiv \int d\theta\, q(\theta)\, \log\frac{q(\theta)}{p(\theta)}} $$

KL divergence is not symmetric: $\text{KL}(q\parallel p) \neq \text{KL}(p\parallel q)$.

exchanging $q$ and $p$ in the integrand changes its value. This makes KL divergence an unusual measure of separation and means that it is not a true metric.

$\text{KL}(q\parallel p) = 0$ when $p=q$.

this is what we would expect for a useful measure of separation.

KL divergence is not bounded from above:

a PDF is always $\ge 0$ but not bounded from above.

KL divergence is always $\ge 0$:

nothing prevents $q < p$, so the integrand can be negative (due to the log) even with $p, q \ge 0$
but the integral is always $\ge 0$; the proof relies on Jensen's inequality.

Consider the distribution $p$ and the two approximating distributions $q_1$ and $q_2$.

Which one of the 2 distributions minimizes forward KL ($\text{KL}(p\parallel q)$), and which one minimizes reverse KL ($\text{KL}(q\parallel p)$)?

(We focused on reverse KL before.)

In [34]:

KL_best_fit()

Forward KL ($\text{KL}(p\parallel q)$) chooses a $q$ that covers all of $p$:

the difference between $p(\theta)$ and $q(\theta)$ is weighted by $p(\theta)$
when $p(\theta)=0$, it does not matter what the other term of the integrand is
- ok to have large differences between $p(\theta)$ and $q(\theta)$
- during optimization, $q(\theta)$ is ignored whenever $p(\theta) = 0$
if $p(\theta) > 0$, then the other term of the integrand contributes to the KL value, so during optimization, the difference between $p(\theta)$ and $q(\theta)$ will be minimized
forward KL is $0$ avoiding: it avoids regions where $q(\theta) = 0$ whenever $p(\theta) > 0$

In [35]:

KL_best_fit(('p', 'q_2'), ('p', 'q_1'))

Reverse KL ($\text{KL}(q\parallel p)$) chooses a $q$ that fits inside of $p$:

now $q(\theta)$ is the weight, so if $q(\theta)=0$, there is no penalty for ignoring the value $p(\theta)$
if $q(\theta)>0$, we minimize the difference between $p(\theta)$ and $q(\theta)$ during optimization
hence, it is better for $q$ to fit just some part of $p$, as long as $q$ is $0$ for the other parts of $p$
reverse KL is $0$ forcing: $q(\theta)$ is forced to be $0$ when $p(\theta) > 0$ because that reduces the KL integrand to $0$

In [36]:

KL_best_fit(('q_1', 'p'), ('q_2', 'p'))

Since $q$ is a PDF, KL divergence can also be written as a difference of expectation values over $q$: $$ \text{KL}( q \parallel p ) = \langle \log q(\theta)\rangle_q - \langle \log p(\theta)\rangle_q \; . $$

This turns out to be very useful since it allows KL divergence to be numerically estimated using samples from $q$.

We call this Stochastic Variational Inference (SVI).

We use the reverse KL for SVI because we generally cannot sample from $p$ but can pick $q$ that is easy to sample.

The KL divergence is a generic method to find the parameterized PDF $q(\theta,s)$ that "best" approximates some target PDF $p(\theta)$.

For Bayesian inference, the $p$ we care about is the posterior: $$ p(\theta) = P(\theta\mid D) = \frac{P(D\mid \theta)\, P(\theta)}{P(D)} \; . $$

Since we generally cannot calculate the evidence $P(D)$, a practical inference method should not require that we know its value.

The variational Bayesian inference method has three steps:

Define a family of PDFs $q(\theta; s)$ that approximate the true posterior $P(\theta\mid D)$.
Use optimization to find the value $s^\ast$ that, according to the KL divergence, best approximates the true posterior.
Use $q(\theta; s=s^\ast)$ as an approximation of the true posterior for calculating posterior expectations etc.

The main tradeoff is in picking the approximate PDF family $q$.

A more flexible choice will generally do a better job of approximating the true posterior, but also require more difficult calculations.

($\theta$ are called latent variables and $s$ variational parameters)

Two cases for approximating a distribution $p^{*}$ with a distribution $q$ in a family of distributions $Q$:

$p^{*}\notin Q$ (the most common case in practice)
$p^{*}\in Q$.

Plugging the posterior into the KL definition, we can rewrite: $$ \begin{aligned} \text{KL}(q\parallel p) &= \int d\theta\, q(\theta) \left[\log P(D) + \log\frac{q(\theta)}{P(\theta)} - \log P(D\mid\theta) \right] \\ &= \log P(D) + \text{KL}(q\parallel P(\theta)) - \int d\theta\, q(\theta) \log P(D\mid\theta) \; . \end{aligned} $$

The three terms on the right-hand side are:

The log of the evidence $P(D)$
The KL divergence of $q(\theta)$ with respect to the prior $P(\theta)$
The $q$-weighted log-likelihood of the data

$$ \text{KL}(q\parallel p) = \log P(D) + \text{KL}(q\parallel P(\theta)) - \int d\theta\, q(\theta) \log P(D\mid\theta) $$

The log of the evidence $\log P(D)$ is a constant offset in the sum, independent of $q$
The KL divergence term $\text{KL}(q\parallel P(\theta))$ is minimized when $q(\theta) = P(\theta)$, i.e., it drives $q$ to look like the prior
The log-likelihood term is minimized when $q(\theta)$ prefers parameters $\theta$ that explain the data
- this is the cross entropy of $q(\theta)$ and $P(D\mid\theta)$ (a familiar loss function), which is minimized when they are equal

Since $q$ is normalized, it can only increase the weight of certain $\theta$ values by decreasing the weight of others, so there is a competition between the last two terms.

exactly what we need for a useful learning rule that balances prior knowledge with the information gained from new data

Let the negative evidence lower bound (ELBO) be the KL divergence $\text{KL}(q\parallel p)$ up to the constant $\log P(D)$:

$$ \boxed{ -\mathrm{ELBO}(q) \equiv \text{KL}(q\parallel P(\theta)) - \int d\theta\, q(\theta) \log P(D\mid\theta) } $$

If we want to minimize $\text{KL}(q\parallel p)$, we only need to minimize $-\text{ELBO}(q)$, or maximize $\text{ELBO}(q)$), which does not depend on the evidence $P(D)$. This property is what makes VI practical for Bayesian inference.

The ELBO can also be evaluated in terms of expectation values,

$$ \begin{aligned} \mathrm{ELBO}(q) &= \int d\theta\, q(\theta) \log P(D\mid\theta) - \text{KL}(q\parallel P(\theta)) \\ &= \int d\theta\, q(\theta) \log P(D\mid\theta) - \int d\theta\, q(\theta) \log\frac{q(\theta)}{P(\theta)}\\ &= \int d\theta\, q(\theta) \log \left[ P(D\mid\theta) P(\theta)\right] - \int d\theta\, q(\theta) \log q(\theta)\\ &= \langle \log \left[ P(D\mid\theta) P(\theta)\right]\rangle_q - \langle \log q\rangle_q \end{aligned} $$

The practical significance of this fact is that we can estimate the ELBO using averages of known quantities calculated with (finite) samples drawn from $q$ (Monte Carlo integration with importance sampling).

Example¶

Suppose we observe 1D data $x$ that we model with a double exponential (aka Laplacian) likelihood and one unknown location parameter $\theta$: $$ P(x\mid \theta) = \frac{1}{2}\, e^{-|x - \theta|} \; . $$

The resulting likelihood for observations $D = \{x_i\}$ is: $$ P(D\mid\theta) = \prod_i P(x_i\mid\theta) \; , $$

Assume a Gaussian prior. The corresponding posterior $$ P(\theta\mid D) = \frac{P(D\mid\theta)\, P(\theta)}{P(D)} $$

is roughly Gaussian, so we use a family $q$ of Gaussians to approximate it, which have $\mu=0$ fixed, and $\sigma = s$ varying.

In [38]:

plot_ELBO(q='norm', q_scale_range=[0.05, 0.15],
          likelihood='laplace', prior='norm',
          theta_range=[-0.6, +0.6], n_data=100)

Practical Calculations with VI¶

MCMC with Metroplis-Hastings updates can be used as a black box for an arbitrary inference problem that only requires that you can calculate your likelihood $P(D\mid \theta)$ and prior $P(\theta)$ for arbitrary parameter values $\theta$.

VI, on the other hand, generally requires more work to setup for a particular problem, but is then often more computationally efficient since it replaces sampling with optimization.

A necessary step in any VI inference is to select an approximating family $q$, and this generally requires knowledge of the particular problem and some judgment on how to tradeoff calculational convenience against approximation error.

Once you selected a family $q(\theta; s)$ that is explored by some $s$, you need to be able to:

evaluate the -ELBO (proxy for KL) of $q(s)$ with respect to $p$ for any $s$, and
find the value of $s$ that minimizes the KL divergence.

The first step either requires an analytic integral over $\theta$ or a sufficiently accurate numerical approximation to the ELBO integral.

the expectation form of the ELBO provides a general-purpose numerical approximation (SVI): $$ \text{ELBO}(q) = \langle \log \left[ P(D\mid\theta) P(\theta)\right]\rangle_q - \langle \log q\rangle_q $$

For the second step, we can use standard numerical optimization methods (usually requiring us to evaluate derivatives of $q(s)$ with respect to $s$).

Recall that we use the reverse KL for VI since $q$ is (by choice) easy to sample from:

$$ \text{KL}( q \parallel p ) = \langle \log q(\theta)\rangle_q - \langle \log p(\theta)\rangle_q \; . $$

Contrast with MCMC where the algorithm provides samples of $p$ that allow us to estimate the forward KL:

$$ \text{KL}( p \parallel q ) = \langle \log p(\theta)\rangle_p - \langle \log q(\theta)\rangle_p \; . $$

Therefore:

VI finds $q$ that is closest to $p$ according to the reverse KL.
MCMC enables us to find $q$ that is closest to $p$ according to the forward KL.

(but once you have MCMC samples from $p$, we should use them directly instead of summarizing with an approximate $q$)