Bayesian Inference
Apr 15th, 2013

Suppose we have a probabilistic model with observations $\boldsymbol x$ and hidden variables $\boldsymbol z$. We want to infer the hidden variables $\boldsymbol z$ that best explain the observations $\boldsymbol x$. In the Bayesian framework, we treat model parameters as hidden random variables as well.
Maximum Likelihood
The most common way to perform inference is to compute the maximum likelihood (ML) estimate
\[\hat{\boldsymbol z}_{\text{ML}} = \mathop{\arg\max}_{\boldsymbol z} p(\boldsymbol x | \boldsymbol z)\]
which is the set of hidden variables that maximizes the likelihood $p(\boldsymbol x | \boldsymbol z)$ of the observed data.
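As a toy illustration, here is a minimal sketch of ML estimation for a coin-flip model, where the hidden variable $\boldsymbol z$ is the coin’s bias $\theta$ and $\boldsymbol x$ is a sequence of observed flips. The data and the use of SciPy’s bounded scalar optimizer are illustrative choices, not part of the model above.

```python
# Minimal sketch: maximum likelihood for a Bernoulli (coin-flip) model.
# Hypothetical data; 1 = heads, 0 = tails.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def neg_log_likelihood(theta):
    # -log p(x | theta) for i.i.d. Bernoulli observations
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Maximize the likelihood by minimizing its negative log.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_ml = res.x  # matches the closed-form answer x.mean() for this model
print(theta_ml)   # ~0.75
```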
Maximum a posteriori
A Bayesian approach assumes some prior knowledge about the hidden variables: a distribution $p(\boldsymbol z)$ over the space of hidden variables $\boldsymbol z$. The likelihood $p(\boldsymbol x | \boldsymbol z)$ gets weighted by this prior knowledge $p(\boldsymbol z)$ to give the posterior $p(\boldsymbol z | \boldsymbol x)$:
\[\text{posterior} \ \propto \ \text{likelihood} \ \times \ \text{prior} \qquad \text{or here} \qquad p(\boldsymbol z | \boldsymbol x) \ \propto \ p(\boldsymbol x | \boldsymbol z) \ \times \ p(\boldsymbol z).\]
The maximum a posteriori (MAP) estimate
\[\hat{\boldsymbol z}_{\text{MAP}} = \mathop{\arg\max}_{\boldsymbol z} p(\boldsymbol x | \boldsymbol z) p(\boldsymbol z) = \mathop{\arg\max}_{\boldsymbol z} p(\boldsymbol z | \boldsymbol x)\]
is the set of hidden variables that maximizes the posterior. The main advantage of this approach is that in order to maximize $p(\boldsymbol z | \boldsymbol x)$, we don’t need to compute the normalizing constant
\[p(\boldsymbol x) = \int_{\boldsymbol z} p(\boldsymbol x | \boldsymbol z) p(\boldsymbol z) \, \text{d}\boldsymbol z\]
whereas it is needed for the full Bayesian approach, as we will see in the next section. This normalizer can become intractable as models get more complicated.
The MAP method is therefore sometimes described as the poor man’s Bayesian inference [Tzikas08], since it is a way of including prior knowledge without having to pay the expensive price of computing $p(\boldsymbol x)$.
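Continuing the coin-flip sketch from above, MAP estimation only needs the unnormalized posterior, so $p(\boldsymbol x)$ never has to be computed. The Beta prior and its hyperparameters below are illustrative assumptions, not part of the original post.

```python
# Minimal sketch: MAP estimation for the coin-flip model with a Beta prior.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a, b = 2.0, 2.0  # hypothetical Beta prior, mildly favouring a fair coin

def neg_log_unnormalized_posterior(theta):
    log_lik = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    log_prior = beta.logpdf(theta, a, b)
    return -(log_lik + log_prior)  # p(x) is a constant and drops out of the arg max

res = minimize_scalar(neg_log_unnormalized_posterior,
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x  # closed form: (x.sum() + a - 1) / (len(x) + a + b - 2) = 0.7
print(theta_map)
```

With more data the prior’s influence fades and the MAP estimate approaches the ML estimate.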
Full Bayesian approach
In the full Bayesian approach, we want not just a single estimate $\hat{\boldsymbol z}$ of the hidden variables, but the entire distribution over them: the posterior distribution $p(\boldsymbol z | \boldsymbol x)$.
Every time we pick one specific value for a parameter in a model, we are making an approximation. As we compose models, these approximations get amplified at every layer. Feeding the entire parameter distribution into the next layer, instead of a point estimate, therefore increases the value that layer adds to the model.
In this approach, however, the normalizer $p(\boldsymbol x)$ is needed, and it becomes intractable as the model gets more complex, making it impossible to compute the posterior distribution exactly. Often, the best we can do is approximate it.
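To make this concrete, here is a minimal sketch of the full Bayesian treatment of the same toy coin-flip model, approximating the whole posterior on a grid. In one dimension the evidence integral is a cheap sum; in realistic models it is exactly this normalizer that becomes intractable, which is what motivates approximate inference methods.

```python
# Minimal sketch: full Bayesian posterior for the coin-flip model on a grid.
import numpy as np
from scipy.stats import beta

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a, b = 2.0, 2.0  # same hypothetical Beta prior as above

grid = np.linspace(1e-6, 1 - 1e-6, 1000)
likelihood = grid ** x.sum() * (1 - grid) ** (len(x) - x.sum())
prior = beta.pdf(grid, a, b)

unnormalized = likelihood * prior
evidence = np.sum(unnormalized) * (grid[1] - grid[0])  # numerical stand-in for p(x)
posterior = unnormalized / evidence                    # a full distribution over theta

# This conjugate model has the exact posterior Beta(a + sum(x), b + n - sum(x)),
# which the grid approximation should closely match.
exact = beta.pdf(grid, a + x.sum(), b + len(x) - x.sum())
print(np.max(np.abs(posterior - exact)))  # small discrepancy
```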