The Very Basics of Bayesian Neural Networks

By virtue of their ability to approximate any function ([1], [2]), Neural Network (NN) based architectures have achieved massive success in learning complex input-output mappings from data. However, mere knowledge of the input-output mapping falls short in many situations, especially those that require integrating prior beliefs into the model or where data is limited. Bayesian Neural Networks (BNNs) are NNs whose weights, or parameters, are expressed as distributions rather than deterministic values and are learned using Bayesian inference. Their innate potential to simultaneously learn complex non-linear functions from data and express uncertainty has lent them a major role in our pursuit of more capable AI. In this blog post, I will cover their significance in situations where traditional (deterministic) NNs fall short. I will also walk readers through the foundational concepts that are common to all flavours of BNN available today.

 

[Figure: Unlike deterministic neural networks (left), which have fixed values for their parameters, Bayesian neural networks (right) have a distribution defined over them.]

What deterministic NNs lack:
Deterministic NNs turn out to be inadequate in two broad situations:

  1. Where models need placeholders to integrate prior beliefs and to gauge the uncertainty in their predictions: Imagine the task of teaching an agent to drive autonomously on a two-lane road. Integrating the prior knowledge that driving should stay on the right side, within the road boundaries, will not only lead to faster learning convergence but will also preclude catastrophic behaviour during the early stages of learning. This is not something that is naturally doable with deterministic NNs.
  2. Under limited availability of data: NNs have a propensity to overfit to the data they see, which makes them extrapolate in unwarranted ways into unseen parts of the input space (see figure below). The problem is glaringly apparent when the seen data does not span the whole space of interest.
    [Figure: The red line shows the typical extrapolation by a NN. Ideally, it should also have predicted an uncertainty measure that grows at points farther from the seen data. Image from [4].]

Why deterministic NNs lack what they lack:
The probabilistic explanation behind these downsides is that a deterministic NN evaluates the maximum likelihood point estimate, or MLE, by maximizing the likelihood of the seen data given the parameters of the NN (denoted as \boldsymbol{w}), which is typically solved using backpropagation:

\boldsymbol{w}^{MLE} = \text{argmax}_{\boldsymbol{w}} \log P(\mathcal{D}|\boldsymbol{w})
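For concreteness, in a regression setting where each target is assumed Gaussian around the network output f_{\boldsymbol{w}}(x_i) with a fixed noise variance (the function notation f_{\boldsymbol{w}} and the Gaussian assumption are mine, added for illustration), the MLE objective reduces to the familiar squared-error loss:

\boldsymbol{w}^{MLE} = \text{argmax}_{\boldsymbol{w}} \sum_i \log \mathcal{N}\left(y_i \,|\, f_{\boldsymbol{w}}(x_i), \sigma^2\right) = \text{argmin}_{\boldsymbol{w}} \sum_i \left(y_i - f_{\boldsymbol{w}}(x_i)\right)^2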

Such an optimization leads to overfitting of the NN to the seen data, and hence it fails to generalize ([3]). One partial fix is, instead of evaluating the MLE, to evaluate the maximum a posteriori point estimate, or MAP, which makes the NN relatively more resistant to overfitting:

\boldsymbol{w}^{MAP} = \text{argmax}_{\boldsymbol{w}} \log P(\boldsymbol{w}|\mathcal{D})

Using a Gaussian prior is equivalent to L2 regularization, while using a Laplace prior is equivalent to L1 regularization. However, this neither guarantees against unwarranted extrapolation nor allows the integration of beliefs into the model, as explained in the section above on what deterministic NNs lack.
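To make the regularization connection explicit: by Bayes' rule, \log P(\boldsymbol{w}|\mathcal{D}) = \log P(\mathcal{D}|\boldsymbol{w}) + \log P(\boldsymbol{w}) - \log P(\mathcal{D}), and the last term does not depend on \boldsymbol{w}. Plugging in a zero-mean Gaussian prior with variance \sigma^2 (notation mine) turns the log-prior into a scaled L2 penalty:

\boldsymbol{w}^{MAP} = \text{argmax}_{\boldsymbol{w}} \left[ \log P(\mathcal{D}|\boldsymbol{w}) - \frac{1}{2\sigma^2}\|\boldsymbol{w}\|_2^2 \right]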

Ideally, we would like NNs to not just make predictions but also reason about the uncertainty in those predictions in the light of the seen data and prior beliefs. This uncertainty should be higher at points far away from the seen region than at points close to it. Naturally, an even better solution is to estimate the whole posterior distribution by doing full Bayesian inference.

The fix:
Doing full Bayesian inference means using Bayes' rule, in the light of the seen data (denoted as \mathcal{D}=\{(x_i, y_i)\}), to estimate a full posterior distribution over the parameters. This is the underlying concept of BNN training:

P(\boldsymbol{w}|\mathcal{D}) = \frac{P(\mathcal{D}|\boldsymbol{w})\,P(\boldsymbol{w})}{P(\mathcal{D})}

[Figure: Bayesian inference adjusts beliefs about a distribution in the light of data or evidence.]
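The denominator of Bayes' rule, the marginal likelihood P(\mathcal{D}), is where the trouble starts: it requires integrating over every possible setting of the weights,

P(\mathcal{D}) = \int P(\mathcal{D}|\boldsymbol{w})\, P(\boldsymbol{w})\, d\boldsymbol{w}

an integral that has no closed form for a non-linear NN. This is the source of the intractability discussed below.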

The prediction step, computing the output for a new sample, say \hat{\boldsymbol{x}}, is done by taking an expectation of the output over the learned posterior parameter distribution, say P(\boldsymbol{w}^*|\mathcal{D}), as

P(\hat{y}|\hat{\boldsymbol{x}}) = \mathbb{E}_{P(\boldsymbol{w}^*|\mathcal{D})}\left[P(\hat{y}|\hat{\boldsymbol{x}}, \boldsymbol{w}^*)\right]

This expectation is equivalent to averaging the predictions of an infinite ensemble of NNs, weighing each prediction by the posterior probability of its weights. Such model averaging imparts resistance to noise.
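In practice, this expectation is approximated by Monte Carlo sampling: draw a finite number of weight vectors from (an approximation of) the posterior and average the resulting predictions. Below is a minimal NumPy sketch, assuming a toy one-hidden-layer regression network and a factorized Gaussian approximation of the posterior; both are hypothetical stand-ins for whatever posterior approximation one actually trains.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factorized Gaussian posterior over the weights of a
# tiny one-hidden-layer regression net: each weight ~ N(mu, sigma^2).
mu = {"W1": rng.normal(size=(1, 16)), "b1": np.zeros(16),
      "W2": rng.normal(size=(16, 1)), "b2": np.zeros(1)}
sigma = {k: 0.1 * np.ones_like(v) for k, v in mu.items()}

def forward(x, w):
    # Deterministic forward pass for one sampled weight set.
    h = np.tanh(x @ w["W1"] + w["b1"])
    return h @ w["W2"] + w["b2"]

def predict(x, n_samples=100):
    # Monte Carlo estimate of the predictive expectation:
    # sample weights from the posterior, average the predictions.
    preds = []
    for _ in range(n_samples):
        w = {k: mu[k] + sigma[k] * rng.normal(size=mu[k].shape) for k in mu}
        preds.append(forward(x, w))
    preds = np.stack(preds)  # shape: (n_samples, N, 1)
    return preds.mean(axis=0), preds.std(axis=0)

x_new = np.linspace(-3, 3, 50).reshape(-1, 1)
mean, std = predict(x_new)  # std acts as the predictive uncertainty

The standard deviation across the sampled predictions is one simple read-out of the predictive uncertainty that a point-estimate NN cannot provide.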

However, both the exact computation of the posterior and the prediction step shown in the equations above are computationally intractable. Hence, various ways to approximate them have been developed, which yields the wide variety of BNNs available today ([4], [5], [6], [7], [8], [9], [10], [11]).
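To give a flavour of how lightweight such approximations can be, [5] shows that a network trained with dropout can be read as an approximate BNN: keep dropout active at test time, so that each forward pass samples a different thinned network, and average several stochastic passes. Here is a minimal PyTorch sketch; the architecture, dropout rate, and sample count are placeholders, and the training loop is omitted.

import torch
import torch.nn as nn

# A placeholder regression net trained with dropout (training omitted).
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

def mc_dropout_predict(x, n_samples=100):
    # Keep the model in train mode so dropout stays active at test
    # time; each pass then corresponds to one approximate posterior draw.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

mean, std = mc_dropout_predict(torch.linspace(-3, 3, 50).unsqueeze(-1))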

Note that BNNs should be seen as different from NNs that have distributions defined over their hidden units rather than over their parameters. The former is a way to choose suitable NNs (hence providing regularization and model averaging), while the latter is about expressing uncertainty about a particular observation.

Closing Remarks:
In conclusion, BNNs are useful for integrating prior beliefs and modelling uncertainty. Furthermore, they have been shown to improve predictive performance ([4], [14]) and to enable systematic exploration ([13]). Recent advances in deep learning and hardware allow us to approximate the relevant quantities scalably using off-the-shelf optimizers. The fundamental problem in developing BNNs, or any probabilistic model, is the intractable computation of the posterior distribution and of expectations over it; hence we have to resort to approximation. There are broadly two categories of methods for this approximation: stochastic (e.g. Markov Chain Monte Carlo) and deterministic (e.g. variational inference; its objective is sketched after the list below). For readers interested in knowing more about them, I would point to two resources.

  1. Chapters 10 and 11 of the book Pattern Recognition and Machine Learning by Christopher Bishop,
  2. Talk on Scalable Bayesian Inference by David Dunson during NeurIPS 2018, Montreal.
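For the variational route specifically, the idea is to posit a tractable family of distributions q_{\theta}(\boldsymbol{w}) (the notation q_{\theta} is mine) and fit it to the true posterior by minimizing a KL divergence, which is equivalent to maximizing the so-called evidence lower bound:

\theta^* = \text{argmin}_{\theta}\, \text{KL}\left[ q_{\theta}(\boldsymbol{w}) \,\|\, P(\boldsymbol{w}|\mathcal{D}) \right] = \text{argmax}_{\theta}\, \mathbb{E}_{q_{\theta}(\boldsymbol{w})}\left[ \log P(\mathcal{D}|\boldsymbol{w}) \right] - \text{KL}\left[ q_{\theta}(\boldsymbol{w}) \,\|\, P(\boldsymbol{w}) \right]

This is, for example, the objective behind the weight-uncertainty approach of [4].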

References:
[1] Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators.” Neural networks 2, no. 5 (1989): 359-366.

[2] Cybenko, George. “Approximations by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2 (1989): 183-192.

[3] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge: MIT Press, 2016.

[4] Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. “Weight uncertainty in neural networks.” arXiv preprint arXiv:1505.05424 (2015).

[5] Gal, Yarin, and Zoubin Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” In International Conference on Machine Learning, pp. 1050-1059. 2016.

[6] Bui, Thang D., José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, and Richard E. Turner. “Training deep Gaussian processes using stochastic expectation propagation and probabilistic backpropagation.” arXiv preprint arXiv:1511.03405 (2015).

[7] Minka, Thomas P. “Expectation propagation for approximate Bayesian inference.” In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 362-369. Morgan Kaufmann Publishers Inc., 2001.

[8] Hernández-Lobato, José Miguel, and Ryan Adams. “Probabilistic backpropagation for scalable learning of Bayesian neural networks.” In International Conference on Machine Learning, pp. 1861-1869. 2015.

[9] Neal, Radford M. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, 2012.

[10] MacKay, David JC. “A practical Bayesian framework for backpropagation networks.” Neural computation 4, no. 3 (1992): 448-472.

[11] Jylänki, Pasi, Aapo Nummenmaa, and Aki Vehtari. “Expectation propagation for neural networks with sparsity-promoting priors.” The Journal of Machine Learning Research 15, no. 1 (2014): 1849-1901.

[12] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[13] Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. “Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks.” arXiv preprint arXiv:1605.09674 (2016).

[14] Yoon, Jaesik, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. “Bayesian Model-Agnostic Meta-Learning.” In Advances in Neural Information Processing Systems, pp. 7342-7352. 2018.
