# The Very Basics of Bayesian Neural Networks

By virtue of their ability to approximate a broad class of functions ([1], [2]), Neural Network (NN) based architectures have achieved massive success in learning complex input-output mappings from data. However, mere knowledge of the input-output mapping falls short in many situations, especially those that require integrating prior beliefs into the model or where data is limited. Bayesian Neural Networks (BNNs) are NNs whose weights or parameters are expressed as distributions rather than deterministic values and are learned using Bayesian inference. Their innate potential to simultaneously learn complex non-linear functions from data and express uncertainty has lent them a major role in our pursuit of more capable AI. In this blog post, I will cover their significance where traditional (deterministic) NNs fall short. I will also walk readers through the foundational concepts common to all flavors of BNN available today.

What deterministic NNs lack:

Mere knowledge of the input-output mapping learned by an NN is inadequate when we need to gauge the uncertainty in its predictions. This can be important when the available data is limited. NNs have a propensity to overfit to the data they see, which leads to unwarranted extrapolation into unseen regions of the space of interest (see figure below). The problem is glaringly apparent when the seen data does not span the whole space of interest. Hence it would be useful to generate predictive uncertainties that reflect the predictor's confidence in its predictions.

Why deterministic NNs lack what they lack:
The probabilistic explanation behind these downsides is that a deterministic NN computes a maximum likelihood point estimate (MLE): it maximizes the likelihood of the seen data given the parameters of the NN (denoted as $\boldsymbol{w}$), an optimization that is typically solved with gradient-based methods using backpropagation.

$\boldsymbol{w}^{MLE} = \text{argmax}_{\boldsymbol{w}} \log P(\mathcal{D}|\boldsymbol{w})$

Such an optimization tends to overfit the NN to the seen data, so the resulting model can fail to generalize ([3]). One partial fix is to compute, instead of the MLE, the maximum a posteriori (MAP) point estimate, which makes the NN relatively more resistant to overfitting.

$\boldsymbol{w}^{MAP} = \text{argmax}_{\boldsymbol{w}} \log P(\boldsymbol{w}|\mathcal{D})$

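Because $P(\boldsymbol{w}|\mathcal{D}) \propto P(\mathcal{D}|\boldsymbol{w})P(\boldsymbol{w})$, the prior $P(\boldsymbol{w})$ acts as a regularizer: using a Gaussian prior is equivalent to L2 regularization, while using a Laplace prior is equivalent to L1 regularization. However, this does not guard against the unwarranted extrapolation described in the section above on what deterministic NNs lack.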
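
To make the link between the prior and the regularizer explicit, here is a short derivation for the Gaussian case. The zero-mean isotropic prior $\mathcal{N}(\boldsymbol{0}, \sigma^2 I)$ and the symbol $\lambda$ are assumptions introduced only for this illustration.

$\log P(\boldsymbol{w}|\mathcal{D}) = \log P(\mathcal{D}|\boldsymbol{w}) + \log P(\boldsymbol{w}) - \log P(\mathcal{D})$

$\log P(\boldsymbol{w}) = -\frac{1}{2\sigma^2}\|\boldsymbol{w}\|_2^2 + \text{const}$

$\boldsymbol{w}^{MAP} = \text{argmax}_{\boldsymbol{w}} \left[\log P(\mathcal{D}|\boldsymbol{w}) - \lambda\|\boldsymbol{w}\|_2^2\right], \quad \lambda = \frac{1}{2\sigma^2}$

The term $\log P(\mathcal{D})$ does not depend on $\boldsymbol{w}$ and drops out of the argmax. A narrower prior (smaller $\sigma$) gives a larger $\lambda$, that is, a stronger L2 penalty. Repeating the same steps with a Laplace prior replaces the squared norm with $\|\boldsymbol{w}\|_1$.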

Ideally, we would like not just predictions but also the uncertainty in those predictions in light of the seen data and prior beliefs. This uncertainty should ideally be higher at points far from the seen region than at points close to it. Naturally, an even better solution would be to estimate the whole posterior distribution by doing full Bayesian inference.

The fix:
Full Bayesian inference uses Bayes' rule, in light of the seen data (denoted as $\mathcal{D}=\{(x_i, y_i)\}$), to estimate a full posterior distribution over the parameters. This is the underlying concept of BNN training.
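
Written out in the notation used so far, Bayes' rule for the parameters takes the standard form below. The dummy integration variable $\boldsymbol{w}'$ is introduced here only for illustration.

$P(\boldsymbol{w}|\mathcal{D}) = \frac{P(\mathcal{D}|\boldsymbol{w})P(\boldsymbol{w})}{P(\mathcal{D})}, \quad P(\mathcal{D}) = \int P(\mathcal{D}|\boldsymbol{w}')P(\boldsymbol{w}')\,d\boldsymbol{w}'$

The numerator is the same likelihood-times-prior product that MAP maximizes. The denominator, the evidence, requires integrating over every possible setting of the parameters.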

The prediction step, which computes the output for a new sample, say $\hat{\boldsymbol{x}}$, is done by taking an expectation of the output over the optimized posterior parameter distribution, say $P(\boldsymbol{w}^*|\mathcal{D})$, as

$P(\hat{y}|\hat{\boldsymbol{x}}) = \mathbb{E}_{P(\boldsymbol{w}^*|\mathcal{D})}\left[P(\hat{y}|\hat{\boldsymbol{x}}, \boldsymbol{w}^*)\right]$

This expectation is equivalent to averaging the predictions of an infinite number of NNs, each weighted by its posterior probability. This amounts to model averaging and hence imparts resistance to noise.
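
As a rough illustration of this averaging, the sketch below replaces the NN with a Bayesian linear model, chosen only because its Gaussian posterior is available in closed form. It then approximates the predictive expectation with a finite number of posterior samples. The toy data, the feature map, and the values of `alpha` and `beta` are all assumptions made for the example, not part of any particular BNN method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy line observed only on [-1, 1].
x_train = rng.uniform(-1.0, 1.0, size=20)
y_train = 2.0 * x_train + 0.5 + rng.normal(0.0, 0.1, size=20)

def features(x):
    """Map inputs to the design matrix with columns [1, x]."""
    x = np.atleast_1d(x)
    return np.stack([np.ones_like(x), x], axis=1)

alpha = 1.0    # prior precision: w ~ N(0, I / alpha)        (assumed)
beta = 100.0   # noise precision, i.e. noise std of 0.1      (assumed known)

# Closed-form Gaussian posterior P(w | D) = N(mean, cov).
Phi = features(x_train)
cov = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mean = beta * cov @ Phi.T @ y_train

def predict(x_new, n_samples=5000):
    """Monte Carlo mean and std of the model output under the posterior.

    The std reflects parameter uncertainty only; observation noise
    is not added to the sampled predictions.
    """
    w = rng.multivariate_normal(mean, cov, size=n_samples)  # (S, 2)
    preds = w @ features(x_new).T                            # (S, N)
    return preds.mean(axis=0), preds.std(axis=0)

mu_near, sd_near = predict(0.0)    # inside the seen region
mu_far, sd_far = predict(10.0)     # far outside the seen region
print(f"x=0:  mean={mu_near[0]:.2f}, std={sd_near[0]:.3f}")
print(f"x=10: mean={mu_far[0]:.2f}, std={sd_far[0]:.3f}")
```

The spread of the sampled predictions is much larger at `x=10` than at `x=0`, which is the qualitative behavior we want: uncertainty grows away from the seen data.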

However, both the exact computation of the posterior and the prediction step shown in the equations above are computationally intractable. Moreover, backpropagation requires differentiating with respect to the parameters, and there is no direct way to do this when the parameters are distributions rather than values. Hence various ways of approximating these quantities in the context of BNNs have been developed, yielding the wide variety of BNNs available today ([4], [5], [6], [7], [8], [9], [10], [11]).
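
One common workaround for the differentiation problem, used for example in [4] and [12], is the reparameterization trick: a random weight is written as a deterministic function of the distribution's parameters plus independent noise, so gradients can flow to those parameters. The sketch below is a minimal NumPy illustration on a one-dimensional toy loss. The loss `f(w) = w**2` and the values of `mu` and `sigma` are assumptions chosen so that the exact answer is known.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w):
    """Gradient of the toy loss f(w) = w**2 with respect to w."""
    return 2.0 * w

mu, sigma = 1.5, 0.5               # parameters of q(w) = N(mu, sigma^2)
eps = rng.standard_normal(100_000) # noise that does not depend on mu, sigma
w = mu + sigma * eps               # reparameterized samples from q(w)

# Chain rule through w = mu + sigma * eps: dw/dmu = 1, dw/dsigma = eps.
grad_mu = np.mean(loss_grad(w) * 1.0)
grad_sigma = np.mean(loss_grad(w) * eps)

# Analytic check: E[w^2] = mu^2 + sigma^2, so the exact gradients
# are 2 * mu and 2 * sigma.
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

The Monte Carlo estimates should land close to the analytic gradients, which shows that the expected loss can be optimized with ordinary gradient-based methods even though the weight itself is random.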

Note that BNNs should be distinguished from NNs that have distributions defined over their hidden units rather than over their parameters. The former is a way to choose suitable NNs (hence regularization and model averaging), while the latter is about expressing uncertainty about a particular observation.

Closing Remarks:
In conclusion, BNNs are useful for integrating and modeling uncertainties. Furthermore, they have also been shown to improve predictive performance ([4], [14]) and to enable systematic exploration ([13]). Recent advances in deep learning and hardware allow us to approximate the relevant quantities scalably using off-the-shelf optimizers. The fundamental problem in developing BNNs, or any probabilistic model, is the intractable computation of the posterior distribution and of expectations under it. Hence we have to resort to approximation. There are broadly two categories of approximation methods: stochastic (e.g. Markov Chain Monte Carlo) and deterministic (e.g. variational inference). For readers interested in knowing more about them, I would point to two resources.

1. Chapters 10 and 11 of the book Pattern Recognition and Machine Learning by Christopher Bishop,
2. Talk on Scalable Bayesian Inference by David Dunson during NeurIPS 2018, Montreal.

References:
[1] Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks are universal approximators.” Neural networks 2, no. 5 (1989): 359-366.

[2] Cybenko, George. “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2, no. 4 (1989): 303-314.

[3] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. Vol. 1. Cambridge: MIT press, 2016.

[4] Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. “Weight uncertainty in neural networks.” arXiv preprint arXiv:1505.05424 (2015).

[5] Gal, Yarin, and Zoubin Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” In International Conference on Machine Learning, pp. 1050-1059. 2016.

[6] Bui, Thang D., José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, and Richard E. Turner. “Training deep Gaussian processes using stochastic expectation propagation and probabilistic backpropagation.” arXiv preprint arXiv:1511.03405 (2015).

[7] Minka, Thomas P. “Expectation propagation for approximate Bayesian inference.” In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pp. 362-369. Morgan Kaufmann Publishers Inc., 2001.

[8] Hernández-Lobato, José Miguel, and Ryan Adams. “Probabilistic backpropagation for scalable learning of bayesian neural networks.” In International Conference on Machine Learning, pp. 1861-1869. 2015.

[9] Neal, Radford M. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media, 2012.

[10] MacKay, David JC. “A practical Bayesian framework for backpropagation networks.” Neural computation 4, no. 3 (1992): 448-472.

[11] Jylänki, Pasi, Aapo Nummenmaa, and Aki Vehtari. “Expectation propagation for neural networks with sparsity-promoting priors.” The Journal of Machine Learning Research 15, no. 1 (2014): 1849-1901.

[12] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[13] Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. “Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks.” arXiv preprint arXiv:1605.09674 (2016).

[14] Yoon, Jaesik, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. “Bayesian Model-Agnostic Meta-Learning.” In Advances in Neural Information Processing Systems, pp. 7342-7352. 2018.