Paper Status: Accepted to ICRA 2019
Links to the paper, code, presentation, videos, and poster.
Learning to make decisions in the real world is hard for a plethora of reasons. The decision-making architecture, or controller, has to not only find a good representation but also display a high degree of adaptability and versatility. The task is made harder still by the challenge of communicating intent unambiguously and meaningfully. These problems shape current AI research agendas aimed at letting AI permeate and integrate deeply and effectively into human societies.
Learning from Demonstrations (LfD) has been one of the most relied-upon ways to create an effective AI system, primarily because it allows robots and humans to communicate with each other relatively easily. In some cases, it reduces a difficult learning problem to the highly mature problem of supervised learning. However, this only partially improves the situation: the world is unpredictably diverse, and neural networks tend to extrapolate in unwarranted ways. One solution is to let controllers know what they don't know, so that they can fall back on an appropriate strategy when their knowledge is insufficient to handle a given situation.
Fortunately, mathematics has gifted us the principled framework of Bayesian inference for capturing one's confidence in handling the situation at hand, given the data seen so far. In this work, we leverage Bayesian neural networks trained with Bayes by Backprop (BBB) [1], which fuse the universal learnability of neural networks with Bayesian inference, to generate both decisions and predictive uncertainty, and to learn from demonstrations in a way that lets the controller know what it doesn't know. The demonstrations are fed in as state-action pairs, and the BBB controllers learn them in a supervised regression fashion.
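To make this concrete, below is a minimal sketch of a BBB-style controller in PyTorch: each weight gets a factorized Gaussian posterior, sampled via the reparameterization trick, and training minimizes a regression loss on the demonstrations plus a KL penalty toward the prior. Layer sizes, the prior scale, and the KL weight are illustrative choices, not the exact settings from our paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BBBLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights and biases."""
    def __init__(self, in_dim, out_dim, prior_std=1.0):
        super().__init__()
        self.w_mu = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.w_rho = nn.Parameter(torch.full((out_dim, in_dim), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_dim))
        self.b_rho = nn.Parameter(torch.full((out_dim,), -3.0))
        self.prior_std = prior_std

    def forward(self, x):
        # Reparameterization trick: sample weights while keeping gradients
        # with respect to the variational parameters (mu, rho).
        w_std = F.softplus(self.w_rho)
        b_std = F.softplus(self.b_rho)
        w = self.w_mu + w_std * torch.randn_like(w_std)
        b = self.b_mu + b_std * torch.randn_like(b_std)
        return F.linear(x, w, b)

    def kl(self):
        # Closed-form KL from the N(mu, std^2) posterior to the N(0, prior_std^2) prior.
        def term(mu, std):
            return (torch.log(self.prior_std / std)
                    + (std ** 2 + mu ** 2) / (2 * self.prior_std ** 2) - 0.5).sum()
        return (term(self.w_mu, F.softplus(self.w_rho))
                + term(self.b_mu, F.softplus(self.b_rho)))

class BBBController(nn.Module):
    """Maps (windowed) states to actions, with uncertainty over its weights."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.l1 = BBBLinear(state_dim, hidden)
        self.l2 = BBBLinear(hidden, action_dim)

    def forward(self, x):
        return self.l2(torch.relu(self.l1(x)))

    def kl(self):
        return self.l1.kl() + self.l2.kl()

def train_step(net, opt, states, actions, kl_weight=1e-3):
    """One supervised-regression step on demonstration state-action pairs."""
    opt.zero_grad()
    loss = F.mse_loss(net(states), actions) + kl_weight * net.kl()
    loss.backward()
    opt.step()
    return loss.item()
```

Initializing rho around -3 keeps the initial weight noise small (softplus(-3) is roughly 0.05), so early training behaves much like that of a deterministic network.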
We first want to see whether the predictive uncertainty, taken as the standard deviation of our BBB's output, is consistently indicative of the decision-making quality of our controller, or, simply put, whether it reflects the controller's confidence. To demonstrate this, we use diverse variants, or contexts, of a real-robot pendulum swing-up task, where different contexts are systems with different pole masses, as shown in the table below.


To let our BNN-based controller detect unobservable changes in dynamics, we feed it histories of a few recent time steps, which we call temporal windows. The figure below shows that the standard deviation of our BBB is consistently indicative of the controller's decision-making quality. Videos corresponding to all the dots in the figure below can be found here.
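For concreteness, here is a minimal sketch (continuing the PyTorch code above) of how a temporal window is assembled and how the predictive mean and standard deviation are estimated by Monte Carlo sampling of the weights; the window length H and sample count K are illustrative values, not the paper's.

```python
import numpy as np
import torch

def make_window(history, H=4):
    """Stack the H most recent observations into one flat network input."""
    return torch.as_tensor(np.concatenate(history[-H:]), dtype=torch.float32)

@torch.no_grad()
def predict_with_uncertainty(net, x, K=50):
    """Predictive mean and per-dimension std over K weight samples.

    Each forward pass through a BBB network draws fresh weights, so repeated
    calls give Monte Carlo samples from the predictive distribution.
    """
    samples = torch.stack([net(x) for _ in range(K)])
    return samples.mean(dim=0), samples.std(dim=0)
```

At run time, the controller executes the mean action and monitors the standard deviation as its confidence signal.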

Next, we show how a framework can be built on top of this ability to capture task success with predictive uncertainty, giving the controller an appropriate fallback strategy. We set up our experiments in an online fashion, where the controller faces diverse contexts of the HalfCheetah and Swimmer tasks in the MuJoCo physics simulator. As one example among many possible fallback strategies, we let the controller seek and retrain on demonstrations specific to any context in which it is not confident. This calls for an adaptive threshold that grounds the quantitative value of the predictive uncertainty in what the controller already knows. We set this threshold to the average predictive uncertainty the controller produces on all the contexts it has trained on so far. Then, in each new context, the controller seeks more demonstrations whenever its confidence falls below the adaptive threshold, as sketched after this paragraph. The following two figures show what the uncertainties look like on familiar and unfamiliar contexts of the Swimmer task, and how they change upon training on an unfamiliar context.
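The online loop looks roughly like the sketch below. The helpers rollout_uncertainty, request_demonstrations, and retrain are hypothetical stand-ins for the environment- and expert-specific machinery; only the adaptive-thresholding logic mirrors the description above.

```python
def run_online(net, contexts, seen_uncertainties):
    """Online loop with an adaptive uncertainty threshold.

    seen_uncertainties starts with the mean predictive stds recorded on the
    contexts used for initial training, so the threshold is always defined.
    """
    for ctx in contexts:
        # Adaptive threshold: average uncertainty over all contexts trained on so far.
        threshold = sum(seen_uncertainties) / len(seen_uncertainties)
        u = rollout_uncertainty(net, ctx)  # mean predictive std on this context
        if u > threshold:
            # Fallback: confidence is too low, so ask for context-specific demonstrations.
            demos = request_demonstrations(ctx)
            retrain(net, demos)
            u = rollout_uncertainty(net, ctx)  # re-evaluate after retraining
        seen_uncertainties.append(u)
```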

The plot below shows that, with a small number of requests for context-specific demonstrations, our proposed controller achieves a level of performance similar to that of a naive controller that seeks demonstrations in every context it faces. This figure corresponds to the experiments on various contexts of the HalfCheetah task. Note that this level of performance is substantially higher than that of a random controller that never seeks any demonstrations.

The above results are promising, especially for practitioners and enthusiasts looking to build complex control systems from demonstrations, since demonstrations are expensive to obtain. You can find more insights about our approach in our paper.
Our future work includes developing theoretical worst-case performance bounds for such a probabilistic controller, and devising better fallback strategies, since training cannot always be treated as a sunk cost.
References:
[1] Blundell, C., Cornebise, J., Kavukcuoglu, K. and Wierstra, D., 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.