Uncertainty Aware Learning from Demonstrations in Multiple Contexts using Bayesian Neural Networks

Paper Status: Accepted to ICRA 2019
Links to the paper, code, presentation, videos, and poster.

Learning to make decisions in the real world is hard for a plethora of reasons. The decision-making architecture, or controller, not only has to find a good representation but also has to display a high degree of adaptability and versatility. The task is made harder still by the challenge of communicating intent to the controller unambiguously and meaningfully. These problems shape current AI research agendas that aim to let AI permeate human societies and integrate with them deeply and effectively.

Learning from Demonstrations (LfD) has been one of the most relied-upon ways to create an effective AI system, primarily because it allows robots and humans to communicate with each other relatively easily. In some cases, it reduces a difficult learning problem to the highly mature supervised learning problem. However, this only partially improves the situation, since the world is unpredictably diverse and neural networks tend to extrapolate without warrant. One solution is to let controllers know what they don't know, which allows them to fall back on an appropriate strategy when their knowledge is insufficient to handle a given situation.

Fortunately for us, mathematics has given us the principled framework of Bayesian inference to capture confidence in one's capability to handle the situation at hand, given the data seen so far. In this work, we leverage Bayesian neural networks trained with Bayes by Backprop (BBB) [1], which fuse the universal learnability of neural networks with Bayesian inference to produce both decisions and predictive uncertainty, and we use them to learn from demonstrations in a way that lets the controller know what it doesn't know. The demonstrations are fed in as state-action pairs, and the BBB controllers learn from them in a supervised regression fashion.
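To make this concrete, below is a minimal PyTorch sketch (not our released code) of a Bayes-by-Backprop layer with a factorized Gaussian posterior, a small controller built from such layers, and a Monte Carlo estimate of the predictive mean and standard deviation. The layer sizes, prior scale, and number of samples are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BayesianLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights and biases."""

    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.prior_std = prior_std
        # Variational posterior parameters: mean and rho, with std = softplus(rho).
        self.w_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        # Reparameterization trick: sample weights while keeping gradients w.r.t. mu and rho.
        w_std = F.softplus(self.w_rho)
        b_std = F.softplus(self.b_rho)
        w = self.w_mu + w_std * torch.randn_like(w_std)
        b = self.b_mu + b_std * torch.randn_like(b_std)
        # KL between the posterior and a zero-mean Gaussian prior, used in the loss.
        self.kl = self._kl(self.w_mu, w_std) + self._kl(self.b_mu, b_std)
        return F.linear(x, w, b)

    def _kl(self, mu, std):
        prior_var = self.prior_std ** 2
        return 0.5 * torch.sum(
            (std ** 2 + mu ** 2) / prior_var - 1.0 - 2.0 * torch.log(std / self.prior_std)
        )


class BBBController(nn.Module):
    """Maps an input (e.g., a temporal window of states) to an action; outputs are stochastic."""

    def __init__(self, input_dim, action_dim, hidden=64):
        super().__init__()
        self.l1 = BayesianLinear(input_dim, hidden)
        self.l2 = BayesianLinear(hidden, action_dim)

    def forward(self, x):
        return self.l2(torch.relu(self.l1(x)))

    def kl(self):
        # Valid after a forward pass, which populates each layer's KL term.
        return self.l1.kl + self.l2.kl


def predict_with_uncertainty(model, x, n_samples=20):
    """Monte Carlo estimate of the predictive mean and standard deviation."""
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)
```

Training then minimizes a regression loss on the demonstrated actions (for instance, mean squared error) plus the layers' KL terms, weighted per minibatch as in [1].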

We first want to see whether the predictive uncertainty, taken as the standard deviation of our BBB's output, is consistently indicative of the decision-making quality of our controller, or, simply put, whether it reflects the controller's confidence. To demonstrate this, we use diverse variants, or contexts, of a real-robotic pendulum swing-up task, where different contexts are systems with different pole masses, as shown in the table below.

Figure: We use a real robotic pendulum swing-up task in one of the experiments.

Table: Masses of the poles for the different contexts of the real robotic pendulum task.

To have our BNN-based controller detect unobservable changes in dynamics, we feed it histories of a few recent time steps, which we call temporal windows. The figure below shows that the standard deviation of our BBB is consistently indicative of the decision-making quality of the controller. Videos corresponding to all the dots in the figure below can be found here.

Figure: For each pendulum swing-up context, the standard deviation and episodic reward obtained during five independent runs are plotted against each other. Note the strong relationship between these quantities: higher standard deviation is correlated with lower episodic reward. Thus, the standard deviation can be used as a predictor of task success. Training was performed on task context 6 in this case.
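To make the role of temporal windows concrete, the sketch below shows one way to stack recent observations into a window and query the controller for both an action and a scalar confidence. It assumes the BBBController and predict_with_uncertainty helpers from the earlier sketch, and the window length of five steps is an illustrative choice, not necessarily the one used in the paper.

```python
from collections import deque

import torch

WINDOW = 5                       # number of recent time steps in the temporal window
history = deque(maxlen=WINDOW)   # rolling buffer of recent observations; clear it on reset


def act(model, obs):
    """Stack recent observations into a temporal window and query the BNN controller."""
    history.append(torch.as_tensor(obs, dtype=torch.float32))
    while len(history) < WINDOW:              # pad the buffer at the start of an episode
        history.appendleft(history[0].clone())
    window = torch.cat(list(history)).unsqueeze(0)   # shape (1, WINDOW * obs_dim)
    mean_action, std_action = predict_with_uncertainty(model, window)
    # Averaging the per-dimension standard deviation gives a single scalar
    # confidence measure for this decision.
    return mean_action.squeeze(0), std_action.mean().item()
```

The controller's input dimension must match WINDOW times the observation dimension for this to work.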

Next, we show how a framework can be built on top of this ability to capture task success with predictive uncertainty, one that lets us put an appropriate fallback strategy in place. We set up our experiments in an online fashion where the controller faces diverse contexts of the HalfCheetah and Swimmer tasks of the MuJoCo physics simulator. As one example of many possible fallback strategies, we let the controller seek and retrain on demonstrations specific to any context in which it finds itself unconfident. This calls for an adaptive threshold that grounds the quantitative value of the predictive uncertainty in what the controller already knows. We set this adaptive threshold to be the average predictive uncertainty that the controller generates on all the contexts it has trained on so far. Now, at each context, the controller can seek more demonstrations if it finds its confidence below the adaptive threshold. The following two figures show what the uncertainties look like on familiar and unfamiliar contexts of the Swimmer task and how they change upon training on an unfamiliar context.

Figure: Visualization of how the adaptive threshold of the controller and the predictive uncertainties of the contexts change after training on context-specific demonstrations.
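The sketch below illustrates the adaptive-threshold logic described above: the controller requests demonstrations for a context whenever its average predictive uncertainty there exceeds the mean uncertainty over the contexts it has already trained on. The three callables passed in are hypothetical placeholders for the corresponding steps of the pipeline, not functions from our released code.

```python
def run_online(controller, contexts,
               mean_uncertainty_on, request_demonstrations, train_on_demos):
    """Face a sequence of contexts online, requesting demonstrations only when needed."""
    seen = []  # average predictive std on every context trained on so far

    for context in contexts:
        sigma = mean_uncertainty_on(controller, context)
        # Adaptive threshold: the mean uncertainty over all contexts the
        # controller has already trained on.
        threshold = sum(seen) / len(seen) if seen else 0.0

        if not seen or sigma > threshold:
            # Fallback strategy: confidence is too low, so ask for
            # context-specific demonstrations and retrain on them.
            demos = request_demonstrations(context)
            train_on_demos(controller, demos)
            seen.append(mean_uncertainty_on(controller, context))
        # Otherwise the controller simply acts with its current policy.
```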

The plot below shows that with a small number of requests for context-specific demonstrations, our proposed controller achieves a level of performance similar to that of a naive controller that seeks demonstrations on every context it faces. This figure corresponds to the experiments on various contexts of the HalfCheetah task. Note that this level of performance is significantly higher than that of a random controller that never seeks any demonstrations.

Figure: The number of demonstration requests made by various controller configurations (based on the values of c and m) plotted against the cumulative reward obtained over all the contexts faced. Performance close to that of a naive learner that seeks demonstrations on every context can be obtained with far fewer requests. The results were averaged over 5 simulation runs with different randomized orderings of contexts. We refer readers to the paper for more about the hyperparameters c and m.

The above results are promising, especially for professionals and enthusiasts who are looking to build complex control systems from demonstrations, since demonstrations are expensive to obtain. You can find more insights about our approach in our paper.

Our future work includes developing theoretical performance bounds for such a probabilistic controller in worst-case situations and devising better fallback strategies, since retraining cannot always be treated as a sunk cost.

References:

[1] Blundell, C., Cornebise, J., Kavukcuoglu, K. and Wierstra, D., 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
