Scaling up Robot Learning by Understanding Internet Videos

This NSF-funded project seeks to develop techniques to understand videos so as to scale up policy learning for robots. We are pursuing a four-step approach.

Direct use of videos for policy learning is challenging: videos lack action grounding, human and robot actions are mismatched, the goals and intents depicted are not known, and the demonstrated behavior may be sub-optimal. This project therefore develops an understanding of videos from the point of view of interaction, along with learning techniques and policy architectures that can learn in spite of these challenges. Successful completion of the project will lead to navigation and manipulation policies that generalize well.


One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation
Matthew Chang, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023

Abstract: In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach where we a) separate out task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution which side-steps the DAgger problem and last centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model gets 100% and 48% success rates on two recent benchmarks, improving upon the current state-of-the-art by absolute 90% and 20% respectively.

Human Hands as Probes for Interactive Object Understanding
Mohit Goyal, Sahil Modi, Rishabh Goyal, Saurabh Gupta
Computer Vision and Pattern Recognition (CVPR), 2022
webpage / code+data

Abstract: Interactive object understanding, or what we can do to objects and how, is a long-standing goal of computer vision. In this paper, we tackle this problem through observation of human hands in in-the-wild egocentric videos. We demonstrate that observation of what human hands interact with and how can provide both the relevant data and the necessary supervision. Attending to hands readily localizes and stabilizes active objects for learning and reveals places where interactions with objects occur. Analyzing the hands shows what we can do to objects and how. We apply these basic principles on the EPIC-KITCHENS dataset, and successfully learn state-sensitive features and object affordances (regions of interaction and afforded grasps), purely by observing hands in egocentric videos.

Learning Value Functions from Undirected State-only Experience
Matthew Chang*, Arjun Gupta*, Saurabh Gupta
International Conference on Learning Representations (ICLR), 2022
Deep Reinforcement Learning Workshop at NeurIPS, 2021
Offline Reinforcement Learning Workshop at NeurIPS, 2021
webpage / arxiv link / code

Abstract: This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s,s’,r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning or LAQ, an offline RL method that can learn effective value functions from state-only experience. Latent Action Q-learning (LAQ) learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions. Value functions learned using LAQ lead to sample efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods.
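
The refinement claim in the abstract above can be checked on a toy problem. The sketch below is an illustration only, not the LAQ implementation: it assumes a small deterministic chain MDP, and the refined action labels are assigned by hand rather than obtained from a latent-variable future prediction model. The names `chain_transitions`, `refine`, and `q_iteration` are hypothetical.

```python
from collections import defaultdict


def chain_transitions(n=5):
    """All transitions of a deterministic chain; reaching state n-1 gives reward 1."""
    data = []
    for s in range(n - 1):
        for action, step in (("left", -1), ("right", +1)):
            s_next = min(max(s + step, 0), n - 1)
            done = s_next == n - 1
            data.append((s, action, s_next, 1.0 if done else 0.0, done))
    return data


def refine(transitions, copies=3):
    """Split every action into `copies` finer labels (an assumed, hand-made refinement)."""
    return [(s, (a, k), s_next, r, done)
            for k in range(copies)
            for (s, a, s_next, r, done) in transitions]


def q_iteration(transitions, gamma=0.9, sweeps=200):
    """Tabular Q-iteration on a fixed batch; returns V(s) = max over actions seen at s."""
    q = defaultdict(float)
    actions_at = defaultdict(set)
    for s, a, _, _, _ in transitions:
        actions_at[s].add(a)
    for _ in range(sweeps):
        for s, a, s_next, r, done in transitions:
            future = 0.0
            if not done:
                # Maximize only over actions that appear in the batch at s_next.
                future = max((q[(s_next, b)] for b in actions_at.get(s_next, ())),
                             default=0.0)
            q[(s, a)] = r + gamma * future
    return {s: max(q[(s, a)] for a in acts) for s, acts in actions_at.items()}


base = chain_transitions()
v_true = q_iteration(base)             # values under the original two actions
v_refined = q_iteration(refine(base))  # values under six refined action labels
print(v_true)
print(v_refined)
```

Under these assumptions the two value functions coincide state by state, even though the second run never sees the original action labels.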

Learned Visual Navigation for Under-Canopy Agricultural Robots
Arun Sivakumar, Sahil Modi, Mateus Gasparino, Che Ellis, Andres Velasquez, Girish Chowdhary*, Saurabh Gupta*
Robotics: Science and Systems (RSS), 2021

Abstract: This paper describes a system for visually guided autonomous navigation of under-canopy farm robots. Low-cost under-canopy robots can drive between crop rows under the plant canopy and accomplish tasks that are infeasible for over-the-canopy drones or larger agricultural equipment. However, autonomously navigating them under the canopy presents a number of challenges: unreliable GPS and LiDAR, high cost of sensing, challenging farm terrain, clutter due to leaves and weeds, and large variability in appearance over the season and across crop types. We address these challenges by building a modular system that leverages machine learning for robust and generalizable perception from monocular RGB images from low-cost cameras, and model predictive control for accurate control in challenging terrain. Our system, CropFollow, is able to autonomously drive 485 meters per intervention on average, outperforming a state-of-the-art LiDAR based system (286 meters per intervention) in extensive field testing spanning over 25 km.

Semantic Visual Navigation by Watching YouTube Videos
Matthew Chang, Arjun Gupta, Saurabh Gupta
Neural Information Processing Systems (NeurIPS), 2020
arxiv link / webpage / video / code

Abstract: Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos do not come with labels for actions or goals, and may not even showcase optimal behavior. Our method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). We show that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency at the ObjectGoal task in visually realistic simulations. We observe a relative improvement of 15-83% over end-to-end RL, behavior cloning, and classical methods, while using minimal direct interaction.
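
A minimal sketch of Q-learning on pseudo-labeled quadruples, under strong simplifying assumptions: frames are replaced by integer positions in a 1D corridor, and `pseudo_label_action` and `pseudo_label_reward` are hypothetical stand-ins for whatever models produce the action and reward labels. It is meant only to show how off-policy updates can extract useful values from a wandering, sub-optimal video, not to reproduce the method in the paper.

```python
from collections import defaultdict

ACTIONS = ("backward", "forward")


def pseudo_label_action(frame, next_frame):
    """Hypothetical stand-in for an action labeler: guess the action from two frames."""
    return "forward" if next_frame > frame else "backward"


def pseudo_label_reward(next_frame, goal=4):
    """Hypothetical stand-in for a reward labeler: 1 when the goal is reached."""
    return 1.0 if next_frame == goal else 0.0


def quadruples_from_video(frames):
    """Turn consecutive frames into (frame, action, next frame, reward) quadruples."""
    return [(f, pseudo_label_action(f, g), g, pseudo_label_reward(g))
            for f, g in zip(frames, frames[1:])]


def q_learning(quads, gamma=0.9, alpha=0.5, epochs=200):
    """Off-policy tabular Q-learning over a fixed set of quadruples."""
    q = defaultdict(float)
    for _ in range(epochs):
        for s, a, s_next, r in quads:
            # The max over all actions is what makes the update off-policy:
            # it does not matter which action the video took next.
            target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
    return q


# A sub-optimal "video": the camera wanders back and forth before reaching the goal.
video = [0, 1, 0, 1, 2, 1, 2, 3, 4]
q = q_learning(quadruples_from_video(video))
values = {s: max(q[(s, b)] for b in ACTIONS) for s in sorted(set(video))}
greedy = {s: max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(4)}
print(values)
print(greedy)
```

In this toy setting the learned values increase toward the goal and the greedy action at every non-goal position is "forward", despite the back-and-forth behavior in the data. The goal position itself has no outgoing transition in the video, so its value stays at 0.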





This material is based upon work supported by the National Science Foundation under Grant No. IIS-2007035 (Project Title: Scaling up Robot Learning by Understanding Internet Videos, PI: Saurabh Gupta). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.