CAREER: Learning Predictive Models for Visual Navigation and Object Interaction

Efficiently moving around and interacting with objects in novel environments requires building expectations about people (e.g., on which side an oncoming person will pass), places (e.g., where car keys are likely to be in a home), and things (e.g., which way a door will open). However, manually building such expectations into decision-making systems is challenging. At the same time, machine learning has been shown to be successful in extracting representative patterns from training datasets in many related application domains. While the use of machine learning to learn predictive models for decision-making seems promising, the design choices (data sources, forms of supervision, architectures for the predictive models, and interaction of the predictive models with decision-making) are deeply intertwined. As part of this project, investigators will identify the precise aspects in which machine learning benefits navigation and object interaction, and co-design datasets, models, and learning algorithms to build systems that realize these benefits. The project will improve the state of the art in predictive reasoning for navigation and object interaction by designing approaches that can leverage large-scale diverse data sources for training. Models, datasets, and systems developed in this project will advance navigation and mobile manipulation capabilities. These will enable practical downstream applications (e.g., assistive robots, telepresence), and open up avenues for follow-up research (e.g., human-robot interaction). The project will contribute to the education of students and the broader community through curriculum development, engagement in research projects, and accessible dissemination of research.

The project will co-design data collection methods, learning techniques, and policy architectures to enable large-scale learning of predictive models for people, places, and things for problems involving navigation and mobile manipulation. Investigators will tackle the following three research tasks: (1) designing predictive models for people, places, and objects that are necessary for decision making; (2) identifying data sources and generating supervision to learn these predictive models at scale; and (3) designing hierarchical and modular policy architectures that effectively use the learned predictive models. Investigators will re-use existing sense-plan-control components (motion planners, feedback controllers) where applicable (e.g., motion in free space), and introduce learning in modules that require speculation (i.e., high-level decision-making modules, e.g., identifying promising directions for exploration, predicting where an oncoming human will go next, or choosing a good position from which to open a drawer). Investigators will evaluate the effectiveness of proposed methods by comparing the efficiency of systems with and without predictive reasoning.


Opening Cabinets and Drawers in the Real World using a Commodity Mobile Manipulator
Arjun Gupta*, Michelle Zhang*, Rishik Sathua, Saurabh Gupta
arXiv, 2024

Abstract: Pulling open cabinets and drawers presents many difficult technical challenges in perception (inferring articulation parameters for objects from onboard sensors), planning (producing motion plans that conform to tight task constraints), and control (making and maintaining contact while applying forces on the environment). In this work, we build an end-to-end system that enables a commodity mobile manipulator (Stretch RE2) to pull open cabinets and drawers in diverse previously unseen real world environments. We conduct 4 days of real world testing of this system spanning 31 different objects from across 13 different real world environments. Our system achieves a success rate of 61% on opening novel cabinets and drawers in unseen environments zero-shot. An analysis of the failure modes suggests that errors in perception are the most significant challenge for our system. We will open source code and models for others to replicate and build upon our system.

GOAT: GO to Any Thing
Matthew Chang*, Theophile Gervet*, Mukul Khanna*, Sriram Yenamandra*, Dhruv Shah, So Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, Devendra Chaplot
Robotics: Science and Systems (RSS), 2024

Abstract: In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This enables GOAT to distinguish between different instances of the same category to enable navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes consisting of 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success rate after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation.

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Aditya Prakash, Arjun Gupta, Saurabh Gupta
arXiv, 2023

Abstract: Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.

Push Past Green: Learning to Look Behind Plant Foliage by Moving It
Xiaoyu Zhang, Saurabh Gupta
Conference on Robot Learning (CoRL), 2023
webpage / code+data

Abstract: Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating the plant foliage to look behind the leaves and the branches. Partial visibility, extreme clutter, thin structures, and unknown geometry and dynamics for plants make such manipulation challenging. We tackle these challenges through data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed on execution of a candidate action on a given plant. We use SRPNet with the cross-entropy method to predict actions that are effective at revealing space beneath plant foliage. Furthermore, as SRPNet does not just predict how much space is revealed but also where it is revealed, we can execute a sequence of actions that incrementally reveal more and more space beneath the plant foliage. We experiment with a synthetic (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings including 2 settings that test generalization to novel plant configurations. Our experiments reveal the effectiveness of our overall method, PPG, over a competitive hand-crafted exploration method, and the effectiveness of SRPNet over a hand-crafted dynamics model and relevant ablations.
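As a rough illustration of how a learned predictor can be paired with the cross-entropy method to pick an action, here is a minimal sketch. It is not the paper's implementation: `toy_revealed_space` is a hypothetical stand-in for a model such as SRPNet, the 2-D action parameterization is assumed, and all hyperparameters are illustrative.

```python
import numpy as np


def toy_revealed_space(actions):
    """Hypothetical stand-in for a learned predictor such as SRPNet.

    Takes an (N, 2) array of candidate actions and returns one score per
    action. This toy score peaks at the action (0.3, -0.2).
    """
    target = np.array([0.3, -0.2])
    return -np.sum((actions - target) ** 2, axis=1)


def cross_entropy_method(score_fn, dim, n_iters=20, n_samples=64,
                         n_elite=8, seed=0):
    """Search for a high-scoring action with the cross-entropy method."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)  # initial sampling distribution (assumed)
    std = np.ones(dim)
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian.
        samples = rng.normal(mean, std, size=(n_samples, dim))
        scores = score_fn(samples)
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6  # small floor avoids a zero std
    return mean


best_action = cross_entropy_method(toy_revealed_space, dim=2)
```

In a real system the scoring function would query the learned model on the current observation of the plant, and the selected action would be executed before re-planning.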

Predicting Motion Plans for Articulating Everyday Objects
Arjun Gupta, Max Shepherd, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023
webpage / dataset

Abstract: Mobile manipulation tasks such as opening a door, pulling open a drawer, or lifting a toilet lid require constrained motion of the end-effector under environmental and task constraints. This, coupled with partial information in novel environments, makes it challenging to employ classical motion planning approaches at test time. Our key insight is to cast it as a learning problem to leverage past experience of solving similar planning problems to directly predict motion plans for mobile manipulation tasks in novel situations at test time. To enable this, we develop a simulator, ArtObjSim, that simulates articulated objects placed in real scenes. We then introduce SeqIK+\(\theta_0\), a fast and flexible representation for motion plans. Finally, we learn models that use SeqIK+\(\theta_0\) to quickly predict motion plans for articulating novel objects at test time. Experimental evaluation shows improved speed and accuracy at generating motion plans compared with pure search-based methods and pure learning methods.

Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds
Shaowei Liu, Saurabh Gupta*, Shenlong Wang*
Computer Vision and Pattern Recognition (CVPR), 2023
website / code

Abstract: We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1 degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, what parts are connected to what other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models enable retargeting to novel poses with sparse point correspondence guidance. We test our method on a new articulating robot dataset, and the Sapiens dataset with common daily objects, as well as real-world scans. Experiments show that our method outperforms two leading prior works on various metrics.




This material is based upon work supported by the National Science Foundation under Grant No. IIS-2143873 (Project Title: CAREER: Learning Predictive Models for Visual Navigation and Object Interaction, PI: Saurabh Gupta). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.