CAREER: Learning Predictive Models for Visual Navigation and Object Interaction

Efficiently moving around and interacting with objects in novel environments requires building expectations about people (e.g., which side an oncoming person will pass on), places (e.g., where car keys are likely to be in a home), and things (e.g., which way a door will open). However, manually building such expectations into decision-making systems is challenging. At the same time, machine learning has been shown to be successful at extracting representative patterns from training datasets in many related application domains. While the use of machine learning to learn predictive models for decision-making seems promising, the design choices involved (data sources, forms of supervision, architectures for the predictive models, and the interaction of the predictive models with decision-making) are deeply intertwined. As part of this project, investigators will identify the precise aspects in which machine learning benefits navigation and object interaction, and co-design datasets, models, and learning algorithms to build systems that realize these benefits. The project will improve the state of the art in predictive reasoning for navigation and object interaction by designing approaches that can leverage large-scale, diverse data sources for training. Models, datasets, and systems developed in this project will advance navigation and mobile manipulation capabilities. These will enable practical downstream applications (e.g., assistive robots, telepresence) and open up avenues for follow-up research (e.g., human-robot interaction). The project will contribute to the education of students and the broader community through curriculum development, engagement in research projects, and accessible dissemination of research.

The project will co-design data collection methods, learning techniques, and policy architectures to enable large-scale learning of predictive models for people, places, and things for problems involving navigation and mobile manipulation. Investigators will tackle the following three research tasks: (1) designing predictive models for people, places, and objects that are necessary for decision making; (2) identifying data sources and generating supervision to learn these predictive models at scale; and (3) designing hierarchical and modular policy architectures that effectively use the learned predictive models. Investigators will re-use existing sense-plan-control components (motion planners, feedback controllers) where applicable (e.g., for motion in free space), and introduce learning in modules that require speculation, i.e., high-level decision-making modules such as identifying promising directions for exploration, predicting where an oncoming human will go next, or determining a good position from which to open a drawer. Investigators will evaluate the effectiveness of the proposed methods by comparing the efficiency of systems with and without predictive reasoning.
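
As an illustration of this modular design, the following is a minimal Python sketch, not the project's actual code: a hypothetical learned model scores candidate exploration frontiers (the speculative, high-level decision), while assumed classical planner and controller objects handle motion in free space. All names and interfaces below are illustrative assumptions.

import numpy as np

class LearnedFrontierScorer:
    """Hypothetical stand-in for a learned model that scores candidate exploration frontiers."""
    def score(self, observation, frontiers):
        # A trained network would predict which frontier is most likely to lead toward
        # the goal; this placeholder returns random scores.
        return np.random.rand(len(frontiers))

def modular_navigation_step(observation, frontiers, scorer, planner, controller):
    # Speculative, learned component: pick a promising subgoal (frontier).
    scores = scorer.score(observation, frontiers)
    subgoal = frontiers[int(np.argmax(scores))]
    # Classical sense-plan-control components: plan a collision-free path to the subgoal
    # and compute a command to track it.
    path = planner.plan(observation["pose"], subgoal)      # e.g., A* / RRT on a local map
    command = controller.track(path, observation["pose"])  # e.g., a feedback controller
    return command

In this sketch, learning is confined to the decision that genuinely requires prediction; everything downstream of the chosen subgoal reuses standard planning and control.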

Publications

Opening Cabinets and Drawers in the Real World using a Commodity Mobile Manipulator
Arjun Gupta*, Michelle Zhang*, Rishik Sathua, Saurabh Gupta
arXiv, 2024
website

Abstract: Pulling open cabinets and drawers presents many difficult technical challenges in perception (inferring articulation parameters for objects from onboard sensors), planning (producing motion plans that conform to tight task constraints), and control (making and maintaining contact while applying forces on the environment). In this work, we build an end-to-end system that enables a commodity mobile manipulator (Stretch RE2) to pull open cabinets and drawers in diverse, previously unseen real-world environments. We conduct 4 days of real-world testing of this system, spanning 31 different objects across 13 different real-world environments. Our system achieves a success rate of 61% on opening novel cabinets and drawers in unseen environments zero-shot. An analysis of the failure modes suggests that errors in perception are the most significant challenge for our system. We will open-source code and models for others to replicate and build upon our system.
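
As a purely illustrative example of the planning step above (not the released system), the sketch below generates end-effector waypoints that conform to a drawer's prismatic constraint, given a handle grasp pose and joint axis assumed to come from a perception module; the function name, interface, and default values are hypothetical.

import numpy as np

def drawer_pull_waypoints(handle_pose, axis, pull_distance=0.3, num_steps=10):
    """Illustrative sketch: a straight-line pull along the estimated prismatic axis.

    handle_pose: 4x4 grasp pose at the handle (assumed output of a perception module)
    axis: (3,) estimated prismatic joint axis pointing out of the drawer
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    waypoints = []
    for s in np.linspace(0.0, pull_distance, num_steps):
        wp = np.array(handle_pose, dtype=float)
        wp[:3, 3] += s * axis  # translate the grasp pose along the joint axis
        waypoints.append(wp)
    return waypoints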

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Aditya Prakash, Arjun Gupta, Saurabh Gupta
arXiv, 2023
website

Abstract: Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and the camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks (depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC) show the benefits of KPE.
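
As one plausible illustration of how crop location and intrinsics can be encoded (an assumption for exposition; the paper's exact KPE formulation may differ), the sketch below back-projects a crop's corner pixels through the inverse intrinsics to obtain viewing-ray directions, which can be concatenated with the crop's features.

import numpy as np

def ray_direction_features(crop_box, K):
    """Illustrative encoding: normalized viewing-ray directions of a crop's corners.

    crop_box: (u_min, v_min, u_max, v_max) in pixel coordinates
    K: 3x3 camera intrinsics matrix
    Returns a flat 12-dimensional feature vector.
    """
    u_min, v_min, u_max, v_max = crop_box
    corners = np.array([
        [u_min, v_min, 1.0],
        [u_max, v_min, 1.0],
        [u_min, v_max, 1.0],
        [u_max, v_max, 1.0],
    ])
    rays = (np.linalg.inv(K) @ corners.T).T              # back-project pixels to rays
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)  # keep only the direction
    return rays.reshape(-1)

# Example with hypothetical intrinsics and crop:
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
features = ray_direction_features((100, 80, 260, 300), K)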

Predicting Motion Plans for Articulating Everyday Objects
Arjun Gupta, Max Shepherd, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023
webpage / dataset

Abstract: Mobile manipulation tasks such as opening a door, pulling open a drawer, or lifting a toilet lid require constrained motion of the end-effector under environmental and task constraints. This, coupled with partial information in novel environments, makes it challenging to employ classical motion planning approaches at test time. Our key insight is to cast motion planning as a learning problem, leveraging past experience of solving similar planning problems to directly predict motion plans for mobile manipulation tasks in novel situations at test time. To enable this, we develop a simulator, ArtObjSim, that simulates articulated objects placed in real scenes. We then introduce SeqIK+θ0, a fast and flexible representation for motion plans. Finally, we learn models that use SeqIK+θ0 to quickly predict motion plans for articulating novel objects at test time. Experimental evaluation shows improved speed and accuracy at generating motion plans compared to pure search-based methods and pure learning methods.
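
To make the representation concrete, here is a rough sketch (not the released implementation) of how a predicted initial joint configuration θ0 and a sequence of predicted end-effector waypoints could be expanded into a joint-space trajectory via sequential inverse kinematics; the ik_solver interface is an assumption.

import numpy as np

def plan_from_seqik(theta_0, ee_waypoints, ik_solver):
    """Illustrative expansion of a SeqIK+θ0-style plan into joint-space waypoints.

    theta_0: (dof,) predicted starting joint configuration
    ee_waypoints: list of 4x4 end-effector poses predicted by a model
    ik_solver: callable (target_pose, seed) -> joint configuration (assumed interface)
    """
    trajectory = [np.asarray(theta_0, dtype=float)]
    for pose in ee_waypoints:
        # Seed each IK solve with the previous configuration so consecutive
        # waypoints stay close in joint space.
        theta_next = ik_solver(pose, seed=trajectory[-1])
        trajectory.append(np.asarray(theta_next, dtype=float))
    return np.stack(trajectory)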

Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds
Shaowei Liu, Saurabh Gupta*, Shenlong Wang*
Computer Vision and Pattern Recognition (CVPR), 2023
website / code

Abstract: We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1 degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, what parts are connected to what other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models enable retargeting to novel poses with sparse point correspondence guidance. We test our method on a new articulating robot dataset and the Sapiens dataset with common daily objects, as well as real-world scans. Experiments show that our method outperforms two leading prior works on various metrics.
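
The method jointly optimizes part segmentation, per-part transformations, and kinematics; as a small illustration of the kind of data term such an energy could contain (an assumption for exposition, not the paper's actual objective), the sketch below scores how well a candidate 1 degree-of-freedom revolute joint explains one part's motion between two frames.

import numpy as np
from scipy.spatial.transform import Rotation

def revolute_joint_energy(points_t, points_t1, axis, pivot, angle):
    """Illustrative energy term: squared residual after rotating one part's points
    about a candidate revolute joint and comparing to the next frame.

    points_t, points_t1: (N, 3) corresponding part points at consecutive frames
    axis: (3,) joint axis; pivot: (3,) point on the axis; angle: rotation in radians
    """
    points_t = np.asarray(points_t, dtype=float)
    points_t1 = np.asarray(points_t1, dtype=float)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    R = Rotation.from_rotvec(angle * axis).as_matrix()
    predicted = (points_t - pivot) @ R.T + pivot  # rotate about the joint axis through the pivot
    return float(np.sum((predicted - points_t1) ** 2))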

People

Contact

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS-2143873 (Project Title: CAREER: Learning Predictive Models for Visual Navigation and Object Interaction, PI: Saurabh Gupta). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.