Saurabh Gupta

I am an Assistant Professor at UIUC. Before this, I was a Research Scientist at Facebook AI Research in Pittsburgh, working with Prof. Abhinav Gupta. Earlier, I was a Computer Science graduate student at UC Berkeley, where I was advised by Prof. Jitendra Malik. Even earlier, I was an undergraduate at IIT Delhi in India, where I majored in Computer Science and Engineering.

Prospective Students: I am looking for strong and motivated students to work with. If you are interested in working with me, please apply directly through the CS or ECE departments and mention my name. No GRE needed! You do not need to contact me directly (and if you do, I am sorry that I may not be able to respond promptly). If you are already at UIUC, please fill out this form and I will be in touch if and when I have an opening that is a good fit for you.

Email / CSL 319 / CV / Scholar / Github / PhD Thesis

Research

I work on computer vision, robotics and machine learning. I am interested in building agents that can intelligently interact with the physical world around them. I am currently focusing on two aspects: a) choice and design of representations that enable such interaction, and b) what and how to learn from active interaction. Some example problems that we are tackling include: building spatio-semantic and topological representations for visual navigation, skill discovery, learning for and from videos, and active visual learning.

Research Group

Current

Past Members

Recent Talks

Teaching

Publications

2024

Opening Cabinets and Drawers in the Real World using a Commodity Mobile Manipulator
Arjun Gupta*, Michelle Zhang*, Rishik Sathua, Saurabh Gupta
arXiv, 2024
abstract / bibtex / website

Pulling open cabinets and drawers presents many difficult technical challenges in perception (inferring articulation parameters for objects from onboard sensors), planning (producing motion plans that conform to tight task constraints), and control (making and maintaining contact while applying forces on the environment). In this work, we build an end-to-end system that enables a commodity mobile manipulator (Stretch RE2) to pull open cabinets and drawers in diverse previously unseen real world environments. We conduct 4 days of real world testing of this system spanning 31 different objects from across 13 different real world environments. Our system achieves a success rate of 61% on opening novel cabinets and drawers in unseen environments zero-shot. An analysis of the failure modes suggests that errors in perception are the most significant challenge for our system. We will open source code and models for others to replicate and build upon our system.

@article{gupta2024opening,
author = "Gupta, Arjun and Zhang, Michelle and Sathua, Rishik and Gupta, Saurabh",
title = "Opening Cabinets and Drawers in the Real World using a Commodity Mobile Manipulator",
journal = "arXiv",
volume = "2402.17767",
year = "2024"
}

Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning
Xiaoyu Zhang*, Matthew Chang*, Pranav Kumar, Saurabh Gupta
arXiv, 2024
abstract / bibtex / website

A common failure mode for policies trained with imitation is compounding execution errors at test time. When the learned policy encounters states that were not present in the expert demonstrations, the policy fails, leading to degenerate behavior. The Dataset Aggregation, or DAgger, approach to this problem simply collects more data to cover these failure states. However, in practice, this is often prohibitively expensive. In this work, we propose Diffusion Meets DAgger (DMD), a method to reap the benefits of DAgger without the cost, for eye-in-hand imitation learning problems. Instead of collecting new samples to cover out-of-distribution states, DMD uses recent advances in diffusion models to synthesize these samples. This leads to robust performance from few demonstrations. In experiments conducted for non-prehensile pushing on a Franka Research 3, we show that DMD can achieve a success rate of 80% with as few as 8 expert demonstrations, where naive behavior cloning reaches only 20%. DMD also outperforms competing NeRF-based augmentation schemes by 50%.
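
To make the idea concrete, here is a minimal sketch (not the authors' implementation) of the kind of augmentation loop DMD implies: perturb each demonstration frame as if the wrist camera were slightly displaced and label the result with a corrective action. The helpers synthesize_offset_view and corrective_action are hypothetical placeholders for the diffusion-based view synthesis and the action relabeling.

# Illustrative sketch only; the helper functions below are hypothetical
# stand-ins, not the paper's actual components.
import numpy as np

def synthesize_offset_view(image, offset):
    """Hypothetical: a diffusion model would render the scene as if the wrist
    camera were displaced by `offset`. Here it is a no-op placeholder."""
    return image

def corrective_action(expert_action, offset):
    """Hypothetical: label the perturbed view with an action that first cancels
    the camera offset and then follows the expert."""
    return expert_action - offset

def augment_demo(images, actions, num_aug=4, scale=0.01, rng=None):
    """Generate off-distribution (image, action) pairs from one demonstration."""
    if rng is None:
        rng = np.random.default_rng(0)
    aug_images, aug_actions = [], []
    for img, act in zip(images, actions):
        for _ in range(num_aug):
            offset = rng.normal(0.0, scale, size=np.asarray(act).shape)
            aug_images.append(synthesize_offset_view(img, offset))
            aug_actions.append(corrective_action(np.asarray(act), offset))
    return aug_images, aug_actions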

@article{zhang2024diffusion,
author = "Zhang, Xiaoyu and Chang, Matthew and Kumar, Pranav and Gupta, Saurabh",
title = "Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning",
journal = "arXiv",
volume = "2402.17768",
year = "2024"
}

Bootstrapping Autonomous Radars with Self-Supervised Learning
Yiduo Hao*, Sohrab Madani*, Junfeng Guan, Mohammed Alloulah, Saurabh Gupta, Haitham Hassanieh
Computer Vision and Pattern Recognition (CVPR), 2024
abstract / bibtex

The perception of autonomous vehicles using radars has attracted increased research interest due to its ability to operate in fog and bad weather. However, training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck, we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection, we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP.
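
For intuition, below is a minimal sketch of the kind of combined objective described above, written with PyTorch; the encoder modules, the symmetric InfoNCE form, and the 0.5 weighting of the radar-to-vision term are assumptions for illustration, not the paper's exact recipe.

# Sketch: radar-to-radar plus radar-to-vision contrastive pre-training.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings of shape (N, D)."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (N, N) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pretraining_loss(radar_encoder, vision_encoder, heatmap_v1, heatmap_v2, image):
    """Two augmented radar heatmap views plus the paired camera image."""
    r1, r2 = radar_encoder(heatmap_v1), radar_encoder(heatmap_v2)
    v = vision_encoder(image)
    return info_nce(r1, r2) + 0.5 * info_nce(r1, v)       # radar-radar + radar-vision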

@inproceedings{hao2024bootstrapping,
author = "Hao*, Yiduo and Guan, Sohrab Madani*and Junfeng and Alloulah, Mohammed and Gupta, Saurabh and Hassanieh, Haitham",
title = "Bootstrapping Autonomous Radars with Self-Supervised Learning",
booktitle = "Computer Vision and Pattern Recognition (CVPR)",
year = "2024"
}

2023

GOAT: GO to Any Thing
Matthew Chang*, Theophile Gervet*, Mukul Khanna*, Sriram Yenamandra*, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, Devendra Chaplot
arXiv, 2023
abstract / bibtex / website

In deployment scenarios such as homes and warehouses, mobile robots are expected to autonomously navigate for extended periods, seamlessly executing tasks articulated in terms that are intuitively understandable by human operators. We present GO To Any Thing (GOAT), a universal navigation system capable of tackling these requirements with three key features: a) Multimodal: it can tackle goals specified via category labels, target images, and language descriptions, b) Lifelong: it benefits from its past experience in the same environment, and c) Platform Agnostic: it can be quickly deployed on robots with different embodiments. GOAT is made possible through a modular system design and a continually augmented instance-aware semantic memory that keeps track of the appearance of objects from different viewpoints in addition to category-level semantics. This allows GOAT to distinguish between different instances of the same category and enables navigation to targets specified by images and language descriptions. In experimental comparisons spanning over 90 hours in 9 different homes, comprising 675 goals selected across 200+ different object instances, we find GOAT achieves an overall success rate of 83%, surpassing previous methods and ablations by 32% (absolute improvement). GOAT improves with experience in the environment, from a 60% success rate at the first goal to a 90% success rate after exploration. In addition, we demonstrate that GOAT can readily be applied to downstream tasks such as pick and place and social navigation.

@article{chang2023goat,
author = "Chang, Matthew and Gervet, Theophile and Khanna, Mukul and Yenamandra, Sriram and Shah, Dhruv and Min, So Yeon and Shah, Kavit and Paxton, Chris and Gupta, Saurabh and Batra, Dhruv and Mottaghi, Roozbeh and Malik, Jitendra and Chaplot, Devendra Singh",
title = "{GOAT: GO to Any Thing}",
year = "2023",
journal = "arXiv",
volume = "2311.06430"
}

3D Hand Pose Estimation in Egocentric Images in the Wild
Aditya Prakash, Ruisen Tu, Matthew Chang, Saurabh Gupta
arXiv, 2023
abstract / bibtex / website

We present WildHands, a method for 3D hand pose estimation in egocentric images in the wild. This is challenging due to (a) lack of 3D hand pose annotations for images in the wild, and (b) a form of perspective distortion-induced shape ambiguity that arises in the analysis of crops around hands. For the former, we use auxiliary supervision on in-the-wild data in the form of segmentation masks & grasp labels in addition to 3D supervision available in lab datasets. For the latter, we provide spatial cues about the location of the hand crop in the camera's field of view. Our approach achieves the best 3D hand pose on the ARCTIC leaderboard and outperforms FrankMocap, a popular and robust approach for estimating hand pose in the wild, by 45.3% when evaluated on 2D hand pose on our EPIC-HandKps dataset.

@article{prakash2023hand,
author = "Prakash, Aditya and Tu, Ruisen and Chang, Matthew and Gupta, Saurabh",
title = "3D Hand Pose Estimation in Egocentric Images in the Wild",
journal = "arXiv",
volume = "2312.06583",
year = "2023"
}

Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops
Aditya Prakash, Arjun Gupta, Saurabh Gupta
arXiv, 2023
abstract / bibtex / website

Objects undergo varying amounts of perspective distortion as they move across a camera's field of view. Models for predicting 3D from a single image often work with crops around the object of interest and ignore the location of the object in the camera's field of view. We note that ignoring this location information further exaggerates the inherent ambiguity in making 3D inferences from 2D images and can prevent models from even fitting to the training data. To mitigate this ambiguity, we propose Intrinsics-Aware Positional Encoding (KPE), which incorporates information about the location of crops in the image and camera intrinsics. Experiments on three popular 3D-from-a-single-image benchmarks: depth prediction on NYU, 3D object detection on KITTI & nuScenes, and predicting 3D shapes of articulated objects on ARCTIC, show the benefits of KPE.
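
As a concrete illustration of conditioning on crop location and intrinsics, the sketch below back-projects the crop corners into normalized ray directions using the camera matrix K; the exact encoding used by KPE may differ, and the sample intrinsics are made up.

# Sketch: an intrinsics-aware positional cue for an image crop.
import numpy as np

def crop_ray_encoding(crop_xyxy, K):
    """crop_xyxy: (x0, y0, x1, y1) in pixels; K: 3x3 camera intrinsics."""
    x0, y0, x1, y1 = crop_xyxy
    corners = np.array([[x0, y0, 1.0], [x1, y0, 1.0],
                        [x0, y1, 1.0], [x1, y1, 1.0]])
    rays = corners @ np.linalg.inv(K).T                    # back-projected directions
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    return rays.reshape(-1)                                # 12-dim feature for the model

K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
encoding = crop_ray_encoding((100, 80, 220, 200), K)       # crop near the image center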

@article{prakash2023mitigating,
author = "Prakash, Aditya and Gupta, Arjun and Gupta, Saurabh",
title = "Mitigating Perspective Distortion-induced Shape Ambiguity in Image Crops",
journal = "arXiv",
volume = "2312.06594",
year = "2023"
}

Push Past Green: Learning to Look Behind Plant Foliage by Moving It
Xiaoyu Zhang, Saurabh Gupta
Conference on Robot Learning (CoRL), 2023
abstract / bibtex / webpage / code+data

Autonomous agriculture applications (e.g., inspection, phenotyping, plucking fruits) require manipulating the plant foliage to look behind the leaves and the branches. Partial visibility, extreme clutter, thin structures, and unknown geometry and dynamics for plants make such manipulation challenging. We tackle these challenges through data-driven methods. We use self-supervision to train SRPNet, a neural network that predicts what space is revealed on execution of a candidate action on a given plant. We use SRPNet with the cross-entropy method to predict actions that are effective at revealing space beneath plant foliage. Furthermore, as SRPNet does not just predict how much space is revealed but also where it is revealed, we can execute a sequence of actions that incrementally reveal more and more space beneath the plant foliage. We experiment with a synthetic (vines) and a real plant (Dracaena) on a physical test-bed across 5 settings including 2 settings that test generalization to novel plant configurations. Our experiments reveal the effectiveness of our overall method, PPG, over a competitive hand-crafted exploration method, and the effectiveness of SRPNet over a hand-crafted dynamics model and relevant ablations.
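
For readers unfamiliar with the cross-entropy method, the sketch below shows how a learned revealed-space predictor can drive action selection; predict_revealed_space is a hypothetical stand-in for SRPNet, and the action parameterization is made up.

# Sketch: cross-entropy-method action selection against a learned predictor.
import numpy as np

def predict_revealed_space(obs, actions):
    """Hypothetical scorer: would return predicted revealed area per action."""
    return -np.linalg.norm(actions, axis=1)                # dummy scores for the sketch

def cem_select_action(obs, action_dim=4, iters=5, pop=64, elite=8, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        candidates = rng.normal(mean, std, size=(pop, action_dim))
        scores = predict_revealed_space(obs, candidates)
        elites = candidates[np.argsort(scores)[-elite:]]   # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                            # action to execute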

@inproceedings{zhang2023push,
author = "Zhang, Xiaoyu and Gupta, Saurabh",
title = "Push Past Green: Learning to Look Behind Plant Foliage by Moving It",
year = "2023",
booktitle = "Conference on Robot Learning"
}

Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Matthew Chang, Aditya Prakash, Saurabh Gupta
Neural Information Processing Systems (NeurIPS), 2023
abstract / bibtex / website

The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector. In this sense, the human hand presents a nuisance. However, often hands also provide a valuable signal, e.g. the hand pose may suggest what kind of object is being held. In this work, we propose to extract a factored representation of the scene that separates the agent (human hand) and the environment. This alleviates both occlusion and mismatch while preserving the signal, thereby easing the design of models for downstream robotics tasks. At the heart of this factorization is our proposed Video Inpainting via Diffusion Model (VIDM) that leverages both a prior on real-world images (through a large-scale pre-trained diffusion model) and the appearance of the object in earlier frames of the video (through attention). Our experiments demonstrate the effectiveness of VIDM at improving inpainting quality on egocentric videos and the power of our factored representation for numerous tasks: object detection, 3D reconstruction of manipulated objects, and learning of reward functions, policies, and affordances from videos.

@inproceedings{chang2023look,
author = "Chang, Matthew and Prakash, Aditya and Gupta, Saurabh",
title = "Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos",
booktitle = "Advances in Neural Information Processing Systems",
year = "2023"
}

Learning Hand-Held Object Reconstruction from In-The-Wild Videos
Aditya Prakash, Matthew Chang, Matthew Jin, Saurabh Gupta
arXiv, 2023
abstract / bibtex / website

Prior works for reconstructing hand-held objects from a single image rely on direct 3D shape supervision which is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of in-the-wild raw video data showing hand-object interactions. In this paper, we automatically extract 3D supervision (via multiview 2D supervision) from such raw video data to scale up the learning of models for hand-held object reconstruction. This requires tackling two key challenges: unknown camera pose and occlusion. For the former, we use hand pose (predicted from existing techniques, e.g. FrankMocap) as a proxy for object pose. For the latter, we learn data-driven 3D shape priors using synthetic objects from the ObMan dataset. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments on the MOW and HO3D datasets show the effectiveness of these supervisory signals at predicting the 3D shape for real-world hand-held objects without any direct real-world 3D supervision.

@article{prakash2023learning,
author = "Prakash, Aditya and Chang, Matthew and Jin, Matthew and Gupta, Saurabh",
title = "Learning Hand-Held Object Reconstruction from In-The-Wild Videos",
journal = "arXiv",
volume = "2305.03036",
year = "2023"
}

Object-centric Contact Field for Grasp Generation
Shaowei Liu, Yang Zhou, Jimei Yang, Saurabh Gupta*, Shenlong Wang*
International Conference on Computer Vision (ICCV), 2023
abstract / bibtex / webpage / code

This paper presents ContactGen, a novel object-centric contact representation for hand-object interaction. ContactGen comprises three components: a contact map indicating the contact location, a part map representing the contacting hand part, and a direction map giving the contact direction within each part. Given an input object, we propose a conditional generative model to predict ContactGen and adopt model-based optimization to predict diverse and geometrically feasible grasps. Experimental results demonstrate our method can generate high-fidelity and diverse human grasps for various objects.
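
The three-part representation can be pictured as a small per-point data structure over the object point cloud; the shapes below are illustrative assumptions, not the paper's exact tensor layout.

# Sketch: a container mirroring the three ContactGen components.
from dataclasses import dataclass
import numpy as np

@dataclass
class ContactRepresentation:
    contact_map: np.ndarray    # (N,)   probability that each object point is in contact
    part_map: np.ndarray       # (N, P) distribution over P hand parts for each point
    direction_map: np.ndarray  # (N, 3) unit contact direction within the assigned part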

@inproceedings{liu2023object,
author = "Liu, Shaowei and Zhou, Yang and Yang, Jimei and Gupta*, Saurabh and Wang*, Shenlong",
title = "Object-centric Contact Field for Grasp Generation",
year = "2023",
booktitle = "International Conference on Computer Vision (ICCV)"
}

Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds
Shaowei Liu, Saurabh Gupta*, Shenlong Wang*
Computer Vision and Pattern Recognition (CVPR), 2023
abstract / bibtex / website / code

We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1 degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, what parts are connected to what other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models enable retargeting to novel poses with sparse point correspondence guidance. We test our method on a new articulating robot dataset and the Sapien dataset with common daily objects, as well as real-world scans. Experiments show that our method outperforms two leading prior works on various metrics.

@inproceedings{liu2023building,
author = "Liu, Shaowei and Gupta*, Saurabh and Wang*, Shenlong",
title = "Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds",
booktitle = "Computer Vision and Pattern Recognition (CVPR)",
year = "2023"
}

Predicting Motion Plans for Articulating Everyday Objects
Arjun Gupta, Max Shepherd, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023
abstract / bibtex / webpage / dataset

Mobile manipulation tasks such as opening a door, pulling open a drawer, or lifting a toilet lid require constrained motion of the end-effector under environmental and task constraints. This, coupled with partial information in novel environments, makes it challenging to employ classical motion planning approaches at test time. Our key insight is to cast this as a learning problem and leverage past experience of solving similar planning problems to directly predict motion plans for mobile manipulation tasks in novel situations at test time. To enable this, we develop a simulator, ArtObjSim, that simulates articulated objects placed in real scenes. We then introduce SeqIK+θ0, a fast and flexible representation for motion plans. Finally, we learn models that use SeqIK+θ0 to quickly predict motion plans for articulating novel objects at test time. Experimental evaluation shows improved speed and accuracy at generating motion plans compared to pure search-based methods and pure learning methods.

@inproceedings{gupta2023predicting,
author = "Gupta, Arjun and Shepherd, Max and Gupta, Saurabh",
title = "Predicting Motion Plans for Articulating Everyday Objects",
booktitle = "International Conference on Robotics and Automation (ICRA)",
year = "2023",
organization = "IEEE"
}

One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation
Matthew Chang, Saurabh Gupta
International Conference on Robotics and Automation (ICRA), 2023
abstract / bibtex / webpage / code

In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach where we a) separate out task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution which side-steps the DAgger problem and last centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model gets 100% and 48% success rates on two recent benchmarks, improving upon the current state-of-the-art by absolute 90% and 20% respectively.

@inproceedings{chang2023oneshot,
author = "Chang, Matthew and Gupta, Saurabh",
title = "One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation",
booktitle = "International Conference on Robotics and Automation (ICRA)",
year = "2023",
organization = "IEEE"
}

Contactless Material Identification with Millimeter Wave Vibrometry
Hailan Shanbhag, Sohrab Madani, Akhil Isanaka, Deepak Nair, Saurabh Gupta, Haitham Hassanieh
International Conference on Mobile Systems, Applications and Services (MobiSys), 2023
abstract / bibtex

This paper introduces RFVibe, a system that enables contactless material and object identification through the fusion of millimeter wave wireless signals with acoustic signals. In particular, RFVibe plays an audio sound next to the object, which generates micro-vibrations in the object. These micro-vibrations can be captured by shining a millimeter wave radar signal on the object and analyzing the phase of the reflected wireless signal. RFVibe can then extract several features including resonance frequencies and vibration modes, damping time of vibrations, and wireless reflection coefficients. These features are then used to enable more accurate identification, taking a step towards generalizing to different setups and locations. We implement RFVibe using an off-the-shelf millimeter-wave radar and an acoustic speaker. We evaluate it on 23 objects of 7 material types (Metal, Wood, Ceramic, Glass, Plastic, Cardboard, and Foam), obtaining 81.3% accuracy for material classification, a 30% improvement over prior work. RFVibe is able to classify with reasonable accuracy in scenarios that it has not encountered before, including different locations, angles, boundary conditions, and objects.

@inproceedings{shanbhag2023contactless,
author = "Shanbhag, Hailan and Madani, Sohrab and Isanaka, Akhil and Nair, Deepak and Gupta, Saurabh and Hassanieh, Haitham",
title = "Contactless Material Identification with Millimeter Wave Vibrometry",
booktitle = "International Conference on Mobile Systems, Applications and Services (MobiSys)",
year = "2023"
}

Exploiting Virtual Array Diversity For Accurate Radar Detection
Junfeng Guan, Sohrab Madani, Waleed Ahmed, Samah Hussein, Saurabh Gupta, Haitham Alhassanieh
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
abstract / bibtex

Using millimeter-wave radars as a perception sensor provides self-driving cars with robust sensing capability in adverse weather. However, mmWave radars currently lack sufficient spatial resolution for semantic scene understanding. This paper introduces Radatron++, a system that leverages cascaded MIMO (Multiple-Input Multiple-Output) radar to achieve accurate vehicle detection for self-driving cars. We develop a novel hybrid radar processing and deep learning approach to leverage the 10x finer resolution of the cascaded MIMO radar.

@inproceedings{guan2023exploiting,
author = "Guan, Junfeng and Madani, Sohrab and Ahmed, Waleed and Hussein, Samah and Gupta, Saurabh and Alhassanieh, Haitham",
title = "Exploiting Virtual Array Diversity For Accurate Radar Detection",
booktitle = "International Conference on Acoustics, Speech and Signal Processing (ICASSP)",
year = "2023"
}

2022

TIDEE: Room Reorganization using Visuo-Symbolic Common Sense Priors
Gabriel Sarch, Zhaoyuan Fang, Adam Harley, Paul Schydlo, Michael Tarr, Saurabh Gupta, Katerina Fragkiadaki
European Conference on Computer Vision (ECCV), 2022
abstract / bibtex / website

We introduce TIDEE, an embodied agent that tidies up a disordered scene based on learned commonsense object placement and room arrangement priors. TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects. Commonsense priors are encoded in three modules: i) visuo-semantic detectors that detect out-of-place objects, ii) an associative neural graph memory of objects and spatial relations that proposes plausible semantic receptacles and surfaces for object repositions, and iii) a visual search network that guides the agent's exploration for efficiently localizing the receptacle-of-interest in the current scene to reposition the object. We test TIDEE on tidying up disorganized scenes in the AI2THOR simulation environment. TIDEE carries out the task directly from pixel and raw depth input without ever having observed the same room beforehand, relying only on priors learned from a separate set of training houses. Human evaluations on the resulting room reorganizations show TIDEE outperforms ablative versions of the model that do not use one or more of the commonsense priors. On a related room rearrangement benchmark that allows the agent to view the goal state prior to rearrangement, a simplified version of our model significantly outperforms a top-performing method by a large margin. Code and data are available at the project website: https://tidee-agent.github.io/.

@inproceedings{sarch2022tidee,
author = "Sarch, Gabriel and Fang, Zhaoyuan and Harley, Adam and Schydlo, Paul and Tarr, Michael and Gupta, Saurabh and Fragkiadaki, Katerina",
title = "{TIDEE}: Room Reorganization using Visuo-Symbolic Common Sense Priors",
year = "2022",
booktitle = "European Conference on Computer Vision (ECCV)"
}

Radatron: Accurate Detection Using Multi-Resolution Cascaded MIMO Radar
Sohrab Madani*, Junfeng Guan*, Waleed Ahmed*, Saurabh Gupta, Haitham Hassanieh
European Conference on Computer Vision (ECCV), 2022
abstract / bibtex / website

Millimeter wave (mmWave) radars are becoming a more popular sensing modality in self-driving cars due to their favorable characteristics in adverse weather. Yet, they currently lack sufficient spatial resolution for semantic scene understanding. In this paper, we present Radatron, a system capable of accurate object detection using mmWave radar as a stand-alone sensor. To enable Radatron, we introduce a first-of-its-kind, high-resolution automotive radar dataset collected with a cascaded MIMO (Multiple Input Multiple Output) radar. Our radar achieves 5 cm range resolution and 1.2-degree angular resolution, 10x finer than other publicly available datasets. We also develop a novel hybrid radar processing and deep learning approach to achieve high vehicle detection accuracy. We train and extensively evaluate Radatron to show it achieves 92.6% AP50 and 56.3% AP75 accuracy in 2D bounding box detection, an 8% and 15.9% improvement over prior art respectively. Code and dataset are available on https://jguan.page/Radatron/.

@inproceedings{madani2022radatron,
author = "Madani*, Sohrab and Guan*, Junfeng and Ahmed*, Waleed and Gupta, Saurabh and Hassanieh, Haitham",
title = "Radatron: Accurate Detection Using Multi-Resolution Cascaded {MIMO} Radar",
year = "2022",
booktitle = "European Conference on Computer Vision (ECCV)"
}

On-Device CPU Scheduling for Robot Systems
Aditi Partap, Samuel Grayson, Muhammad Huzaifa, Sarita Adve, Brighten Godfrey, Saurabh Gupta, Kris Hauser, Radhika Mittal
International Conference on Intelligent Robots and Systems (IROS), 2022
abstract / bibtex

Robots have to take highly responsive real-time actions, driven by complex decisions involving a pipeline of sensing, perception, planning, and reaction tasks. These tasks must be scheduled on resource-constrained devices such that the performance goals and the requirements of the application are met. This is a difficult problem that requires handling multiple scheduling dimensions, and variations in computational resource usage and availability. In practice, system designers manually tune parameters for their specific hardware and application, which results in poor generalization and increases the development burden. In this work, we highlight the emerging need for scheduling CPU resources at runtime in robot systems. We use robot navigation as a case-study to understand the key scheduling requirements for such systems. Armed with this understanding, we develop a CPU scheduling framework, Catan, that dynamically schedules compute resources across different components of an app so as to meet the specified application requirements. Through experiments with a prototype implemented on ROS, we show the impact of system scheduling on meeting the application's performance goals, and how Catan dynamically adapts to runtime variations.

@inproceedings{pratap2022ondevice,
author = "Partap, Aditi and Grayson, Samuel and Huzaifa, Muhammad and Adve, Sarita and Godfrey, Brighten and Gupta, Saurabh and Hauser, Kris and Mittal, Radhika",
title = "On-Device CPU Scheduling for Robot Systems",
year = "2022",
booktitle = "International Conference on Intelligent Robots and Systems (IROS)"
}

Human Hands as Probes for Interactive Object Understanding
Mohit Goyal, Sahil Modi, Rishabh Goyal, Saurabh Gupta
Computer Vision and Pattern Recognition (CVPR), 2022
abstract / bibtex / webpage / code+data

Interactive object understanding, or what we can do to objects and how, is a long-standing goal of computer vision. In this paper, we tackle this problem through observation of human hands in in-the-wild egocentric videos. We demonstrate that observing what human hands interact with and how can provide both the relevant data and the necessary supervision. Attending to hands readily localizes and stabilizes active objects for learning and reveals places where interactions with objects occur. Analyzing the hands shows what we can do to objects and how. We apply these basic principles on the EPIC-KITCHENS dataset and successfully learn state-sensitive features and object affordances (regions of interaction and afforded grasps), purely by observing hands in egocentric videos.

@inproceedings{goyal2022human,
author = "Goyal, Mohit and Modi, Sahil and Goyal, Rishabh and Gupta, Saurabh",
title = "Human Hands as Probes for Interactive Object Understanding",
year = "2022",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}

Learning Value Functions from Undirected State-only Experience
Matthew Chang*, Arjun Gupta*, Saurabh Gupta
International Conference on Learning Representations (ICLR), 2022
Deep Reinforcement Learning Workshop at NeurIPS, 2021
Offline Reinforcement Learning Workshop at NeurIPS, 2021
abstract / bibtex / webpage / arxiv link / code

This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e. (s, s', r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning (LAQ), an offline RL method that can learn effective value functions from state-only experience. LAQ learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions. Value functions learned using LAQ lead to sample efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods.
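
A toy version of the idea is sketched below: cluster state transitions into discrete latent actions, then run ordinary tabular Q-learning over them. The k-means-style clustering of (s' - s) is an illustrative simplification of the latent-variable future prediction model, not the paper's implementation.

# Sketch: Latent Action Q-learning in a tabular toy setting.
import numpy as np

def latent_actions(s, s_next, num_latents=4, iters=20, rng=None):
    """Cluster crude transition features (s' - s) into discrete latent actions."""
    if rng is None:
        rng = np.random.default_rng(0)
    feats = (np.asarray(s_next) - np.asarray(s)).astype(float).reshape(-1, 1)
    centers = feats[rng.choice(len(feats), num_latents, replace=False)]
    for _ in range(iters):
        assign = np.argmin(np.abs(feats - centers.T), axis=1)
        centers = np.array([feats[assign == k].mean() if np.any(assign == k)
                            else centers[k, 0] for k in range(num_latents)]).reshape(-1, 1)
    return assign

def laq_values(s, s_next, r, z, num_states, num_latents, gamma=0.95, lr=0.1, epochs=50):
    """Tabular Q-learning over latent actions; returns per-state value estimates."""
    Q = np.zeros((num_states, num_latents))
    for _ in range(epochs):
        for si, s2, ri, zi in zip(s, s_next, r, z):
            Q[si, zi] += lr * (ri + gamma * Q[s2].max() - Q[si, zi])
    return Q.max(axis=1)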

@inproceedings{chang2022learning,
author = "Chang*, Matthew and Gupta*, Arjun and Gupta, Saurabh",
title = "Learning Value Functions from Undirected State-only Experience",
booktitle = "International Conference on Learning Representations",
year = "2022"
}

2021

RB2: Robotic Manipulation Benchmarking with a Twist
Sudeep Dasari, Jianren Wang, Joyce Hong, Shikhar Bahl, Yixin Lin, Austin Wang, Abitha Thankaraj, Karanbir Chahal, Berk Calli, Saurabh Gupta, David Held, Lerrel Pinto, Deepak Pathak, Vikash Kumar, Abhinav Gupta
Neural Information Processing Systems (NeurIPS), 2021
abstract / bibtex / webpage

Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful for many research groups, and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g. the YCB object set), but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations, alongside the usual set of tasks and experimental protocols. The added baseline implementations will provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these 'local rankings' could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus we establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called the Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks that are inspired by clinically validated Southampton Hand Assessment Procedures. Our benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines like open-loop behavior cloning outperform more complicated models (e.g. closed loop, RNN, Offline-RL, etc.) that are preferred by the field. We hope our fellow researchers will use RB2 to improve the quality and rigor of their research.

@inproceedings{dasari2021rb2,
author = "Dasari, Sudeep and Wang, Jianren and Hong, Joyce and Bahl, Shikhar and Lin, Yixin and Wang, Austin and Thankaraj, Abitha and Chahal, Karanbir and Calli, Berk and Gupta, Saurabh and Held, David and Pinto, Lerrel and Pathak, Deepak and Kumar, Vikash and Gupta, Abhinav",
title = "RB2: Robotic Manipulation Benchmarking with a Twist",
booktitle = "Advances in Neural Information Processing Systems (Track on Datasets and Benchmarks)",
year = "2021"
}

SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency
Devendra Chaplot, Murtaza Dalal, Saurabh Gupta, Jitendra Malik, Ruslan Salakhutdinov
Neural Information Processing Systems (NeurIPS), 2021
abstract / bibtex / website

In this paper, we explore how we can build upon the data and models of Internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and for labelling the agent observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves object detection and instance segmentation performance of a pretrained perception model by just moving around in training environments and the improved perception model can be used to improve Object Goal Navigation.

@inproceedings{chaplot2021seal,
author = "Chaplot, Devendra and Dalal, Murtaza and Gupta, Saurabh and Malik, Jitendra and Salakhutdinov, Ruslan",
title = "SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency",
booktitle = "Advances in Neural Information Processing Systems",
year = "2021"
}

Learned Visual Navigation for Under-Canopy Agricultural Robots
Arun Sivakumar, Sahil Modi, Mateus Gasparino, Che Ellis, Andres Velasquez, Girish Chowdhary*, Saurabh Gupta*
Robotics: Science and Systems (RSS), 2021
abstract / bibtex / website

This paper describes a system for visually guided autonomous navigation of under-canopy farm robots. Low-cost under-canopy robots can drive between crop rows under the plant canopy and accomplish tasks that are infeasible for over-the-canopy drones or larger agricultural equipment. However, autonomously navigating them under the canopy presents a number of challenges: unreliable GPS and LiDAR, high cost of sensing, challenging farm terrain, clutter due to leaves and weeds, and large variability in appearance over the season and across crop types. We address these challenges by building a modular system that leverages machine learning for robust and generalizable perception from monocular RGB images from low-cost cameras, and model predictive control for accurate control in challenging terrain. Our system, CropFollow, is able to autonomously drive 485 meters per intervention on average, outperforming a state-of-the-art LiDAR based system (286 meters per intervention) in extensive field testing spanning over 25 km.

@inproceedings{sivakumar2021learned,
author = "Sivakumar, Arun Narenthiran and Modi, Sahil and Gasparino, Mateus Valverde and Ellis, Che and Velasquez, Andres Baquero and Chowdhary, Girish and Gupta, Saurabh",
title = "Learned Visual Navigation for Under-Canopy Agricultural Robots",
booktitle = "Robotics: Science and Systems",
year = "2021"
}

On the Use of ML for Blackbox System Performance Prediction
Silvery Fu, Saurabh Gupta, Radhika Mittal, Sylvia Ratnasamy
USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2021
abstract / bibtex / slides / video / code / dataset

There is a growing body of work that reports positive results from applying ML-based performance prediction to a particular application or use-case (e.g. server configuration, capacity planning). Yet, a critical question remains unanswered: does ML make prediction simpler (i.e., allowing us to treat systems as blackboxes) and general (i.e., across a range of applications and use-cases)? After all, the potential for simplicity and generality is a key part of what makes ML-based prediction so attractive compared to the traditional approach of relying on handcrafted and specialized performance models. In this paper, we attempt to answer this broader question. We develop a methodology for systematically diagnosing whether, when, and why ML does (not) work for performance prediction, and identify steps to improve predictability. We apply our methodology to test 6 ML models in predicting the performance of 13 real-world applications. We find that 12 out of our 13 applications exhibit inherent variability in performance that fundamentally limits prediction accuracy. Our findings motivate the need for system-level modifications and/or ML-level extensions that can improve predictability, showing how ML fails to be an easy-to-use predictor. On implementing and evaluating these changes, we find that while they do improve the overall prediction accuracy, prediction error remains high for multiple realistic scenarios, showing how ML fails as a general predictor.
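
The blackbox-prediction setup being stress-tested here can be pictured as a plain regression problem from configuration features to observed performance; the sketch below uses scikit-learn with synthetic data and a random-forest regressor purely as an illustrative stand-in for the models and applications studied in the paper.

# Sketch: ML-based blackbox performance prediction as regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))                       # e.g. cores, memory, input size, ...
noise = rng.normal(0.0, 0.05, size=500)              # inherent run-to-run variability
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + noise       # synthetic "runtime"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rel_err = np.abs(model.predict(X_te) - y_te) / np.abs(y_te)
print(f"median relative prediction error: {np.median(rel_err):.3f}")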

@inproceedings{fu2021use,
author = "Fu, Silvery and Gupta, Saurabh and Mittal, Radhika and Ratnasamy, Sylvia",
title = "On the Use of {ML} for Blackbox System Performance Prediction",
booktitle = "USENIX Symposium on Networked Systems Design and Implementation (NSDI)",
year = "2021"
}

2020

Semantic Visual Navigation by Watching YouTube Videos
Matthew Chang, Arjun Gupta, Saurabh Gupta
Neural Information Processing Systems (NeurIPS), 2020
abstract / bibtex / arxiv link / webpage / video / code

Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos do not come with labels for actions or goals, and may not even showcase optimal behavior. Our method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). We show that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency at the ObjectGoal task in visually realistic simulations. We observe a relative improvement of 15-83% over end-to-end RL, behavior cloning, and classical methods, while using minimal direct interaction.

@inproceedings{chang2020semantic,
author = "Chang, Matthew and Gupta, Arjun and Gupta, Saurabh",
title = "Semantic Visual Navigation by Watching YouTube Videos",
booktitle = "Advances in Neural Information Processing Systems",
year = "2020"
}

Semantic Curiosity for Active Visual Learning
Devendra Chaplot*, Helen Jiang*, Saurabh Gupta, Abhinav Gupta
European Conference on Computer Vision (ECCV), 2020
abstract / bibtex / arxiv link / webpage

In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this will require labeling millions of frames required for training RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation -- the detection outputs should be consistent. Therefore, our semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with other possible alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.

@inproceedings{chaplot2020semantic,
author = "Chaplot, Devendra Singh and Jiang, Helen and Gupta, Saurabh and Gupta, Abhinav",
title = "Semantic Curiosity for Active Visual Learning",
year = "2020",
booktitle = "European Conference on Computer Vision (ECCV)"
}

Aligning Videos in Space and Time
Senthil Purushwalkam*, Tian Ye*, Saurabh Gupta, Abhinav Gupta
European Conference on Computer Vision (ECCV), 2020
abstract / bibtex / arxiv link

In this paper, we focus on the task of extracting visual correspondences across videos. Given a query video clip from an action class, we aim to align it with training videos in space and time. Obtaining training data for such a fine-grained alignment task is challenging and often ambiguous. Hence, we propose a novel alignment procedure that learns such correspondence in space and time via cross video cycle-consistency. During training, given a pair of videos, we compute cycles that connect patches in a given frame in the first video by matching through frames in the second video. Cycles that connect overlapping patches together are encouraged to score higher than cycles that connect non-overlapping patches. Our experiments on the Penn Action and Pouring datasets demonstrate that the proposed method can successfully learn to correspond semantically similar patches across videos, and learns representations that are sensitive to object and action states.

@inproceedings{purushwalkam2020aligning,
author = "Purushwalkam, Senthil and Ye, Tian and Gupta, Saurabh and Gupta, Abhinav",
title = "Aligning Videos in Space and Time",
year = "2020",
booktitle = "European Conference on Computer Vision (ECCV)"
}

Neural Topological SLAM for Visual Navigation
Devendra Chaplot, Ruslan Salakhutdinov*, Abhinav Gupta*, Saurabh Gupta*
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / webpage / video

This paper studies the problem of image-goal navigation which involves navigating to the location indicated by a goal image in a novel previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features, that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.

@inproceedings{chaplot2020neural,
author = "Chaplot, Devendra Singh and Salakhutdinov, Ruslan and Gupta, Abhinav and Gupta, Saurabh",
title = "Neural Topological SLAM for Visual Navigation",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}

Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / arxiv link / webpage / code+data

When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few shot examples.

@inproceedings{ehsani2020force,
author = "Ehsani, Kiana and Tulsiani, Shubham and Gupta, Saurabh and Farhadi, Ali and Gupta, Abhinav",
title = "Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}

Through Fog High Resolution Imaging Using Millimeter Wave Radar
Junfeng Guan, Sohrab Madani, Suraj Jog, Saurabh Gupta, Haitham Hassanieh
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / website

This paper demonstrates high-resolution imaging using millimeter wave (mmWave) radars that can function even in dense fog. We leverage the fact that mmWave signals have favorable propagation characteristics in low visibility conditions, unlike optical sensors like cameras and LiDARs which cannot penetrate through dense fog. Millimeter wave radars, however, suffer from very low resolution, specularity, and noise artifacts. We introduce HawkEye, a system that leverages a cGAN architecture to recover high-frequency shapes from raw low-resolution mmWave heatmaps. We propose a novel design that addresses challenges specific to the structure and nature of the radar signals involved. We also develop a data synthesizer to aid with large-scale dataset generation for training. We implement our system on a custom-built mmWave radar platform and demonstrate performance improvement over both standard mmWave radars and other competitive baselines.

@inproceedings{guan2020through,
author = "Guan, Junfeng and Madani, Sohrab and Jog, Suraj and Gupta, Saurabh and Hassanieh, Haitham",
title = "Through Fog High Resolution Imaging Using Millimeter Wave Radar",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}

Efficient Bimanual Manipulation Using Learned Task Schemas
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
International Conference on Robotics and Automation (ICRA), 2020
abstract / bibtex / video

We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema's state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware.

@inproceedings{chitnis2020efficient,
author = "Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav",
title = "Efficient Bimanual Manipulation Using Learned Task Schemas",
booktitle = "International Conference on Robotics and Automation",
year = "2020"
}

Intrinsic Motivation for Encouraging Synergistic Behavior
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / webpage

We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal they could not achieve individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.
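
The intrinsic reward can be summarized as the discrepancy between the observed (or predicted) effect of the joint action and the composition of per-agent predicted effects; the sketch below uses hypothetical stub forward models and an additive composition rule purely for illustration.

# Sketch: a synergy-style intrinsic reward.
import numpy as np

def predict_effect_single(state, action_i):
    """Hypothetical per-agent forward model: next state if only agent i acts."""
    return state + action_i

def compose(effect_1, effect_2, state):
    """Compose individual effects by summing their deltas from the current state."""
    return state + (effect_1 - state) + (effect_2 - state)

def intrinsic_reward(state, a1, a2, observed_next_state):
    composed = compose(predict_effect_single(state, a1),
                       predict_effect_single(state, a2), state)
    return float(np.linalg.norm(observed_next_state - composed))  # high when synergistic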

@inproceedings{chitnis2020intrinsic,
author = "Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav",
title = "Intrinsic Motivation for Encouraging Synergistic Behavior",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/forum?id=SJleNCNtDH"
}

Learning To Explore Using Active Neural Mapping
Devendra Chaplot, Dhiraj Gandhi, Saurabh Gupta*, Abhinav Gupta*, Ruslan Salakhutdinov*
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / openreview link / webpage / code / video / slides

This work presents a modular and hierarchical approach to learn policies for exploring 3D environments. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with learned mappers, and global and local policies. Use of learning provides flexibility with respect to input modalities (in mapper), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our proposed approach over past learning and geometry-based approaches.

@inproceedings{chaplot2020learning,
author = "Chaplot, Devendra Singh and Gandhi, Dhiraj and Gupta, Saurabh and Gupta, Abhinav and Salakhutdinov, Ruslan",
title = "Learning To Explore Using Active Neural Mapping",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/pdf?id=HklXn1BKDH"
}

Learning to Move with Affordance Maps
William Qi, Ravi Mullapudi, Saurabh Gupta, Deva Ramanan
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / openreview link / code

The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent, from household robotic vacuums to autonomous vehicles. Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry, but fail to model dynamic objects (such as other agents) or semantic constraints (such as wet floors or doorways). Learning-based RL agents are an attractive alternative because they can incorporate both semantic and geometric information, but are notoriously sample inefficient, difficult to generalize to novel settings, and hard to interpret. In this paper, we combine the best of both worlds with a modular approach that learns a spatial representation of a scene that is trained to be effective when coupled with traditional geometric planners. Specifically, we design an agent that learns to predict a spatial affordance map that elucidates what parts of a scene are navigable through active self-supervised experience gathering. In contrast to most simulation environments that assume a static world, we evaluate our approach in the VizDoom simulator, using large-scale randomly-generated maps containing a variety of dynamic actors and hazards. We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.
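
One way to picture coupling a learned affordance map with a traditional geometric planner is to fold the predicted navigability into the planner's cost map; the sketch below does this with Dijkstra on a grid, with the weighting chosen arbitrarily for illustration rather than taken from the paper.

# Sketch: planning on a cost map that fuses geometry and learned affordances.
import heapq
import numpy as np

def plan(occupancy, affordance_prob, start, goal, affordance_weight=5.0):
    """occupancy: (H, W) in {0, 1}; affordance_prob: (H, W) predicted navigability."""
    H, W = occupancy.shape
    cost = 1.0 + affordance_weight * (1.0 - affordance_prob)   # low affordance -> expensive
    cost[occupancy == 1] = np.inf                              # hard geometric obstacles
    dist = np.full((H, W), np.inf)
    dist[start] = 0.0
    heap, parent = [(0.0, start)], {}
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[r, c]:
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < H and 0 <= nc < W and d + cost[nr, nc] < dist[nr, nc]:
                dist[nr, nc] = d + cost[nr, nc]
                parent[(nr, nc)] = (r, c)
                heapq.heappush(heap, (dist[nr, nc], (nr, nc)))
    path, node = [], goal
    while node in parent:                                      # walk back to the start
        path.append(node)
        node = parent[node]
    return [start] + path[::-1]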

@inproceedings{qi2020learning,
author = "Qi, William and Mullapudi, Ravi Teja and Gupta, Saurabh and Ramanan, Deva",
title = "Learning to Move with Affordance Maps",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/pdf?id=BJgMFxrYPB"
}

2019

Learning Navigation Subroutines from Egocentric Videos
Ashish Kumar, Saurabh Gupta, Jitendra Malik
Conference on Robot Learning (CoRL), 2019
abstract / bibtex / website / arXiv link / code

Hierarchies are an effective way to boost sample efficiency in reinforcement learning, and computational efficiency in classical planning. However, acquiring hierarchies via hand-design (as in classical planning) is suboptimal, while acquiring them via end-to-end reward based training (as in reinforcement learning) is unstable and still prohibitively expensive. In this paper, we pursue an alternate paradigm for acquiring such hierarchical abstractions (or visuo-motor subroutines), via use of passive first-person observation data. We use an inverse model trained on small amounts of interaction data to pseudo-label the passive first person videos with agent actions. Visuo-motor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in context of navigation, and show that we can successfully learn consistent and diverse visuo-motor subroutines from passive first-person videos. We demonstrate the utility of our acquired visuo-motor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate behavior of our subroutines in the real world, by deploying them on a real robotic platform.

@inproceedings{kumar2019learning,
author = "Kumar, Ashish and Gupta, Saurabh and Malik, Jitendra",
title = "Learning Navigation Subroutines from Egocentric Videos",
booktitle = "Conference on Robot Learning",
year = "2019"
}
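
The pseudo-labeling step can be sketched in a few lines of PyTorch. Everything below is a placeholder: the encoder, the inverse model (which in the paper is first trained on a small amount of interaction data), and the random tensors standing in for consecutive video frames; the latent intent variable is omitted for brevity.

# Minimal sketch (not the authors' code) of pseudo-labeling passive video
# frames with an inverse model, then fitting a policy to the pseudo-actions.
import torch
import torch.nn as nn

NUM_ACTIONS, FEAT = 4, 128

class FrameEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
                                 nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
                                 nn.Flatten(), nn.LazyLinear(FEAT))
    def forward(self, x):
        return self.net(x)

encoder = FrameEncoder()
# Inverse model: predicts the action that took the agent from o_t to o_{t+1}.
# Here it is an untrained stand-in for a model fit on a little interaction data.
inverse_head = nn.Linear(2 * FEAT, NUM_ACTIONS)
# Policy: predicts an action from the current observation alone.
policy_head = nn.Linear(FEAT, NUM_ACTIONS)

def pseudo_label(obs_t, obs_tp1):
    """Label consecutive passive-video frames with inferred actions."""
    with torch.no_grad():
        f = torch.cat([encoder(obs_t), encoder(obs_tp1)], dim=1)
        return inverse_head(f).argmax(dim=1)

# Dummy stand-ins for a batch of consecutive egocentric video frames.
obs_t, obs_tp1 = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)
actions = pseudo_label(obs_t, obs_tp1)

# Behavior-clone the policy on the pseudo-labeled frames.
opt = torch.optim.Adam(list(encoder.parameters()) + list(policy_head.parameters()), lr=1e-3)
loss = nn.functional.cross_entropy(policy_head(encoder(obs_t)), actions)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))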

Combining Optimal Control and Learning for Visual Navigation in Novel Environments
Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, Claire Tomlin
Conference on Robot Learning (CoRL), 2019
abstract / bibtex / website / code

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. However, it is challenging to use model-based methods in settings where the environment is a priori unknown and can only be observed partially through on-board sensors on the robot. In this work, we address this shortcoming by coupling model-based control with learning-based perception. The learning-based perception module produces a series of waypoints that guide the robot to the goal via a collision-free path. These waypoints are used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel, previously-unknown environments as compared to a purely end-to-end learning-based alternative. Our approach, which we refer to as WayPtNav (WayPoint-based Navigation), is successfully able to exhibit goal-driven behavior without relying on detailed explicit 3D maps of the environment, works well with low frame rates, and generalizes well from simulation to the real world.

@inproceedings{bansal2019combining,
author = "Bansal, Somil and Tolani, Varun and Gupta, Saurabh and Malik, Jitendra and Tomlin, Claire",
title = "Combining Optimal Control and Learning for Visual Navigation in Novel Environments",
booktitle = "Conference on Robot Learning",
year = "2019"
}
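
The model-based half of such a pipeline can be illustrated with a cubic Hermite spline that connects the current pose to a predicted waypoint. Here the waypoint is hard-coded, whereas in the full system it would come from the learned perception module, and the feedback controller that tracks the trajectory is omitted.

# Sketch of the model-based half of a waypoint pipeline: connect the current
# pose to a (here, hard-coded) predicted waypoint with a cubic Hermite spline.
import numpy as np

def hermite_spline(p0, heading0, p1, heading1, speed=1.0, n=50):
    """Positions along a cubic Hermite curve from pose (p0, heading0) to
    (p1, heading1); headings become tangent vectors of magnitude `speed`."""
    t0 = speed * np.array([np.cos(heading0), np.sin(heading0)])
    t1 = speed * np.array([np.cos(heading1), np.sin(heading1)])
    s = np.linspace(0.0, 1.0, n)[:, None]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * t0 + h01 * p1 + h11 * t1

# Current robot pose at the origin, facing +x; waypoint 2 m ahead and 1 m to
# the left, to be reached facing 45 degrees. In the full system the waypoint
# would come from the perception network rather than being hard-coded.
traj = hermite_spline(np.array([0.0, 0.0]), 0.0,
                      np.array([2.0, 1.0]), np.pi / 4)
print(traj[:3], traj[-1])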

Learning Exploration Policies for Navigation
Tao Chen, Saurabh Gupta, Abhinav Gupta
International Conference on Learning Representations (ICLR), 2019
abstract / bibtex / arXiv link / website

Numerous past works have tackled the problem of task-driven navigation. But how to effectively explore a new environment to enable a variety of downstream tasks has received much less attention. In this work, we study how agents can autonomously explore realistic and complex 3D environments without the context of task rewards. We propose a learning-based approach and investigate different policy architectures, reward functions, and training paradigms. We find that the use of policies with spatial memory that are bootstrapped with imitation learning and finally finetuned with coverage rewards derived purely from on-board sensors can be effective at exploring novel environments. We show that our learned exploration policies can explore better than classical approaches based on geometry alone and generic learning-based exploration techniques. Finally, we also show how such task-agnostic exploration can be used for downstream tasks. Videos are available at https://sites.google.com/view/exploration-for-nav/.

@inproceedings{chen2018learning,
author = "Chen, Tao and Gupta, Saurabh and Gupta, Abhinav",
title = "Learning Exploration Policies for Navigation",
booktitle = "International Conference on Learning Representations",
year = "2019",
url = "https://openreview.net/forum?id=SyMWn05F7"
}
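
A coverage-style reward of the kind described above can be sketched as follows: the agent is rewarded for map cells it observes for the first time. The sensor footprint below is faked with random masks; the paper derives coverage from on-board sensors.

# Illustrative sketch of a coverage-style exploration reward: at every step the
# agent is rewarded for map cells it observes for the first time.
import numpy as np

class CoverageReward:
    def __init__(self, map_size=(100, 100)):
        self.seen = np.zeros(map_size, dtype=bool)

    def step(self, visible_cells):
        """visible_cells: boolean mask of cells observed at this time step."""
        newly_seen = visible_cells & ~self.seen
        self.seen |= visible_cells
        return int(newly_seen.sum())

rng = np.random.default_rng(0)
reward_fn = CoverageReward()
for t in range(3):
    visible = rng.random((100, 100)) < 0.02   # stand-in for a sensor footprint
    print(f"step {t}: reward = {reward_fn.step(visible)}")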

Cognitive mapping and planning for visual navigation
Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
International Journal of Computer Vision (IJCV), 2019
abstract / bibtex / website / arXiv link / code+simulation environment

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real world buildings. Our experiments demonstrate that CMP outperforms alternate learning-based architectures, as well as classical mapping and path planning approaches in many cases. Furthermore, it naturally extends to semantically specified goals, such as "going to a chair". We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.

@article{gupta2019cognitive,
author = "Gupta, Saurabh and Tolani, Varun and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra",
title = "Cognitive mapping and planning for visual navigation",
journal = "International Journal of Computer Vision",
year = "2019"
}
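
The differentiable neural-net planner can be illustrated, in spirit, by value iteration on a 2D map written as convolution plus a max over actions, which keeps the whole mapping-and-planning stack end-to-end trainable. This is a generic sketch, not the released CMP code.

# Generic sketch of a differentiable value-iteration planner on a 2D belief
# map (in the spirit of CMP's neural-net planner; not the released code).
import torch
import torch.nn.functional as F

def value_iteration(reward, num_iters=20, gamma=0.99):
    """reward: (B, 1, H, W) map, e.g. +1 at the goal and a small step penalty.
    Each iteration propagates value from the 4-neighborhood of every cell."""
    B, _, H, W = reward.shape
    # One convolution kernel per move (up, down, left, right), applied to V.
    kernels = torch.zeros(4, 1, 3, 3)
    kernels[0, 0, 0, 1] = 1.0   # value of the cell above
    kernels[1, 0, 2, 1] = 1.0   # below
    kernels[2, 0, 1, 0] = 1.0   # left
    kernels[3, 0, 1, 2] = 1.0   # right
    value = torch.zeros(B, 1, H, W)
    for _ in range(num_iters):
        q = reward + gamma * F.conv2d(value, kernels, padding=1)  # (B, 4, H, W)
        value = q.max(dim=1, keepdim=True).values
    return value, q

reward = torch.full((1, 1, 32, 32), -0.01)
reward[0, 0, 28, 28] = 1.0                      # goal cell
value, q = value_iteration(reward)
best_action = q[0, :, 0, 0].argmax()            # greedy action at cell (0, 0)
print(value.shape, int(best_action))

Because every operation above is differentiable, gradients can flow through the planner back into whatever module produced the reward or belief map.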

PyRobot: An Open-source Robotics Framework for Research and Benchmarking
Adithyavairavan Murali*, Tao Chen*, Kalyan Vasudev Alwala*, Dhiraj Gandhi*, Lerrel Pinto, Saurabh Gupta, Abhinav Gupta
arXiv, 2019
abstract / bibtex / locobot robot / pyrobot tutorial / pyrobot code / arXiv link

This paper introduces PyRobot, an open-source robotics framework for research and benchmarking. PyRobot is a lightweight, high-level interface on top of ROS that provides a consistent set of hardware-independent mid-level APIs to control different robots. PyRobot abstracts away details about low-level controllers and inter-process communication, and allows non-robotics researchers (ML, CV researchers) to focus on building high-level AI applications. PyRobot aims to provide a research ecosystem with convenient access to robotics datasets, algorithm implementations and models that can be used to quickly create a state-of-the-art baseline. We believe PyRobot, when paired up with low-cost robot platforms such as LoCoBot, will reduce the entry barrier into robotics, and democratize robotics. PyRobot is open-source, and can be accessed online.

@article{murali2019pyrobot,
author = "Murali*, Adithyavairavan and Chen*, Tao and Alwala*, Kalyan Vasudev and Gandhi*, Dhiraj and Pinto, Lerrel and Gupta, Saurabh and Gupta, Abhinav",
title = "PyRobot: An Open-source Robotics Framework for Research and Benchmarking",
journal = "arXiv preprint arXiv:1906.08236",
year = "2019"
}
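
A typical usage pattern looks roughly like the snippet below. The method names follow the public PyRobot examples from memory and may differ across versions and robots, so treat this as an illustration of the interface rather than verified code.

# Illustration of PyRobot's high-level interface (names follow the public
# examples; exact signatures may differ across versions and robots).
from pyrobot import Robot

robot = Robot('locobot')          # the same code targets other supported robots

rgb = robot.camera.get_rgb()      # current RGB frame from the on-board camera
depth = robot.camera.get_depth()  # aligned depth frame

# Base: drive 0.5 m forward in the robot frame (x forward, y left, theta yaw).
robot.base.go_to_relative([0.5, 0.0, 0.0])

# Arm: return to the home configuration, then move to a Cartesian pose.
robot.arm.go_home()
robot.arm.set_ee_pose(position=[0.4, 0.0, 0.3],
                      orientation=[0.0, 0.0, 0.0, 1.0])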

Segmenting unknown 3D objects from real depth images using Mask R-CNN trained on synthetic point clouds
Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, Ken Goldberg
International Conference on Robotics and Automation (ICRA), 2019
abstract / bibtex / website / code / arXiv link

The ability to segment unknown objects in depth images has potential to enhance robot skills in grasping and object tracking. Recent computer vision research has demonstrated that Mask R-CNN can be trained to segment specific categories of objects in RGB images when massive hand-labeled datasets are available. As generating these datasets is time consuming, we instead train with synthetic depth images. Many robots now use depth sensors, and recent results suggest training on synthetic depth data can transfer successfully to the real world. We present a method for automated dataset generation and rapidly generate a synthetic training dataset of 50,000 depth images and 320,000 object masks using simulated heaps of 3D CAD models. We train a variant of Mask R-CNN with domain randomization on the generated dataset to perform category-agnostic instance segmentation without any hand-labeled data and we evaluate the trained network, which we refer to as Synthetic Depth (SD) Mask R-CNN, on a set of real, high-resolution depth images of challenging, densely-cluttered bins containing objects with highly-varied geometry. SD Mask R-CNN outperforms point cloud clustering baselines by an absolute 15% in Average Precision and 20% in Average Recall on COCO benchmarks, and achieves performance levels similar to a Mask R-CNN trained on a massive, hand-labeled RGB dataset and fine-tuned on real images from the experimental setup. We deploy the model in an instance-specific grasping pipeline to demonstrate its usefulness in a robotics application. Code, the synthetic training dataset, and supplementary material are available online.

@inproceedings{danielczuk2019segmenting,
author = "Danielczuk, Michael and Matl, Matthew and Gupta, Saurabh and Li, Andrew and Lee, Andrew and Mahler, Jeffrey and Goldberg, Ken",
title = "Segmenting unknown {3D} objects from real depth images using mask {R-CNN} trained on synthetic point clouds",
booktitle = "International Conference on Robotics and Automation",
year = "2019"
}
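
The core recipe can be approximated with off-the-shelf torchvision (version 0.13 or later assumed): replicate a synthetic depth image to three channels and train a category-agnostic, two-class (background vs. object) Mask R-CNN. This is a sketch with a dummy training example, not the released SD Mask R-CNN code.

# Sketch of the core recipe (not the released SD Mask R-CNN code): train a
# category-agnostic Mask R-CNN on synthetic depth images replicated to 3 channels.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Two classes: background and a single category-agnostic "object" class.
model = maskrcnn_resnet50_fpn(weights=None, num_classes=2)
model.train()

def depth_to_input(depth):
    """Normalize a (H, W) synthetic depth image and replicate it to 3 channels."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-6)
    return d.unsqueeze(0).repeat(3, 1, 1)

# One dummy synthetic training example: a random depth map with one box + mask.
depth = torch.rand(240, 320)
mask = torch.zeros(1, 240, 320, dtype=torch.uint8)
mask[0, 60:120, 100:180] = 1
target = {"boxes": torch.tensor([[100.0, 60.0, 180.0, 120.0]]),
          "labels": torch.tensor([1]),
          "masks": mask}

losses = model([depth_to_input(depth)], [target])
print({k: float(v) for k, v in losses.items()})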

2018

Visual Memory for Robust Path Following
Ashish Kumar*, Saurabh Gupta*, David Fouhey, Sergey Levine, Jitendra Malik
Neural Information Processing Systems (NeurIPS), 2018
abstract / bibtex / webpage

Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches and other learning based baselines.

@inproceedings{kumar2018visual,
author = "Kumar*, Ashish and Gupta*, Saurabh and Fouhey, David and Levine, Sergey and Malik, Jitendra",
title = "Visual Memory for Robust Path Following",
booktitle = "Advances in Neural Information Processing Systems",
year = "2018"
}

Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei Efros, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2018
abstract / bibtex / webpage / arXiv link / code

The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

@inproceedings{tulsiani2018factoring,
author = "Tulsiani, Shubham and Gupta, Saurabh and Fouhey, David and Efros, Alexei A and Malik, Jitendra",
title = "Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
year = "2018"
}

On Evaluation of Embodied Navigation Agents
Peter Anderson, Angel Chang, Devendra Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir Zamir
arXiv, 2018
abstract / bibtex / arXiv link

Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.

@article{anderson2018evaluation,
author = "Anderson, Peter and Chang, Angel and Chaplot, Devendra Singh and Dosovitskiy, Alexey and Gupta, Saurabh and Koltun, Vladlen and Kosecka, Jana and Malik, Jitendra and Mottaghi, Roozbeh and Savva, Manolis and Zamir, Amir",
title = "On Evaluation of Embodied Navigation Agents",
journal = "arXiv preprint arXiv:1807.06757",
year = "2018"
}
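
One evaluation measure associated with this report is SPL (Success weighted by Path Length). The helper below implements that definition; the three episodes at the end are invented purely for illustration.

# Success weighted by Path Length (SPL), one of the evaluation measures
# associated with this report. Minimal implementation.
def spl(successes, shortest_path_lengths, agent_path_lengths):
    """successes: list of 0/1; shortest/agent path lengths in the same units.
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_path_lengths, agent_path_lengths)]
    return sum(terms) / len(terms)

# Three episodes: success with a detour, failure, success along the shortest path.
print(spl([1, 0, 1], [5.0, 8.0, 3.0], [7.5, 4.0, 3.0]))  # approx (0.667 + 0 + 1) / 3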

2017

Cognitive mapping and planning for visual navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2017
abstract / bibtex / code+simulation environment / arXiv link / website

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as "go to a chair".

@inproceedings{gupta2017cognitive,
author = "Gupta, Saurabh and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra",
title = "Cognitive mapping and planning for visual navigation",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
year = "2017"
}

Unifying Map and Landmark based Representations for Visual Navigation
Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik
arXiv, 2017
abstract / bibtex / arXiv link / webpage

This work presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

@article{gupta2017unifying,
author = "Gupta, Saurabh and Fouhey, David and Levine, Sergey and Malik, Jitendra",
title = "Unifying Map and Landmark based Representations for Visual Navigation",
journal = "arXiv preprint arXiv:1712.08125",
year = "2017"
}

2016

Cross modal distillation for supervision transfer
Saurabh Gupta, Judy Hoffman, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2016
abstract / bibtex / arXiv link / NYUD2 Detectors + Supervision Transfer Models / data

In this work we propose a technique that transfers supervision between images from different modalities. We use learned representations from a large labeled modality as supervisory signal for training representations for a new unlabeled paired modality. Our method enables learning of rich representations for unlabeled modalities and can be used as a pre-training procedure for new modalities with limited labeled data. We transfer supervision from labeled RGB images to unlabeled depth and optical flow images and demonstrate large improvements for both these cross modal supervision transfers.

@inproceedings{gupta2016cross,
author = "Gupta, Saurabh and Hoffman, Judy and Malik, Jitendra",
title = "Cross modal distillation for supervision transfer",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "2827--2836",
year = "2016"
}
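
The supervision-transfer objective reduces to a feature-matching loss: a frozen network on the labeled modality provides mid-level feature targets for a network on the unlabeled, paired modality. The tiny CNNs and the random paired batch below are placeholders.

# Minimal sketch of cross-modal supervision transfer: match a depth network's
# mid-level features to those of a frozen, pretrained RGB network on paired images.
import torch
import torch.nn as nn

def small_cnn(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 5, stride=2, padding=2), nn.ReLU(),
                         nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU())

rgb_net = small_cnn(3).eval()          # stands in for a large labeled-RGB model
for p in rgb_net.parameters():
    p.requires_grad_(False)
depth_net = small_cnn(1)               # student for the unlabeled paired modality

opt = torch.optim.SGD(depth_net.parameters(), lr=1e-2)
rgb, depth = torch.rand(4, 3, 128, 128), torch.rand(4, 1, 128, 128)  # paired batch

target = rgb_net(rgb)                   # mid-level features act as supervision
loss = nn.functional.mse_loss(depth_net(depth), target)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))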

Learning with side information through modality hallucination
Judy Hoffman, Saurabh Gupta, Trevor Darrell
Computer Vision and Pattern Recognition (CVPR), 2016
abstract / bibtex

We present a modality hallucination architecture for training an RGB object detection model which incorporates depth side information at training time. Our convolutional hallucination network learns a new and complementary RGB image representation which is taught to mimic convolutional mid-level features from a depth network. At test time images are processed jointly through the RGB and hallucination networks to produce improved detection performance. Thus, our method transfers information commonly extracted from depth training data to a network which can extract that information from the RGB counterpart. We present results on the standard NYUDv2 dataset and report improvement on the RGB detection task.

@inproceedings{hoffman2016learning,
author = "Hoffman, Judy and Gupta, Saurabh and Darrell, Trevor",
title = "Learning with side information through modality hallucination",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "826--834",
year = "2016"
}

Cross-modal adaptation for RGB-D detection
Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, Trevor Darrell
International Conference on Robotics and Automation (ICRA), 2016
abstract / bibtex

In this paper we propose a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance. Given labeled depth images for a handful of categories we adapt an RGB object detector for a new category such that it can now use depth images in addition to RGB images at test time to produce more accurate detections. Our approach is built upon the observation that the lower layers of a CNN are largely task- and category-agnostic but domain-specific, while the higher layers are largely task- and category-specific but domain-agnostic. We operationalize this observation by proposing a mid-level fusion of RGB and depth CNNs. Experimental evaluation on the challenging NYUD2 dataset shows that our proposed adaptation technique results in an average 21% relative improvement in detection performance over an RGB-only baseline even when no depth training data is available for the particular category evaluated. We believe our proposed technique will extend advances made in computer vision to RGB-D data leading to improvements in performance at little additional annotation effort.

@inproceedings{hoffman2016cross,
author = "Hoffman, Judy and Gupta, Saurabh and Leong, Jian and Guadarrama, Sergio and Darrell, Trevor",
title = "Cross-modal adaptation for RGB-D detection",
booktitle = "Robotics and Automation (ICRA), 2016 IEEE International Conference on",
pages = "5032--5039",
year = "2016",
organization = "IEEE"
}
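
A mid-level fusion architecture of the kind proposed here can be sketched as two modality-specific lower streams whose features are concatenated and passed to shared higher layers. The layer sizes and the classification head below are illustrative placeholders, not the paper's detector.

# Sketch of mid-level fusion: modality-specific lower layers for RGB and depth,
# with concatenated features fed to shared higher layers (all placeholders).
import torch
import torch.nn as nn

def lower_layers(in_ch):
    # Roughly the "task/category-agnostic, domain-specific" early processing.
    return nn.Sequential(nn.Conv2d(in_ch, 32, 5, stride=4, padding=2), nn.ReLU(),
                         nn.Conv2d(32, 64, 5, stride=4, padding=2), nn.ReLU())

class MidLevelFusionDetector(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.rgb_stream = lower_layers(3)
        self.depth_stream = lower_layers(1)
        # Higher layers are shared across modalities after fusion.
        self.head = nn.Sequential(nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.head(fused)

model = MidLevelFusionDetector()
scores = model(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
print(scores.shape)   # (2, num_classes); a full detector would score regions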

The three R's of computer vision: Recognition, reconstruction and reorganization
Jitendra Malik, Pablo Arbelaez, Joao Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, Shubham Tulsiani
Pattern Recognition Letters, 2016
abstract / bibtex

We argue for the importance of the interaction between recognition, reconstruction and re-organization, and propose that as a unifying framework for computer vision. In this view, recognition of objects is reciprocally linked to re-organization, with bottom-up grouping processes generating candidates, which can be classified using top down knowledge, following which the segmentations can be refined again. Recognition of 3D objects could benefit from a reconstruction of 3D structure, and 3D reconstruction can benefit from object category-specific priors. We also show that reconstruction of 3D structure from video data goes hand in hand with the reorganization of the scene. We demonstrate pipelined versions of two systems, one for RGB-D images, and another for RGB images, which produce rich 3D scene interpretations in this framework.

@article{malik2016three,
author = "Malik, Jitendra and Arbel{\'a}ez, Pablo and Carreira, Joao and Fragkiadaki, Katerina and Girshick, Ross and Gkioxari, Georgia and Gupta, Saurabh and Hariharan, Bharath and Kar, Abhishek and Tulsiani, Shubham",
title = "The three R's of computer vision: Recognition, reconstruction and reorganization",
journal = "Pattern Recognition Letters",
volume = "72",
pages = "4--14",
year = "2016",
publisher = "North-Holland"
}

2015

Aligning 3D models to RGB-D images of cluttered scenes
Saurabh Gupta, Pablo Arbelaez, Ross Girshick, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2015
abstract / bibtex / poster / arXiv link

The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel surface normals in images containing renderings of synthetic objects. When tested on real data, our method outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place into the scene the model that fits best. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art, while being an order of magnitude faster.

@inproceedings{gupta2015aligning,
author = "Gupta, Saurabh and Arbel{\'a}ez, Pablo and Girshick, Ross and Malik, Jitendra",
title = "Aligning 3D models to RGB-D images of cluttered scenes",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "4731--4740",
year = "2015"
}

Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation
Saurabh Gupta, Pablo Arbelaez, Ross Girshick, Jitendra Malik
International Journal of Computer Vision (IJCV), 2015
abstract / bibtex / code / dev code

In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by analyzing and computing Histogram of Oriented Gradients (HOG) on the depth image and using them with deformable part models (DPM). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improve semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state-of-the-art.

@article{gupta2015indoor,
author = "Gupta, Saurabh and Arbel{\'a}ez, Pablo and Girshick, Ross and Malik, Jitendra",
title = "Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation",
journal = "International Journal of Computer Vision",
volume = "112",
number = "2",
pages = "133--149",
year = "2015",
publisher = "Springer US"
}

Visual semantic role labeling
Saurabh Gupta, Jitendra Malik
arXiv, 2015
abstract / bibtex / v-coco dataset / arXiv link

In this paper we introduce the problem of Visual Semantic Role Labeling: given an image we want to detect people doing actions and localize the objects of interaction. Classical approaches to action recognition either study the task of action classification at the image or video clip level or at best produce a bounding box around the person doing the action. We believe such an output is inadequate and a complete understanding can only come when we are able to associate objects in the scene to the different semantic roles of the action. To enable progress towards this goal, we annotate a dataset of 16K people instances in 10K images with actions they are doing and associate objects in the scene with different semantic roles for each action. Finally, we provide a set of baseline algorithms for this task and analyze error modes providing directions for future work.

@article{gupta2015visual,
author = "Gupta, Saurabh and Malik, Jitendra",
title = "Visual semantic role labeling",
journal = "arXiv preprint arXiv:1505.04474",
year = "2015"
}

Exploring person context and local scene context for object detection
Saurabh Gupta*, Bharath Hariharan*, Jitendra Malik
arXiv, 2015
abstract / bibtex

In this paper we explore two ways of using context for object detection. The first model focuses on people and the objects they commonly interact with, such as fashion and sports accessories. The second model considers more general object detection and uses the spatial relationships between objects and between objects and scenes. Our models are able to capture precise spatial relationships between the context and the object of interest, and make effective use of the appearance of the contextual region. On the newly released COCO dataset, our models provide relative improvements of up to 5% over CNN-based state-of-the-art detectors, with the gains concentrated on hard cases such as small objects (10% relative improvement).

@article{gupta2015exploring,
author = "Gupta*, Saurabh and Hariharan*, Bharath and Malik, Jitendra",
title = "Exploring person context and local scene context for object detection",
journal = "arXiv preprint arXiv:1511.08177",
year = "2015"
}

From captions to visual concepts and back
Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh Srivastava*, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, Lawrence C Zitnick, Geoffrey Zweig
Computer Vision and Pattern Recognition (CVPR), 2015
abstract / bibtex / extended abstract / slides / webpage / visual concept detection code / arXiv link / COCO leader board / blog / poster

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our heldout test set, the system captions have equal or better quality 34% of the time.

@inproceedings{fang2015captions,
author = "Fang*, Hao and Gupta*, Saurabh and Iandola*, Forrest and Srivastava*, Rupesh K and Deng, Li and Doll{\'a}r, Piotr and Gao, Jianfeng and He, Xiaodong and Mitchell, Margaret and Platt, John C and Zitnick, C Lawrence and Zweig, Geoffrey",
title = "From captions to visual concepts and back",
booktitle = "Proceedings of the IEEE conference on computer vision and pattern recognition",
pages = "1473--1482",
year = "2015"
}
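
The multiple-instance learning step for word detection can be summarized by the noisy-OR combination of per-region probabilities: an image fires for a word if any of its regions does. The region features, vocabulary size, and labels below are placeholders.

# Sketch of noisy-OR multiple-instance learning for word detection: an image is
# positive for a word if any of its regions is. Features and labels are dummies.
import torch
import torch.nn as nn

VOCAB, REGIONS, FEAT = 1000, 12, 256
region_classifier = nn.Linear(FEAT, VOCAB)     # per-region word probabilities

def image_word_probs(region_feats):
    """region_feats: (B, R, F). Noisy-OR over regions:
    p(word | image) = 1 - prod_r (1 - p(word | region_r))."""
    p_region = torch.sigmoid(region_classifier(region_feats))     # (B, R, V)
    return 1.0 - torch.prod(1.0 - p_region, dim=1)                # (B, V)

feats = torch.rand(4, REGIONS, FEAT)           # stand-in for CNN region features
word_labels = torch.randint(0, 2, (4, VOCAB)).float()  # words present in captions
probs = image_word_probs(feats)
loss = nn.functional.binary_cross_entropy(probs, word_labels)
loss.backward()
print(float(loss))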

Language models for image captioning: The quirks and what works
Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell
Association for Computational Linguistics (ACL), 2015
abstract / bibtex / arXiv link

Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

@article{devlin2015language,
author = "Devlin, Jacob and Cheng, Hao and Fang, Hao and Gupta, Saurabh and Deng, Li and He, Xiaodong and Zweig, Geoffrey and Mitchell, Margaret",
title = "Language models for image captioning: The quirks and what works",
journal = "arXiv preprint arXiv:1505.01809",
year = "2015"
}

Exploring nearest neighbor approaches for image captioning
Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, Lawrence C Zitnick
arXiv, 2015
abstract / bibtex / arXiv link

We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

@article{devlin2015exploring,
author = "Devlin, Jacob and Gupta, Saurabh and Girshick, Ross and Mitchell, Margaret and Zitnick, C Lawrence",
title = "Exploring nearest neighbor approaches for image captioning",
journal = "arXiv preprint arXiv:1505.04467",
year = "2015"
}
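
The consensus selection step can be sketched as follows: among the captions borrowed from nearest neighbor images, keep the one most similar on average to the rest. The token-overlap similarity below is only a stand-in for the BLEU/CIDEr-style similarity used in practice.

# Sketch of consensus caption selection over captions borrowed from nearest
# neighbor images; a token Jaccard overlap stands in for BLEU/CIDEr similarity.
def similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def consensus_caption(candidates):
    def mean_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(similarity(c, o) for o in others) / len(others)
    return max(candidates, key=mean_sim)

borrowed = ["a man riding a wave on a surfboard",
            "a surfer rides a large wave in the ocean",
            "a man on a surfboard riding a wave",
            "a dog runs across a grassy field"]
print(consensus_caption(borrowed))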

Microsoft COCO captions: Data collection and evaluation server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, Lawrence C Zitnick
arXiv, 2015
abstract / bibtex / arXiv link / code

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.

@article{chen2015microsoft,
author = "Chen, Xinlei and Fang, Hao and Lin, Tsung-Yi and Vedantam, Ramakrishna and Gupta, Saurabh and Doll{\'a}r, Piotr and Zitnick, C Lawrence",
title = "Microsoft COCO captions: Data collection and evaluation server",
journal = "arXiv preprint arXiv:1504.00325",
year = "2015"
}
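
Client-side use of the accompanying evaluation code typically looks like the snippet below; the module and file names follow the coco-caption repository and may differ in forks or newer releases.

# Typical client-side use of the COCO caption evaluation code (module names as
# in the coco-caption repository; they may differ in forks or newer releases).
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("captions_val2014.json")                  # ground-truth annotations
results = coco.loadRes("my_captions_results.json")    # [{"image_id": ..., "caption": ...}]

evaluator = COCOEvalCap(coco, results)
evaluator.params["image_id"] = results.getImgIds()    # score only submitted images
evaluator.evaluate()                                  # runs BLEU, METEOR, ROUGE, CIDEr

for metric, score in evaluator.eval.items():
    print(metric, score)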

2014

Learning rich features from RGB-D images for object detection and segmentation
Saurabh Gupta, Ross Girshick, Pablo Arbelaez, Jitendra Malik
European Conference on Computer Vision (ECCV), 2014
abstract / bibtex / supplementary material / slides / poster / code / pretrained NYUD2 models / pretrained SUN RGB-D models

In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.

@inproceedings{gupta2014learning,
author = "Gupta, Saurabh and Girshick, Ross and Arbel{\'a}ez, Pablo and Malik, Jitendra",
title = "Learning rich features from RGB-D images for object detection and segmentation",
booktitle = "European Conference on Computer Vision (ECCV)",
pages = "345--360",
year = "2014",
organization = "Springer, Cham"
}
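
A heavily simplified version of such a geocentric encoding is sketched below: disparity, height, and angle-with-gravity channels computed from a depth image, assuming gravity along the camera's -y axis and known intrinsics. The actual pipeline estimates the gravity direction and uses more robust surface normals.

# Highly simplified sketch of a geocentric depth encoding in the spirit of the
# paper's embedding (disparity, height above ground, angle with gravity).
import numpy as np

def geocentric_encoding(depth, fy=570.0, cy=240.0):
    H, W = depth.shape
    v = np.repeat(np.arange(H, dtype=np.float32)[:, None], W, axis=1)
    Y = (v - cy) * depth / fy                    # camera-frame y (down in image)
    disparity = 1.0 / (depth + 1e-6)             # horizontal disparity channel
    height = Y.max() - Y                         # height above the lowest point
    # Approximate surface normals from local depth gradients, then the angle
    # between each normal and the assumed "up" direction.
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    normals = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    up = np.array([0.0, -1.0, 0.0], dtype=np.float32)
    angle = np.degrees(np.arccos(np.clip(normals @ up, -1.0, 1.0)))
    return np.dstack([disparity, height, angle])

depth = np.random.uniform(0.5, 5.0, size=(480, 640)).astype(np.float32)
print(geocentric_encoding(depth).shape)   # (480, 640, 3)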

2013

Perceptual organization and recognition of indoor scenes from RGB-D images
Saurabh Gupta, Pablo Arbelaez, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2013
abstract / bibtex / supp / code / dev code / data / poster / slides

We address the problems of contour detection, bottom-up grouping and semantic segmentation using RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We then turn to the problem of semantic segmentation and propose a simple approach that classifies superpixels into the 40 dominant object categories in NYUD2. We use both generic and class-specific features to encode the appearance and geometry of objects. We also show how our approach can be used for scene classification, and how this contextual information in turn improves object recognition. In all of these tasks, we report significant improvements over the state-of-the-art.

@inproceedings{gupta2013perceptual,
author = "Gupta, Saurabh and Arbelaez, Pablo and Malik, Jitendra",
title = "Perceptual organization and recognition of indoor scenes from RGB-D images",
booktitle = "Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on",
pages = "564--571",
year = "2013",
organization = "IEEE"
}

A Data Driven Approach for Algebraic Loop Invariants.
Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Percy Liang, Aditya Nori
European Symposium on Programming (ESOP), 2013
abstract / bibtex

We describe a Guess-and-Check algorithm for computing algebraic equation invariants. The 'guess' phase is data driven and derives a candidate invariant from data generated from concrete executions of the program. This candidate invariant is subsequently validated in a 'check' phase by an off-the-shelf SMT solver. Iterating between the two phases leads to a sound algorithm. Moreover, we are able to prove a bound on the number of decision procedure queries which Guess-and-Check requires to obtain a sound invariant. We show how Guess-and-Check can be extended to generate arbitrary boolean combinations of linear equalities as invariants, which enables us to generate expressive invariants to be consumed by tools that cannot handle non-linear arithmetic. We have evaluated our technique on a number of benchmark programs from recent papers on invariant generation. Our results are encouraging - we are able to efficiently compute algebraic invariants in all cases, with only a few tests.

@inproceedings{sharma2013data,
author = "Sharma, Rahul and Gupta, Saurabh and Hariharan, Bharath and Aiken, Alex and Liang, Percy and Nori, Aditya V",
title = "A Data Driven Approach for Algebraic Loop Invariants.",
booktitle = "ESOP",
volume = "13",
pages = "574--592",
year = "2013"
}
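
The Guess-and-Check loop can be illustrated end to end on the toy loop x = y = 0; while *: x += 1; y += x. The guess phase below fits an algebraic relation to concrete states with exact linear algebra (sympy), and the check phase asks Z3 whether the candidate is inductive. This is a toy illustration, not the paper's implementation, and it requires the sympy and z3-solver packages.

# Toy guess-and-check on the loop "x = y = 0; while *: x += 1; y += x".
from math import lcm
from sympy import Matrix
from z3 import Implies, Int, Not, Solver, unsat

# --- Guess: run the loop concretely and fit a degree-2 algebraic relation.
states = []
xc, yc = 0, 0
for _ in range(8):
    states.append((xc, yc))
    xc += 1
    yc += xc

def monomials(xv, yv):
    return [1, xv, yv, xv * xv, xv * yv, yv * yv]

data = Matrix([monomials(xv, yv) for xv, yv in states])
basis = data.nullspace()[0]                      # exact rational coefficients
scale = lcm(*[int(term.q) for term in basis])    # clear denominators
coeffs = [int(term * scale) for term in basis]
print("candidate coefficients for (1, x, y, x^2, xy, y^2):", coeffs)

# --- Check: the candidate holds initially and is preserved by one iteration
# of the body (x' = x + 1, y' = y + x + 1), for all integers x, y.
x, y = Int("x"), Int("y")
def poly(xv, yv):
    return sum(c * m for c, m in zip(coeffs, monomials(xv, yv)))

holds_initially = (poly(0, 0) == 0)
preserved = Implies(poly(x, y) == 0, poly(x + 1, y + x + 1) == 0)
solver = Solver()
solver.add(Not(preserved))                       # search for a counterexample
inductive = holds_initially and solver.check() == unsat
print("inductive invariant" if inductive else "rejected; gather more data")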

Verification as learning geometric concepts
Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Aditya Nori
Static Analysis Symposium (SAS), 2013
abstract / bibtex

We formalize the problem of program verification as a learning problem, showing that invariants in program verification can be regarded as geometric concepts in machine learning. Safety properties define bad states: states a program should not reach. Program verification explains why a program's set of reachable states is disjoint from the set of bad states. In Hoare Logic, these explanations are predicates that form inductive assertions. Using samples for reachable and bad states and by applying well known machine learning algorithms for classification, we are able to generate inductive assertions. By relaxing the search for an exact proof to classifiers, we obtain complexity theoretic improvements. Further, we extend the learning algorithm to obtain a sound procedure that can generate proofs containing invariants that are arbitrary boolean combinations of polynomial inequalities. We have evaluated our approach on a number of challenging benchmarks and the results are promising.

@inproceedings{sharma2013verification,
author = "Sharma, Rahul and Gupta, Saurabh and Hariharan, Bharath and Aiken, Alex and Nori, Aditya V",
title = "Verification as learning geometric concepts",
booktitle = "International Static Analysis Symposium",
pages = "388--411",
year = "2013",
organization = "Springer, Berlin, Heidelberg"
}

2012

Semantic segmentation using regions and parts
Pablo Arbelaez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2012
abstract / bibtex

We address the problem of segmenting and recognizing objects in real world images, focusing on challenging articulated categories such as humans and other animals. For this purpose, we propose a novel design for region-based object detectors that integrates efficiently top-down information from scanning-windows part models and global appearance cues. Our detectors produce class-specific scores for bottom-up regions, and then aggregate the votes of multiple overlapping candidates through pixel classification. We evaluate our approach on the PASCAL segmentation challenge, and report competitive performance with respect to current leading techniques. On VOC2010, our method obtains the best results in 6/20 categories and the highest performance on articulated objects.

@inproceedings{arbelaez2012semantic,
author = "Arbel{\'a}ez, Pablo and Hariharan, Bharath and Gu, Chunhui and Gupta, Saurabh and Bourdev, Lubomir and Malik, Jitendra",
title = "Semantic segmentation using regions and parts",
booktitle = "Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on",
pages = "3378--3385",
year = "2012",
organization = "IEEE"
}

Funding Acknowledgements