Saurabh Gupta

I am an Assistant Professor at UIUC. Before this, I was a Research Scientist at Facebook AI Research in Pittsburgh working with Prof. Abhinav Gupta. Earlier, I was a Computer Science graduate student at UC Berkeley, where I was advised by Prof. Jitendra Malik. Even earlier, I was an undergraduate at IIT Delhi, where I majored in Computer Science and Engineering.

Prospective Students: If you are interested in working with me, please directly apply through the CS or ECE departments and mention my name.

Email / CSL 319 / CV / Scholar / Github / PhD Thesis
Teaching
ECE 598 SG: Special Topics in Learning-based Robotics
Fall 2020
ECE 549 / CS 543: Computer Vision
Spring 2020
ECE 598 SG: Special Topics in Learning-based Robotics
Fall 2019


Publications

2020
Semantic Visual Navigation by Watching YouTube Videos
Matthew Chang, Arjun Gupta, Saurabh Gupta
arXiv, 2020
abstract / bibtex / webpage / arxiv link / video

Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos do not come with labels for actions or goals, and may not even showcase optimal behavior. Our proposed method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). Our experiments in visually realistic simulations demonstrate that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency for goal reaching, and are able to improve upon end-to-end RL based methods by 66%, while at the same time using 250 times fewer interaction samples. Code, dataset, and models will be made available.

@article{chang2020semantic,
author = "Chang, Matthew and Gupta, Arjun and Gupta, Saurabh",
title = "Semantic Visual Navigation by Watching YouTube Videos",
journal = "arXiv preprint arXiv:2006.10034",
year = "2020"
}
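
A minimal sketch of the kind of off-policy Q-learning update described in the abstract above, written in PyTorch. The network architecture, action space, and hyperparameters are illustrative assumptions, not the paper's implementation; the point is only to show a TD update on pseudo-labeled (image, action, next image, reward) quadruples.

# Illustrative sketch (not the paper's code): one TD update of an image-conditioned
# Q-network on pseudo-labeled (o, a, o', r) quadruples mined from passive videos.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 4   # assumed discrete action space (e.g., forward, left, right, stop)
GAMMA = 0.99

class QNet(nn.Module):
    def __init__(self, num_actions=NUM_ACTIONS):
        super().__init__()
        self.encoder = nn.Sequential(              # toy CNN image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, num_actions)     # Q(o, a) for each discrete action

    def forward(self, obs):
        return self.head(self.encoder(obs))

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def td_update(obs, action, next_obs, reward):
    """One Q-learning step on a batch of pseudo-labeled quadruples."""
    q_sa = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + GAMMA * target_net(next_obs).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch; in the paper's setting the actions are pseudo-labels produced by
# an inverse model run on consecutive video frames.
obs = torch.rand(8, 3, 64, 64)
next_obs = torch.rand(8, 3, 64, 64)
action = torch.randint(0, NUM_ACTIONS, (8,))
reward = torch.rand(8)
print(td_update(obs, action, next_obs, reward))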
Semantic Curiosity for Active Visual Learning
Devendra Chaplot*, Helen Jiang*, Saurabh Gupta, Abhinav Gupta
European Conference on Computer Vision (ECCV), 2020
abstract / bibtex / webpage / arxiv link

In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this will require labeling millions of frames required for training RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation -- the detection outputs should be consistent. Therefore, our semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with other possible alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.

@inproceedings{chaplot2020semantic,
author = "Chaplot, Devendra Singh and Jiang, Helen and Gupta, Saurabh and Gupta, Abhinav",
title = "Semantic Curiosity for Active Visual Learning",
year = "2020",
booktitle = "European Conference on Computer Vision (ECCV)"
}
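
A toy sketch of the semantic-curiosity reward as I read the abstract above: the reward is larger when the detector labels the same tracked object inconsistently across a trajectory. The entropy-based scoring and the data format are assumptions for illustration, not the paper's implementation.

# Illustrative sketch: reward a trajectory by how inconsistently the detector
# labels each tracked object across frames (entropy of the track's class histogram).
import math
from collections import Counter

def semantic_curiosity_reward(tracks):
    """tracks: dict mapping track_id -> list of predicted class labels over frames."""
    reward = 0.0
    for labels in tracks.values():
        counts = Counter(labels)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        reward += entropy          # 0 when labels agree, larger when they flip
    return reward

# A trajectory where one object flips between 'chair' and 'sofa' gets rewarded,
# encouraging the exploration policy to revisit such confusing regions.
print(semantic_curiosity_reward({0: ['chair', 'sofa', 'chair'], 1: ['table'] * 3}))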
Aligning Videos in Space and Time
Senthil Purushwalkam*, Tian Ye*, Saurabh Gupta, Abhinav Gupta
European Conference on Computer Vision (ECCV), 2020
abstract / bibtex / arxiv link

In this paper, we focus on the task of extracting visual correspondences across videos. Given a query video clip from an action class, we aim to align it with training videos in space and time. Obtaining training data for such a fine-grained alignment task is challenging and often ambiguous. Hence, we propose a novel alignment procedure that learns such correspondence in space and time via cross video cycle-consistency. During training, given a pair of videos, we compute cycles that connect patches in a given frame in the first video by matching through frames in the second video. Cycles that connect overlapping patches together are encouraged to score higher than cycles that connect non-overlapping patches. Our experiments on the Penn Action and Pouring datasets demonstrate that the proposed method can successfully learn to correspond semantically similar patches across videos, and learns representations that are sensitive to object and action states.

@inproceedings{purushwalkam2020aligning,
author = "Purushwalkam, Senthil and Ye, Tian and Gupta, Saurabh and Gupta, Abhinav",
title = "Aligning Videos in Space and Time",
year = "2020",
booktitle = "European Conference on Computer Vision (ECCV)"
}
Neural Topological SLAM for Visual Navigation
Devendra Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, Saurabh Gupta
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / webpage / video

This paper studies the problem of image-goal navigation which involves navigating to the location indicated by a goal image in a novel previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features, that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.

@inproceedings{chaplot2020neural,
author = "Chaplot, Devendra Singh and Salakhutdinov, Ruslan and Gupta, Abhinav and Gupta, Saurabh",
title = "Neural Topological SLAM for Visual Navigation",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}
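
To make the representation described above concrete, here is a small sketch of a topological map whose nodes carry semantic features and whose edges carry coarse relative poses, with nearest-neighbor localization and graph-search planning. The class, the cosine-similarity localizer, and the networkx-based search are my illustrative assumptions, not the paper's architecture.

# Illustrative sketch, not the paper's code: a topological map with feature-bearing
# nodes, coarse relative-pose edges, feature-matching localization, and graph search.
import numpy as np
import networkx as nx

class TopologicalMap:
    def __init__(self):
        self.graph = nx.Graph()

    def add_node(self, node_id, feature):
        self.graph.add_node(node_id, feature=np.asarray(feature, dtype=float))

    def add_edge(self, u, v, rel_pose):
        # rel_pose: coarse (dx, dy, dtheta) between the two nodes
        self.graph.add_edge(u, v, rel_pose=rel_pose)

    def localize(self, feature):
        """Return the node whose stored feature is most similar (cosine) to the query."""
        feature = np.asarray(feature, dtype=float)
        def score(n):
            f = self.graph.nodes[n]['feature']
            return float(feature @ f / (np.linalg.norm(feature) * np.linalg.norm(f) + 1e-8))
        return max(self.graph.nodes, key=score)

    def plan(self, feature_current, feature_goal):
        """Node sequence from the node matching the current view to the node
        matching the goal image."""
        return nx.shortest_path(self.graph,
                                self.localize(feature_current),
                                self.localize(feature_goal))

m = TopologicalMap()
m.add_node('a', [1, 0]); m.add_node('b', [0.7, 0.7]); m.add_node('c', [0, 1])
m.add_edge('a', 'b', (1.0, 0.0, 0.0)); m.add_edge('b', 'c', (0.5, 0.5, 0.2))
print(m.plan([0.9, 0.1], [0.1, 0.9]))   # -> ['a', 'b', 'c']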
Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects
Kiana Ehsani, Shubham Tulsiani, Saurabh Gupta, Ali Farhadi, Abhinav Gupta
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / webpage / arxiv link / code+data

When we humans look at a video of human-object interaction, we can not only infer what is happening but we can even extract actionable information and imitate those interactions. On the other hand, current recognition or geometric approaches lack the physicality of action representation. In this paper, we take a step towards a more physical understanding of actions. We address the problem of inferring contact points and the physical forces from videos of humans interacting with objects. One of the main challenges in tackling this problem is obtaining ground-truth labels for forces. We sidestep this problem by instead using a physics simulator for supervision. Specifically, we use a simulator to predict effects and enforce that estimated forces must lead to the same effect as depicted in the video. Our quantitative and qualitative results show that (a) we can predict meaningful forces from videos whose effects lead to accurate imitation of the motions observed, (b) by jointly optimizing for contact point and force prediction, we can improve the performance on both tasks in comparison to independent training, and (c) we can learn a representation from this model that generalizes to novel objects using few shot examples.

@inproceedings{ehsani2020force,
author = "Ehsani, Kiana and Tulsiani, Shubham and Gupta, Saurabh and Farhadi, Ali and Gupta, Abhinav",
title = "Use the Force, Luke! Learning to Predict Physical Forces by Simulating Effects",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}
Through Fog High Resolution Imaging Using Millimeter Wave Radar
Junfeng Guan, Sohrab Madani, Suraj Jog, Saurabh Gupta, Haitham Hassanieh
Computer Vision and Pattern Recognition (CVPR), 2020
abstract / bibtex / website

This paper demonstrates high-resolution imaging using millimeter wave (mmWave) radars that can function even in dense fog. We leverage the fact that mmWave signals have favorable propagation characteristics in low visibility conditions, unlike optical sensors like cameras and LiDARs which cannot penetrate through dense fog. Millimeter wave radars, however, suffer from very low resolution, specularity, and noise artifacts. We introduce HawkEye, a system that leverages a cGAN architecture to recover high-frequency shapes from raw low-resolution mmWave heatmaps. We propose a novel design that addresses challenges specific to the structure and nature of the radar signals involved. We also develop a data synthesizer to aid with large-scale dataset generation for training. We implement our system on a custom-built mmWave radar platform and demonstrate performance improvement over both standard mmWave radars and other competitive baselines.

@inproceedings{guan2020through,
author = "Guan, Junfeng and Madani, Sohrab and Jog, Suraj and Gupta, Saurabh and Hassanieh, Haitham",
title = "Through Fog High Resolution Imaging Using Millimeter Wave Radar",
year = "2020",
booktitle = "Computer Vision and Pattern Recognition (CVPR)"
}
Efficient Bimanual Manipulation Using Learned Task Schemas
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
International Conference on Robotics and Automation (ICRA), 2020
abstract / bibtex / video

We address the problem of effectively composing skills to solve sparse-reward tasks in the real world. Given a set of parameterized skills (such as exerting a force or doing a top grasp at a location), our goal is to learn policies that invoke these skills to efficiently solve such tasks. Our insight is that for many tasks, the learning process can be decomposed into learning a state-independent task schema (a sequence of skills to execute) and a policy to choose the parameterizations of the skills in a state-dependent manner. For such tasks, we show that explicitly modeling the schema's state-independence can yield significant improvements in sample efficiency for model-free reinforcement learning algorithms. Furthermore, these schemas can be transferred to solve related tasks, by simply re-learning the parameterizations with which the skills are invoked. We find that doing so enables learning to solve sparse-reward tasks on real-world robotic systems very efficiently. We validate our approach experimentally over a suite of robotic bimanual manipulation tasks, both in simulation and on real hardware.

@inproceedings{chitnis2020efficient,
author = "Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav",
title = "Efficient Bimanual Manipulation Using Learned Task Schemas",
booktitle = "International Conference on Robotics and Automation",
year = "2020"
}
Intrinsic Motivation for Encouraging Synergistic Behavior
Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / webpage

We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks, which are tasks where multiple agents must work together to achieve a goal that they could not achieve individually. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own. Thus, we propose to incentivize agents to take (joint) actions whose effects cannot be predicted via a composition of the predicted effect for each individual agent. We study two instantiations of this idea, one based on the true states encountered, and another based on a dynamics model trained concurrently with the policy. While the former is simpler, the latter has the benefit of being analytically differentiable with respect to the action taken. We validate our approach in robotic bimanual manipulation tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.

@inproceedings{chitnis2020intrinsic,
author = "Chitnis, Rohan and Tulsiani, Shubham and Gupta, Saurabh and Gupta, Abhinav",
title = "Intrinsic Motivation for Encouraging Synergistic Behavior",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/forum?id=SJleNCNtDH"
}
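
A toy numerical sketch of the intrinsic reward idea in the abstract above: reward joint actions whose observed effect differs from the effect obtained by composing per-agent predictions. The state and action representations and the toy dynamics are assumptions chosen only to make the composition explicit.

# Illustrative sketch of the synergy reward (assumed shapes, not the paper's code).
import numpy as np

def composed_prediction(state, a1, a2, single_agent_model):
    """Compose per-agent predictions: apply agent 1's predicted effect, then
    predict agent 2's effect from that intermediate state."""
    s_after_1 = single_agent_model(state, a1, agent=0)
    return single_agent_model(s_after_1, a2, agent=1)

def synergy_reward(state, a1, a2, next_state, single_agent_model):
    predicted = composed_prediction(state, a1, a2, single_agent_model)
    return float(np.linalg.norm(next_state - predicted))   # large when the joint effect is non-compositional

# Toy single-agent dynamics: each arm acting alone can only translate the object.
def toy_model(state, action, agent):
    return state + action

s = np.zeros(2)
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
observed = np.array([1.0, 3.0])           # the joint action achieved something extra
print(synergy_reward(s, a1, a2, observed, toy_model))   # > 0, i.e., rewarded as synergistic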
Learning to Explore Using Active Neural Mapping
Devendra Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, Ruslan Salakhutdinov
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / webpage / slides / openreview link / video / code

This work presents a modular and hierarchical approach to learn policies for exploring 3D environments. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with learned mappers, and global and local policies. Use of learning provides flexibility with respect to input modalities (in mapper), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our proposed approach over past learning and geometry-based approaches.

@inproceedings{chaplot2020learning,
author = "Chaplot, Devendra Singh and Gupta, Saurabh and Gandhi, Dhiraj and Gupta, Abhinav and Salakhutdinov, Ruslan",
title = "Learning To Explore Using Active Neural Mapping",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/pdf?id=HklXn1BKDH"
}
Learning to Move With Affordance Maps
William Qi, Ravi Mullapudi, Saurabh Gupta, Deva Ramanan
International Conference on Learning Representations (ICLR), 2020
abstract / bibtex / code / openreview link

The ability to autonomously explore and navigate a physical space is a fundamental requirement for virtually any mobile autonomous agent, from household robotic vacuums to autonomous vehicles. Traditional SLAM-based approaches for exploration and navigation largely focus on leveraging scene geometry, but fail to model dynamic objects (such as other agents) or semantic constraints (such as wet floors or doorways). Learning-based RL agents are an attractive alternative because they can incorporate both semantic and geometric information, but are notoriously sample inefficient, difficult to generalize to novel settings, and are difficult to interpret. In this paper, we combine the best of both worlds with a modular approach that learns a spatial representation of a scene that is trained to be effective when coupled with traditional geometric planners. Specifically, we design an agent that learns to predict a spatial affordance map that elucidates what parts of a scene are navigable through active self-supervised experience gathering. In contrast to most simulation environments that assume a static world, we evaluate our approach in the VizDoom simulator, using large-scale randomly-generated maps containing a variety of dynamic actors and hazards. We show that learned affordance maps can be used to augment traditional approaches for both exploration and navigation, providing significant improvements in performance.

@inproceedings{qi2020learning,
author = "Qi, William and Mullapudi, Ravi Teja and Gupta, Saurabh and Ramanan, Deva",
title = "Learning to Move with Affordance Maps",
booktitle = "International Conference on Learning Representations",
year = "2020",
url = "https://openreview.net/pdf?id=BJgMFxrYPB"
}

2019
Learning Navigation Subroutines by Watching Videos
Ashish Kumar, Saurabh Gupta, Jitendra Malik
Conference on Robot Learning (CoRL), 2019
abstract / bibtex / website / arXiv link / code

Hierarchies are an effective way to boost sample efficiency in reinforcement learning, and computational efficiency in classical planning. However, acquiring hierarchies via hand-design (as in classical planning) is suboptimal, while acquiring them via end-to-end reward based training (as in reinforcement learning) is unstable and still prohibitively expensive. In this paper, we pursue an alternate paradigm for acquiring such hierarchical abstractions (or visuo-motor subroutines), via use of passive first-person observation data. We use an inverse model trained on small amounts of interaction data to pseudo-label the passive first person videos with agent actions. Visuo-motor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in context of navigation, and show that we can successfully learn consistent and diverse visuo-motor subroutines from passive first-person videos. We demonstrate the utility of our acquired visuo-motor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate behavior of our subroutines in the real world, by deploying them on a real robotic platform.

@inproceedings{kumar2019learning,
author = "Kumar, Ashish and Gupta, Saurabh and Malik, Jitendra",
title = "Learning Navigation Subroutines by Watching Videos",
booktitle = "Conference on Robot Learning",
year = "2019"
}
Combining Optimal Control and Learning for Visual Navigation in Novel Environments
Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, Claire Tomlin
Conference on Robot Learning (CoRL), 2019
abstract / bibtex / website / code

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. However, it is challenging to use model-based methods in settings where the environment is a priori unknown and can only be observed partially through on-board sensors on the robot. In this work, we address this shortcoming by coupling model-based control with learning-based perception. The learning-based perception module produces a series of waypoints that guide the robot to the goal via a collision-free path. These waypoints are used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel, previously-unknown environments as compared to a purely end-to-end learning-based alternative. Our approach, which we refer to as WayPtNav (WayPoint-based Navigation), is successfully able to exhibit goal-driven behavior without relying on detailed explicit 3D maps of the environment, works well with low frame rates, and generalizes well from simulation to the real world.

@inproceedings{bansal2019combining,
author = "Bansal, Somil and Tolani, Varun and Gupta, Saurabh and Malik, Jitendra and Tomlin, Claire",
title = "Combining Optimal Control and Learning for Visual Navigation in Novel Environments",
booktitle = "Conference on Robot Learning",
year = "2019"
}
Learning Exploration Policies for Navigation
Tao Chen, Saurabh Gupta, Abhinav Gupta
International Conference on Learning Representations (ICLR), 2019
abstract / bibtex / website / arXiv link

Numerous past works have tackled the problem of task-driven navigation. But, how to effectively explore a new environment to enable a variety of down-stream tasks has received much less attention. In this work, we study how agents can autonomously explore realistic and complex 3D environments without the context of task-rewards. We propose a learning-based approach and investigate different policy architectures, reward functions, and training paradigms. We find that use of policies with spatial memory that are bootstrapped with imitation learning and finally finetuned with coverage rewards derived purely from on-board sensors can be effective at exploring novel environments. We show that our learned exploration policies can explore better than classical approaches based on geometry alone and generic learning-based exploration techniques. Finally, we also show how such task-agnostic exploration can be used for down-stream tasks. Videos are available at https://sites.google.com/view/exploration-for-nav/.

@inproceedings{chen2018learning,
author = "Chen, Tao and Gupta, Saurabh and Gupta, Abhinav",
title = "Learning Exploration Policies for Navigation",
booktitle = "International Conference on Learning Representations",
year = "2019",
url = "https://openreview.net/forum?id=SyMWn05F7"
}
Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
International Journal of Computer Vision (IJCV), 2019
abstract / bibtex / website / arXiv link / code+simulation environment

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real world buildings. Our experiments demonstrate that CMP outperforms alternate learning-based architectures, as well as classical mapping and path planning approaches, in many cases. Furthermore, it naturally extends to semantically specified goals, such as "going to a chair". We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.

@article{gupta2019cognitive,
author = "Gupta, Saurabh and Tolani, Varun and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra",
title = "Cognitive mapping and planning for visual navigation",
journal = "International Journal of Computer Vision",
year = "2019"
}
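
CMP's differentiable neural planner is patterned after value iteration on the top-down belief map. As a point of reference only (this is generic value iteration, not the paper's learned, differentiable version), plain value iteration on a small occupancy grid looks like this:

# Plain value iteration on a 2D grid, to make the planning step concrete.
import numpy as np

def value_iteration(free, goal, iters=50, step_cost=1.0):
    """free: boolean HxW grid of traversable cells; goal: (row, col)."""
    H, W = free.shape
    value = np.full((H, W), -1e6)
    value[goal] = 0.0
    for _ in range(iters):
        padded = np.pad(value, 1, constant_values=-1e6)
        # best neighboring value for each cell (4-connected moves)
        best = np.max(np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                                padded[1:-1, :-2], padded[1:-1, 2:]]), axis=0)
        updated = np.maximum(value, best - step_cost)
        value = np.where(free, updated, -1e6)
        value[goal] = 0.0
    return value

free = np.ones((5, 5), dtype=bool)
free[2, 1:4] = False                      # a wall across the middle
v = value_iteration(free, goal=(4, 4))
print(np.round(v[0]))                     # more negative for cells farther from the goal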
PyRobot: An Open-Source Robotics Framework for Research and Benchmarking
Adithyavairavan Murali*, Tao Chen*, Kalyan Vasudev Alwala*, Dhiraj Gandhi*, Lerrel Pinto, Saurabh Gupta, Abhinav Gupta
arXiv, 2019
abstract / bibtex / arXiv link / pyrobot code / locobot robot / pyrobot tutorial

This paper introduces PyRobot, an open-source robotics framework for research and benchmarking. PyRobot is a light-weight, high-level interface on top of ROS that provides a consistent set of hardware independent mid-level APIs to control different robots. PyRobot abstracts away details about low-level controllers and inter-process communication, and allows non-robotics researchers (ML, CV researchers) to focus on building high-level AI applications. PyRobot aims to provide a research ecosystem with convenient access to robotics datasets, algorithm implementations and models that can be used to quickly create a state-of-the-art baseline. We believe PyRobot, when paired up with low-cost robot platforms such as LoCoBot, will reduce the entry barrier into robotics, and democratize robotics. PyRobot is open-source, and can be accessed online.

@article{murali2019pyrobot,
author = "Murali*, Adithyavairavan and Chen*, Tao and Alwala*, Kalyan Vasudev and Gandhi*, Dhiraj and Pinto, Lerrel and Gupta, Saurabh and Gupta, Abhinav",
title = "PyRobot: An Open-source Robotics Framework for Research and Benchmarking",
journal = "arXiv preprint arXiv:1906.08236",
year = "2019"
}
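
A hedged usage sketch of the high-level, hardware-independent API style that PyRobot advertises. The method names follow my recollection of the PyRobot documentation and should be checked against the docs; this is not an excerpt from the paper or the library, and it requires a robot (or simulator) with the ROS stack running.

# Hedged sketch of PyRobot-style usage; verify method names against the PyRobot docs.
from pyrobot import Robot

robot = Robot('locobot')            # the same code is meant to target other supported robots

rgb = robot.camera.get_rgb()        # sensor access without touching ROS topics directly
depth = robot.camera.get_depth()

robot.base.go_to_relative([1.0, 0.0, 0.0])   # drive 1 m forward: (x, y, theta) in the base frame
robot.arm.go_home()                          # high-level manipulator command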
Segmenting Unknown 3D Objects From Real Depth Images Using Mask R-CNN Trained on Synthetic Point Clouds
Michael Danielczuk, Matthew Matl, Saurabh Gupta, Andrew Li, Andrew Lee, Jeffrey Mahler, Ken Goldberg
International Conference on Robotics and Automation (ICRA), 2019
abstract / bibtex / website / arXiv link / code

The ability to segment unknown objects in depth images has potential to enhance robot skills in grasping and object tracking. Recent computer vision research has demonstrated that Mask R-CNN can be trained to segment specific categories of objects in RGB images when massive hand-labeled datasets are available. As generating these datasets is time consuming, we instead train with synthetic depth images. Many robots now use depth sensors, and recent results suggest training on synthetic depth data can transfer successfully to the real world. We present a method for automated dataset generation and rapidly generate a synthetic training dataset of 50,000 depth images and 320,000 object masks using simulated heaps of 3D CAD models. We train a variant of Mask R-CNN with domain randomization on the generated dataset to perform category-agnostic instance segmentation without any hand-labeled data and we evaluate the trained network, which we refer to as Synthetic Depth (SD) Mask R-CNN, on a set of real, high-resolution depth images of challenging, densely-cluttered bins containing objects with highly-varied geometry. SD Mask R-CNN outperforms point cloud clustering baselines by an absolute 15% in Average Precision and 20% in Average Recall on COCO benchmarks, and achieves performance levels similar to a Mask R-CNN trained on a massive, hand-labeled RGB dataset and fine-tuned on real images from the experimental setup. We deploy the model in an instance-specific grasping pipeline to demonstrate its usefulness in a robotics application. Code, the synthetic training dataset, and supplementary material are available online.

@inproceedings{danielczuk2019segmenting,
author = "Danielczuk, Michael and Matl, Matthew and Gupta, Saurabh and Li, Andrew and Lee, Andrew and Mahler, Jeffrey and Goldberg, Ken",
title = "Segmenting unknown {3D} objects from real depth images using mask {R-CNN} trained on synthetic point clouds",
booktitle = "International Conference on Robotics and Automation",
year = "2019"
}

2018
Visual Memory for Robust Path Following
Ashish Kumar*, Saurabh Gupta*, David Fouhey, Sergey Levine, Jitendra Malik
Neural Information Processing Systems (NeurIPS), 2018
abstract / bibtex / webpage

Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches and other learning based baselines.

@inproceedings{kumar2018visual,
author = "Kumar*, Ashish and Gupta*, Saurabh and Fouhey, David and Levine, Sergey and Malik, Jitendra",
title = "Visual Memory for Robust Path Following",
booktitle = "Advances in Neural Information Processing Systems",
year = "2018"
}
Factoring Shape, Pose, and Layout From the 2D Image of a 3D Scene
Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei Efros, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2018
abstract / bibtex / webpage / arXiv link / code

The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

@inproceedings{tulsiani2018factoring,
author = "Tulsiani, Shubham and Gupta, Saurabh and Fouhey, David and Efros, Alexei A and Malik, Jitendra",
title = "Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
year = "2018"
}
On Evaluation of Embodied Navigation Agents
Peter Anderson, Angel Chang, Devendra Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir Zamir
arXiv, 2018
abstract / bibtex / arXiv link

Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.

@article{anderson2018evaluation,
author = "Anderson, Peter and Chang, Angel and Chaplot, Devendra Singh and Dosovitskiy, Alexey and Gupta, Saurabh and Koltun, Vladlen and Kosecka, Jana and Malik, Jitendra and Mottaghi, Roozbeh and Savva, Manolis and Zamir, Amir",
title = "On Evaluation of Embodied Navigation Agents",
journal = "arXiv preprint arXiv:1807.06757",
year = "2018"
}
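
Among the evaluation measures this report recommends is SPL (Success weighted by Path Length). A direct implementation of the formula, assuming shortest-path and agent path lengths are given in the same units:

# SPL: an episode counts only if it succeeds, discounted by how much longer the
# agent's path was than the shortest path to the goal.
def spl(episodes):
    """episodes: iterable of (success, shortest_path_length, agent_path_length)."""
    episodes = list(episodes)
    total = sum(float(s) * l / max(p, l) for s, l, p in episodes)
    return total / len(episodes)

# Two successes (one near-optimal, one wasteful) and one failure.
print(spl([(True, 5.0, 5.2), (True, 5.0, 10.0), (False, 5.0, 3.0)]))  # ~0.49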

2017
Cognitive Mapping and Planning for Visual Navigation
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2017
abstract / bibtex / website / arXiv link / code+simulation environment

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as "go to a chair".

@inproceedings{gupta2017cognitive,
author = "Gupta, Saurabh and Davidson, James and Levine, Sergey and Sukthankar, Rahul and Malik, Jitendra",
title = "Cognitive mapping and planning for visual navigation",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
year = "2017"
}
Unifying Map and Landmark Based Representations for Visual Navigation
Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik
arXiv, 2017
abstract / bibtex / webpage / arXiv link

This work presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

@article{gupta2017unifying,
author = "Gupta, Saurabh and Fouhey, David and Levine, Sergey and Malik, Jitendra",
title = "Unifying Map and Landmark based Representations for Visual Navigation",
journal = "arXiv preprint arXiv:1712.08125",
year = "2017"
}

2016
Cross Modal Distillation for Supervision Transfer
Saurabh Gupta, Judy Hoffman, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2016
abstract / bibtex / arXiv link / data / NYUD2 Detectors + Supervision Transfer Models

In this work we propose a technique that transfers supervision between images from different modalities. We use learned representations from a large labeled modality as supervisory signal for training representations for a new unlabeled paired modality. Our method enables learning of rich representations for unlabeled modalities and can be used as a pre-training procedure for new modalities with limited labeled data. We transfer supervision from labeled RGB images to unlabeled depth and optical flow images and demonstrate large improvements for both these cross modal supervision transfers.

@inproceedings{gupta2016cross,
author = "Gupta, Saurabh and Hoffman, Judy and Malik, Jitendra",
title = "Cross modal distillation for supervision transfer",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "2827--2836",
year = "2016"
}
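
A minimal PyTorch sketch of the supervision-transfer objective described above: a depth encoder is trained so that its features match those of a frozen RGB encoder on paired RGB-D images. The ResNet-18 backbones, the feature layer, and the loss are stand-in assumptions, not the paper's setup.

# Sketch of cross-modal distillation with assumed backbones (not the paper's setup).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

rgb_net = models.resnet18()                                   # stand-in for a pretrained RGB model
rgb_encoder = nn.Sequential(*list(rgb_net.children())[:-1])   # keep everything up to pooled features
for p in rgb_encoder.parameters():
    p.requires_grad = False                                   # teacher is frozen

depth_net = models.resnet18()
depth_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel input
depth_encoder = nn.Sequential(*list(depth_net.children())[:-1])

optimizer = torch.optim.SGD(depth_encoder.parameters(), lr=0.01, momentum=0.9)

def distillation_step(rgb, depth):
    with torch.no_grad():
        target = rgb_encoder(rgb).flatten(1)      # teacher features from the labeled modality
    pred = depth_encoder(depth).flatten(1)        # student features for the unlabeled modality
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(distillation_step(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224)))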
Learning With Side Information Through Modality Hallucination
Judy Hoffman, Saurabh Gupta, Trevor Darrell
Computer Vision and Pattern Recognition (CVPR), 2016
abstract / bibtex

We present a modality hallucination architecture for training an RGB object detection model which incorporates depth side information at training time. Our convolutional hallucination network learns a new and complementary RGB image representation which is taught to mimic convolutional mid-level features from a depth network. At test time images are processed jointly through the RGB and hallucination networks to produce improved detection performance. Thus, our method transfers information commonly extracted from depth training data to a network which can extract that information from the RGB counterpart. We present results on the standard NYUDv2 dataset and report improvement on the RGB detection task.

@inproceedings{hoffman2016learning,
author = "Hoffman, Judy and Gupta, Saurabh and Darrell, Trevor",
title = "Learning with side information through modality hallucination",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "826--834",
year = "2016"
}
Cross-Modal Adaptation for RGB-D Detection
Judy Hoffman, Saurabh Gupta, Jian Leong, Sergio Guadarrama, Trevor Darrell
International Conference on Robotics and Automation (ICRA), 2016
abstract / bibtex

In this paper we propose a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance. Given labeled depth images for a handful of categories we adapt an RGB object detector for a new category such that it can now use depth images in addition to RGB images at test time to produce more accurate detections. Our approach is built upon the observation that lower layers of a CNN are largely task and category agnostic and domain specific while higher layers are largely task and category specific while being domain agnostic. We operationalize this observation by proposing a mid-level fusion of RGB and depth CNNs. Experimental evaluation on the challenging NYUD2 dataset shows that our proposed adaptation technique results in an average 21% relative improvement in detection performance over an RGB-only baseline even when no depth training data is available for the particular category evaluated. We believe our proposed technique will extend advances made in computer vision to RGB-D data leading to improvements in performance at little additional annotation effort.

@inproceedings{hoffman2016cross,
author = "Hoffman, Judy and Gupta, Saurabh and Leong, Jian and Guadarrama, Sergio and Darrell, Trevor",
title = "Cross-modal adaptation for RGB-D detection",
booktitle = "Robotics and Automation (ICRA), 2016 IEEE International Conference on",
pages = "5032--5039",
year = "2016",
organization = "IEEE"
}
The Three R's of Computer Vision: Recognition, Reconstruction and Reorganization
Jitendra Malik, Pablo Arbelaez, Joao Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, Shubham Tulsiani
Pattern Recognition Letters, 2016
abstract / bibtex

We argue for the importance of the interaction between recognition, reconstruction and re-organization, and propose that as a unifying framework for computer vision. In this view, recognition of objects is reciprocally linked to re-organization, with bottom-up grouping processes generating candidates, which can be classified using top down knowledge, following which the segmentations can be refined again. Recognition of 3D objects could benefit from a reconstruction of 3D structure, and 3D reconstruction can benefit from object category-specific priors. We also show that reconstruction of 3D structure from video data goes hand in hand with the reorganization of the scene. We demonstrate pipelined versions of two systems, one for RGB-D images, and another for RGB images, which produce rich 3D scene interpretations in this framework.

@article{malik2016three,
author = "Malik, Jitendra and Arbel{\'a}ez, Pablo and Carreira, Joao and Fragkiadaki, Katerina and Girshick, Ross and Gkioxari, Georgia and Gupta, Saurabh and Hariharan, Bharath and Kar, Abhishek and Tulsiani, Shubham",
title = "The three R's of computer vision: Recognition, reconstruction and reorganization",
journal = "Pattern Recognition Letters",
volume = "72",
pages = "4--14",
year = "2016",
publisher = "North-Holland"
}

2015
Aligning 3D Models to RGB-D Images of Cluttered Scenes
Saurabh Gupta, Pablo Arbelaez, Ross Girshick, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2015
abstract / bibtex / arXiv link / poster

The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel surface normals in images containing renderings of synthetic objects. When tested on real data, our method outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place into the scene the model that fits best. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art, while being an order of magnitude faster.

@inproceedings{gupta2015aligning,
author = "Gupta, Saurabh and Arbel{\'a}ez, Pablo and Girshick, Ross and Malik, Jitendra",
title = "Aligning 3D models to RGB-D images of cluttered scenes",
booktitle = "Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition",
pages = "4731--4740",
year = "2015"
}
Indoor Scene Understanding With RGB-D Images: Bottom-Up Segmentation, Object Detection and Semantic Segmentation
Saurabh Gupta, Pablo Arbelaez, Ross Girshick, Jitendra Malik
International Journal of Computer Vision (IJCV), 2015
abstract / bibtex / code / dev code

In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by analyzing and computing Histogram of Oriented Gradients (HOG) on the depth image and using them with deformable part models (DPM). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improves semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state-of-the-art.

@article{gupta2015indoor,
author = "Gupta, Saurabh and Arbel{\'a}ez, Pablo and Girshick, Ross and Malik, Jitendra",
title = "Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation",
journal = "International Journal of Computer Vision",
volume = "112",
number = "2",
pages = "133--149",
year = "2015",
publisher = "Springer US"
}
Visual Semantic Role Labeling
Saurabh Gupta, Jitendra Malik
arXiv, 2015
abstract / bibtex / arXiv link / v-coco dataset

In this paper we introduce the problem of Visual Semantic Role Labeling: given an image we want to detect people doing actions and localize the objects of interaction. Classical approaches to action recognition either study the task of action classification at the image or video clip level or at best produce a bounding box around the person doing the action. We believe such an output is inadequate and a complete understanding can only come when we are able to associate objects in the scene to the different semantic roles of the action. To enable progress towards this goal, we annotate a dataset of 16K people instances in 10K images with actions they are doing and associate objects in the scene with different semantic roles for each action. Finally, we provide a set of baseline algorithms for this task and analyze error modes providing directions for future work.

@article{gupta2015visual,
author = "Gupta, Saurabh and Malik, Jitendra",
title = "Visual semantic role labeling",
journal = "arXiv preprint arXiv:1505.04474",
year = "2015"
}
Exploring Person Context and Local Scene Context for Object Detection
Saurabh Gupta*, Bharath Hariharan*, Jitendra Malik
arXiv, 2015
abstract / bibtex

In this paper we explore two ways of using context for object detection. The first model focusses on people and the objects they commonly interact with, such as fashion and sports accessories. The second model considers more general object detection and uses the spatial relationships between objects and between objects and scenes. Our models are able to capture precise spatial relationships between the context and the object of interest, and make effective use of the appearance of the contextual region. On the newly released COCO dataset, our models provide relative improvements of up to 5% over CNN-based state-of-the-art detectors, with the gains concentrated on hard cases such as small objects (10% relative improvement).

@article{gupta2015exploring,
author = "Gupta*, Saurabh and Hariharan*, Bharath and Malik, Jitendra",
title = "Exploring person context and local scene context for object detection",
journal = "arXiv preprint arXiv:1511.08177",
year = "2015"
}
From Captions to Visual Concepts and Back
Hao Fang*, Saurabh Gupta*, Forrest Iandola*, Rupesh Srivastava*, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, Lawrence C Zitnick, Geoffrey Zweig
Computer Vision and Pattern Recognition (CVPR), 2015
abstract / bibtex / slides / extended abstract / COCO leader board / webpage / poster / blog / arXiv link / visual concept detection code

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our heldout test set, the system captions have equal or better quality 34% of the time.

@inproceedings{fang2015captions,
author = "Fang*, Hao and Gupta*, Saurabh and Iandola*, Forrest and Srivastava*, Rupesh K and Deng, Li and Doll{\'a}r, Piotr and Gao, Jianfeng and He, Xiaodong and Mitchell, Margaret and Platt, John C and Zitnick, C Lawrence and Zweig, Geoffrey",
title = "From captions to visual concepts and back",
booktitle = "Proceedings of the IEEE conference on computer vision and pattern recognition",
pages = "1473--1482",
year = "2015"
}
Language Models for Image Captioning: The Quirks and What Works
Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell
Association for Computational Linguistics (ACL), 2015
abstract / bibtex / arXiv link

Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

@article{devlin2015language,
author = "Devlin, Jacob and Cheng, Hao and Fang, Hao and Gupta, Saurabh and Deng, Li and He, Xiaodong and Zweig, Geoffrey and Mitchell, Margaret",
title = "Language models for image captioning: The quirks and what works",
journal = "arXiv preprint arXiv:1505.01809",
year = "2015"
}
Exploring Nearest Neighbor Approaches for Image Captioning
Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, Lawrence C Zitnick
arXiv, 2015
abstract / bibtex / arXiv link

We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

@article{devlin2015exploring,
author = "Devlin, Jacob and Gupta, Saurabh and Girshick, Ross and Mitchell, Margaret and Zitnick, C Lawrence",
title = "Exploring nearest neighbor approaches for image captioning",
journal = "arXiv preprint arXiv:1505.04467",
year = "2015"
}
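
A toy sketch of the consensus step described above: among captions borrowed from the query image's nearest-neighbor training images, pick the one most similar, on average, to the rest. A simple word-overlap similarity stands in for the caption-similarity measures used in the paper.

# Illustrative consensus caption selection (Jaccard overlap as a stand-in similarity).
def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def consensus_caption(candidates):
    """candidates: captions borrowed from the query image's nearest-neighbor images."""
    def avg_sim(c):
        others = [o for o in candidates if o is not c]
        return sum(similarity(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=avg_sim)

print(consensus_caption([
    "a man riding a horse on a beach",
    "a person rides a horse near the ocean",
    "a man riding a horse on the beach",
    "a dog playing with a frisbee",
]))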
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, Lawrence C Zitnick
arXiv, 2015
abstract / bibtex / arXiv link / code

In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When completed, the dataset will contain over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human generated captions will be provided. To ensure consistency in evaluation of automatic caption generation algorithms, an evaluation server is used. The evaluation server receives candidate captions and scores them using several popular metrics, including BLEU, METEOR, ROUGE and CIDEr. Instructions for using the evaluation server are provided.

@article{chen2015microsoft,
author = "Chen, Xinlei and Fang, Hao and Lin, Tsung-Yi and Vedantam, Ramakrishna and Gupta, Saurabh and Doll{\'a}r, Piotr and Zitnick, C Lawrence",
title = "Microsoft COCO captions: Data collection and evaluation server",
journal = "arXiv preprint arXiv:1504.00325",
year = "2015"
}

2014
Learning Rich Features From RGB-D Images for Object Detection and Segmentation
Saurabh Gupta, Ross Girshick, Pablo Arbelaez, Jitendra Malik
European Conference on Computer Vision (ECCV), 2014
abstract / bibtex / code / supplementary material / poster / slides / pretrained SUN RGB-D models / pretrained NYUD2 models

In this paper we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over current state-of-the-art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.

@inproceedings{gupta2014learning,
author = "Gupta, Saurabh and Girshick, Ross and Arbel{\'a}ez, Pablo and Malik, Jitendra",
title = "Learning rich features from RGB-D images for object detection and segmentation",
booktitle = "European Conference on Computer Vision (ECCV)",
pages = "345--360",
year = "2014",
organization = "Springer, Cham"
}
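
A simplified sketch of the geocentric embedding idea from the paper above: per pixel, encode horizontal disparity, height above the ground, and angle of the local surface normal with gravity. The paper's actual HHA pipeline also estimates the gravity direction and rescales the channels; here the camera height, baseline, and gravity direction are fixed assumptions for illustration.

# Simplified geocentric (HHA-like) encoding; assumed camera parameters, not the paper's code.
import numpy as np

def geocentric_encoding(depth, fx, fy, cx, cy, camera_height=1.0, baseline=0.075):
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx                     # back-project to camera coordinates
    y = (v - cy) * depth / fy
    z = depth

    disparity = baseline * fx / np.maximum(z, 1e-6)
    height = camera_height - y                    # assumes the camera y-axis points down

    # surface normals from finite differences of the back-projected point cloud
    points = np.stack([x, y, z], axis=-1)
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-6

    gravity = np.array([0.0, 1.0, 0.0])           # assumed gravity direction in the camera frame
    angle = np.degrees(np.arccos(np.clip(np.abs(normals @ gravity), -1.0, 1.0)))

    return np.stack([disparity, height, angle], axis=-1)

depth = 1.0 + np.random.rand(120, 160)
encoded = geocentric_encoding(depth, fx=200.0, fy=200.0, cx=80.0, cy=60.0)
print(encoded.shape)                              # (120, 160, 3)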

2013
Perceptual Organization and Recognition of Indoor Scenes From RGB-D Images
Saurabh Gupta, Pablo Arbelaez, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2013
abstract / bibtex / code / dev code / supp / poster / slides / data

We address the problems of contour detection, bottom-up grouping and semantic segmentation using RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We then turn to the problem of semantic segmentation and propose a simple approach that classifies superpixels into the 40 dominant object categories in NYUD2. We use both generic and class-specific features to encode the appearance and geometry of objects. We also show how our approach can be used for scene classification, and how this contextual information in turn improves object recognition. In all of these tasks, we report significant improvements over the state-of-the-art.

@inproceedings{gupta2013perceptual,
author = "Gupta, Saurabh and Arbelaez, Pablo and Malik, Jitendra",
title = "Perceptual organization and recognition of indoor scenes from RGB-D images",
booktitle = "Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on",
pages = "564--571",
year = "2013",
organization = "IEEE"
}
A Data Driven Approach for Algebraic Loop Invariants.
Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Percy Liang, Aditya Nori
European Symposium on Programming (ESOP), 2013
abstract / bibtex

We describe a Guess-and-Check algorithm for computing algebraic equation invariants. The 'guess' phase is data driven and derives a candidate invariant from data generated from concrete executions of the program. This candidate invariant is subsequently validated in a 'check' phase by an off-the-shelf SMT solver. Iterating between the two phases leads to a sound algorithm. Moreover, we are able to prove a bound on the number of decision procedure queries which Guess-and-Check requires to obtain a sound invariant. We show how Guess-and-Check can be extended to generate arbitrary boolean combinations of linear equalities as invariants, which enables us to generate expressive invariants to be consumed by tools that cannot handle non-linear arithmetic. We have evaluated our technique on a number of benchmark programs from recent papers on invariant generation. Our results are encouraging - we are able to efficiently compute algebraic invariants in all cases, with only a few tests.

@inproceedings{sharma2013data,
author = "Sharma, Rahul and Gupta, Saurabh and Hariharan, Bharath and Aiken, Alex and Liang, Percy and Nori, Aditya V",
title = "A Data Driven Approach for Algebraic Loop Invariants.",
booktitle = "ESOP",
volume = "13",
pages = "574--592",
year = "2013"
}
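
A small sketch of the data-driven 'guess' phase described above, restricted to two program variables and degree-2 monomials; the 'check' phase (an SMT query) is omitted. Candidate algebraic invariants are null-space vectors of a matrix of monomials evaluated on states from concrete executions.

# Sketch of the 'guess' phase: candidate equalities c . (1, x, y, x^2, x*y, y^2) = 0
# are recovered as (numerical) null-space directions of the monomial data matrix.
import numpy as np

def guess_invariants(states, tol=1e-6):
    """states: list of (x, y) values observed at the loop head."""
    rows = [[1.0, x, y, x * x, x * y, y * y] for x, y in states]
    A = np.array(rows)
    _, s, vt = np.linalg.svd(A)
    null_mask = np.concatenate([s, np.zeros(vt.shape[0] - len(s))]) < tol
    return vt[null_mask]          # each row is a candidate invariant's coefficient vector

# Trace of a loop that maintains y = x^2, so y - x^2 = 0 should be guessed.
trace = [(x, x * x) for x in range(6)]
for cand in guess_invariants(trace):
    print(np.round(cand, 3))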
Verification as Learning Geometric Concepts
Rahul Sharma, Saurabh Gupta, Bharath Hariharan, Alex Aiken, Aditya Nori
Static Analysis Symposium (SAS), 2013
abstract / bibtex

We formalize the problem of program verification as a learning problem, showing that invariants in program verification can be regarded as geometric concepts in machine learning. Safety properties define bad states: states a program should not reach. Program verification explains why a program's set of reachable states is disjoint from the set of bad states. In Hoare Logic, these explanations are predicates that form inductive assertions. Using samples for reachable and bad states and by applying well known machine learning algorithms for classification, we are able to generate inductive assertions. By relaxing the search for an exact proof to classifiers, we obtain complexity theoretic improvements. Further, we extend the learning algorithm to obtain a sound procedure that can generate proofs containing invariants that are arbitrary boolean combinations of polynomial inequalities. We have evaluated our approach on a number of challenging benchmarks and the results are promising.

@inproceedings{sharma2013verification,
author = "Sharma, Rahul and Gupta, Saurabh and Hariharan, Bharath and Aiken, Alex and Nori, Aditya V",
title = "Verification as learning geometric concepts",
booktitle = "International Static Analysis Symposium",
pages = "388--411",
year = "2013",
organization = "Springer, Berlin, Heidelberg"
}

2012
Semantic Segmentation Using Regions and Parts
Pablo Arbelaez, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, Jitendra Malik
Computer Vision and Pattern Recognition (CVPR), 2012
abstract / bibtex

We address the problem of segmenting and recognizing objects in real world images, focusing on challenging articulated categories such as humans and other animals. For this purpose, we propose a novel design for region-based object detectors that integrates efficiently top-down information from scanning-windows part models and global appearance cues. Our detectors produce class-specific scores for bottom-up regions, and then aggregate the votes of multiple overlapping candidates through pixel classification. We evaluate our approach on the PASCAL segmentation challenge, and report competitive performance with respect to current leading techniques. On VOC2010, our method obtains the best results in 6/20 categories and the highest performance on articulated objects.

@inproceedings{arbelaez2012semantic,
author = "Arbel{\'a}ez, Pablo and Hariharan, Bharath and Gu, Chunhui and Gupta, Saurabh and Bourdev, Lubomir and Malik, Jitendra",
title = "Semantic segmentation using regions and parts",
booktitle = "Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on",
pages = "3378--3385",
year = "2012",
organization = "IEEE"
}