Generating Multi-hierarchy Scene Graphs for Human-instructed Manipulation Tasks in Open-world Settings

Sandeep S. Zachariah1*, Aman Tambi1*, Moksh Malhotra1, P. V. M. Rao1, Rohan Paul1
1Indian Institute of Technology Delhi; *equal contribution

Abstract

Generating viable multi-step plans in robotics requires a scene representation that is both open-set and structured to support local updates when the scene changes. We propose a method for generating open-world multi-perspective scene graphs using foundation models, which can support downstream planning tasks. We demonstrate that our method yields superior results compared to prior work in both open-world object detection and relation extraction, even without any priors. Moreover, we illustrate how the multi-perspective nature of the scene graph helps the planner devise feasible plans for tasks that require reasoning over spatial arrangements and object category abstractions.

Approach Overview

Scene Graphs for Planning


Results

Multi-hierarchical Scene Graphs

The figure shows the scene graph generated by our approach for the given robot workspace. The scene graph has three perspectives: spatio-depth relations, planar relations, and category-wise abstraction. Each node in the scene graph represents an object in the workspace and carries attributes such as color and pose. This scene graph supports the execution of tasks like "place the book that is to the right of the basket on the rack" (which requires knowledge of the planar relation between the books and the basket, as well as the 'onTop' relation between the book and the glasses, to generate viable plans) and "give me something to eat" (which requires object category abstraction). A minimal sketch of such a structure follows.
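To make the structure concrete, below is a minimal Python sketch of a multi-perspective scene graph with the three perspectives described above. This is an illustrative assumption, not the authors' implementation: the class names (SceneNode, SceneGraph), the relation labels ("rightOf", "onTop", "isA"), and the query helper are all hypothetical.

# Hypothetical sketch of a multi-perspective scene graph; names and
# relation labels are illustrative assumptions, not the paper's code.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str        # open-set object label, e.g. "book_1"
    color: str       # attribute estimated by perception
    pose: tuple      # (x, y, z) position in the workspace frame

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    # One edge set per perspective.
    edges: dict = field(default_factory=lambda: {
        "spatio_depth": [],   # e.g. ("book_1", "onTop", "glasses_1")
        "planar": [],         # e.g. ("book_1", "rightOf", "basket_1")
        "category": [],       # e.g. ("apple_1", "isA", "edible")
    })

    def add_node(self, node: SceneNode):
        self.nodes[node.name] = node

    def add_edge(self, perspective: str, subj: str, rel: str, obj: str):
        self.edges[perspective].append((subj, rel, obj))

    def query(self, perspective: str, rel: str, obj: str):
        """Return subjects related to `obj` by `rel` in one perspective."""
        return [s for (s, r, o) in self.edges[perspective]
                if r == rel and o == obj]

# Resolving "the book that is to the right of the basket" via the
# planar perspective, and "something to eat" via the category perspective:
g = SceneGraph()
g.add_node(SceneNode("book_1", "red", (0.4, 0.1, 0.0)))
g.add_node(SceneNode("basket_1", "brown", (0.2, 0.1, 0.0)))
g.add_edge("planar", "book_1", "rightOf", "basket_1")
g.add_edge("category", "apple_1", "isA", "edible")
print(g.query("planar", "rightOf", "basket_1"))   # -> ['book_1']
print(g.query("category", "isA", "edible"))       # -> ['apple_1']

Keeping each perspective as a separate edge set means a local change in the scene (say, a moved object) only invalidates edges in the affected perspectives, which is what makes local updates cheap.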

Qualitative Comparison with Baselines

The following figure shows the scene graphs generated by the proposed method and by ConceptGraph, alongside the ground-truth scene graph.