G2TR - Generalized Grounded Temporal Reasoning for Robot Instruction Following via Coupled Pre-trained Models

Riya Arora1*, Niveditha Narendranath1*, Aman Tambi1+, Sandeep S. Zachariah1+,
Souvik Chakraborty1, Rohan Paul1
1Affiliated with IIT Delhi; * and + indicate equal contribution

Introduction

Consider a scenario where a human cleans a table and a robot observing the scene is instructed with the task "Robot, remove the cloth using which I wiped the table". Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task according to the human's instruction. Directly grounding utterances that reference past interactions is challenging due to the multi-hop nature of such references and the large space of candidate object groundings in a video stream observing the robot's workspace. Our key insight is to factor the temporal reasoning task as (i) estimating the video interval associated with the referenced event, (ii) performing spatial reasoning over the interaction frames to infer the intended object, and (iii) semantically tracking the object's location up to the current scene to enable future robot interactions. Our approach leverages existing large pre-trained models (which possess inherent generalization capabilities) and combines them appropriately for temporal grounding tasks. Evaluation on a video-language corpus acquired with a robot manipulator, displaying rich temporal interactions in spatially complex scenes, yields an average accuracy of 70.10%, indicating the potential of G2TR in robot instruction following.

Technical Approach Overview

Generalized Grounded Temporal Reasoning (G2TR) Pipeline:
We address the temporal reasoning task by grounding the requisite interaction in the input video to extract a candidate interval, and then pinpointing the intended object through fine-grained spatial reasoning within this interval. Our framework comprises three components: (1) a Temporal Parser (TP), (2) candidate interval estimation via an Event Localizer (EL), and (3) grounding the object of interest via a Target Detector (TD), followed by object re-acquisition via semantic tracking.

(i) Temporal Parser: The parser formulates two key questions: a temporal question that determines "when" the described interaction occurs, and an object-identification question that specifies "what" object is involved in the interaction. Additionally, it extracts the action that the robot needs to perform on the target object (as shown in the figure above). This is accomplished through in-context learning by providing a large language model (LLM) with the input instruction.
(ii) Event Localization: This module performs temporal reasoning over accrued past observations to identify the likely interval of the required interaction. It takes the entire video (containing the requisite object interaction) and the temporal question from the previous module, and returns the specific time instant at which the specified interaction occurred. This is implemented via a video-understanding vision language model.
(iii) Grounding Object of Interest: The purpose of this module is to perform fine-grained spatial reasoning and ground the target object involved in the interaction. It operates in three steps: (a) the input is sent to a Target Identifier (essentially an image-understanding vision language model) that extracts the class of the intended object to be grounded; (b) a class-based detection returns the bounding box coordinates of all objects belonging to that class, thus providing visual options; (c) the Target Identifier (VLM) then identifies the target by picking the intended object from these visual options.
(iv) Grounding Propagation via Semantic Tracking: This additional module reasons over the grounded object's state and location from the time of the past interaction until the current world state, enabling future robot manipulation. Again, a video-understanding vision language model with tracking abilities is leveraged for this purpose. A minimal sketch of how these modules compose is given below.
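In the following Python sketch, all function and argument names are hypothetical placeholders rather than released code; each callable stands in for one of the pre-trained models described above, and the composition mirrors steps (i)-(iv).

from typing import Any, Callable, Sequence

def g2tr_pipeline(
    video_frames: Sequence[Any],
    instruction: str,
    parse_instruction: Callable[[str], dict],                   # (i)   LLM temporal parser
    localize_event: Callable[[Sequence[Any], str], int],        # (ii)  video-VLM event localizer
    ground_target: Callable[[Sequence[Any], int, str], Any],    # (iii) VLM + phrase grounder
    track_to_present: Callable[[Sequence[Any], int, Any], Any], # (iv)  semantic tracker
) -> tuple:
    """Chain the four G2TR modules; each callable wraps one pre-trained model."""
    parsed = parse_instruction(instruction)
    t_event = localize_event(video_frames, parsed["temporal_question"])
    target_box = ground_target(video_frames, t_event, parsed["object-interaction"])
    current_box = track_to_present(video_frames, t_event, target_box)
    # The action (e.g., "Pick") and the target's current grounding are handed to the robot.
    return parsed["action"], current_box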

Evaluation Corpus

To assess the robot's ability to perform grounded temporal reasoning, we form an evaluation corpus of 155 video-instruction pairs. The dataset was collected in a table-top setting with a Franka Emika robot manipulator observing the scene via an eye-in-hand RGB-D camera. A total of 15 objects representing common household and healthcare items such as cups, bottles, medicines, fruits, notebooks, markers, and handkerchiefs were used. A total of 8 participants performed interactions such as pouring, placing, picking, stacking, replacing, wiping, dropping, repositioning, and swapping with 3-6 objects in each scene. The participants also provided natural language instructions for the robot referencing objects and interactions in the preceding activity in the workspace. Each resulting video-instruction pair is between 6 and 30 seconds long. The instructions and interactions exhibit a natural diversity of spatial and temporal reasoning complexity. For detailed analysis, the evaluation corpus is bifurcated as: (i) single/multi-hop temporal reasoning (56 : 99), (ii) simple/complex spatial grounding (98 : 57), (iii) single/multiple interactions (62 : 93), and (iv) partial or full observation of the referred object (36 : 119).



Example instructions across corpus categories (figure):
Reasoning complexity: Single hop: "Robot, pick the cloth that I just dropped!"; Multi hop: "Robot, give me the object placed second!"
Interaction complexity: Single interaction: "Robot, give the cloth that was just placed!"; Multi interaction: "Robot, pick cup in which water was poured by orange bottle!"
Spatial/visual complexity: Spatially simple: "Robot, remove the cloth used for wiping!"; Spatially complex: "Robot, point to the cup which was just placed on the table"
Observability: Completely observable: "Robot, pick the object that was just placed!"; Partially observable: "Robot, where is the medicine?"

Video Demonstrations

Demonstration 1 - A Simple Scenario: Video showing the robot executing the instruction "Pick the object that I just placed."
The robot in the scene is a Franka Emika arm manipulator. There is a table in front of the robot, on which various objects (glasses, a cup, a small bottle) are already placed, and the robot observes this scene with its camera. A person then places a banana on the table, the robot is instructed to pick the object that was just placed, and it executes the task.


Demonstration 2 - Visually complex scenario with multiple interactions: Video showing the robot carrying out the instruction "Robot, remove the cloth that was used to wipe the table"
The robot observes a table with four cloths (two green, one blue, one pink) and a cleaning liquid. After a human rearranges the items and cleans with one cloth, the robot is instructed to remove the used cloth. It then uses the G2TR pipeline to identify and dispose of the correct cloth in the bin.


Demonstration 3 - Scenario involving partial observability: Video showing the robot responding to the question "Robot, where is the marker that I just used?"
The robot, with its camera, observes a table with two books, two markers, and a bag. A human uses a marker, places it in a book, and then puts the book in the bag. The robot is instructed to identify and point to the partially occluded marker. Using the G2TR pipeline, it successfully identifies and points to the bag containing the marker.


Demonstration 4 - Multi-hop reasoning scenario: Video showing the robot executing the instruction "Give me the bottle placed second by the human"
The robot, equipped with a camera, observes a shelf as a human sequentially places three bottles on it. Later, another individual instructs the robot to retrieve the second bottle placed. To execute this task, the robot leverages the G2TR pipeline, which performs multi-hop reasoning to identify the correct bottle and the robot successfully fulfills the command.


Prompting Details

To achieve generalized grounded temporal reasoning, G2TR utilizes the reasoning capabilities of several pre-trained large models. To ensure these models perform effectively and appropriately for the designated tasks, various prompts and prompting strategies were explored, as detailed below.

Temporal Parser: For this parser, an LLM [GPT-4] was employed. The prompt provided a comprehensive background on the robotic setting, along with a few-shot learning approach using in-context examples, enabling it to generate relevant questions for the subsequent modules.


example_parsing = [
    {"instruction": "Pick the bottle that I just placed.",
     "response": {
         "temporal_question": "When is placing by the person happening in the video? "
                              "Give the exact timestamp.",
         "object-interaction": "the bottle that was placed by the person",
         "action": "Pick"}},

    {"instruction": "Remove the object that was first eaten",
     "response": {
         "temporal_question": "When is eating of first object by the person happening "
                              "in the video? Give the exact timestamp.",
         "object-interaction": "object eaten by person",
         "action": "Point"}},

    {"instruction": "Where is the apple?",
     "response": {
         "temporal_question": "When was the apple last seen? Give the exact timestamp.",
         "object-interaction": "human-interaction happening with apple",
         "action": "Point to"}},

    {"instruction": "Remove the cloth that was used by the boy",
     "response": {
         "temporal_question": "When is the using of cloth by the boy happening in the "
                              "video? Give the exact timestamp.",
         "object-interaction": "cloth used by boy",
         "action": "Remove"}},
]

There is a robot that needs to take a human instruction and formulate a temporal question as per the instruction,
identify the object-interaction of interest and what action to take. Given the human instruction, return a
dictionary with 'temporal_question', 'object-interaction', and 'action' as keys. For the 'object-interaction'
key, remove the temporal aspect and clues such as 'at 2nd second' or 'last'. Return the answer in JSON format always.
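As an illustration, the in-context examples and system instruction above could be assembled into a few-shot chat prompt. The sketch below assumes access through the OpenAI chat API; the client code, model name, and message framing are assumptions for illustration, not the exact setup used in G2TR.

import json
from openai import OpenAI  # assumption: the parser LLM is reached via the OpenAI chat API

# The parser instruction quoted above, abbreviated here for brevity.
SYSTEM_PROMPT = (
    "There is a robot that needs to take a human instruction and formulate a temporal "
    "question ... Return the answer in JSON format always."
)

def parse_instruction(instruction: str, examples: list[dict]) -> dict:
    """Temporal Parser: few-shot prompt the LLM and parse its JSON reply."""
    client = OpenAI()
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Provide the in-context examples (e.g., example_parsing above) as user/assistant turns.
    for ex in examples:
        messages.append({"role": "user", "content": ex["instruction"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["response"])})
    messages.append({"role": "user", "content": instruction})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return json.loads(reply.choices[0].message.content)

For instance, parse_instruction("Remove the cloth used for wiping", example_parsing) would return the dictionary of temporal question, object-interaction, and action consumed by the downstream modules.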
                                              
          

Event Localizer (Candidate Interval Extraction): This module utilizes the temporal reasoning capabilities of the video-understanding VLM [CogVLM2-video], with the temporal question from the temporal parser serving as the prompt.
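Schematically, this module maps the temporal question and the video to a time instant. The sketch below assumes a generic video_vlm_answer wrapper (a hypothetical interface, not the CogVLM2-video API), and extracting the first numeric timestamp from the model's free-form answer is our own simplifying assumption.

import re
from typing import Any, Callable, Sequence

def localize_event(
    frames: Sequence[Any],
    temporal_question: str,
    video_vlm_answer: Callable[[Sequence[Any], str], str],  # hypothetical video-VLM wrapper
    fps: float,
) -> int:
    """Event Localizer: ask the video VLM "when" and convert its answer to a frame index."""
    answer = video_vlm_answer(frames, temporal_question)
    match = re.search(r"\d+(\.\d+)?", answer)   # pull the first number (seconds) out of free text
    seconds = float(match.group()) if match else 0.0
    return min(int(seconds * fps), len(frames) - 1)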

Grounding object of interest: This module employs the spatial reasoning and image understanding features of VLMs [GPT-4] along with the grounding capabilities of Phrase Grounding models [Grounding DINO 1.5 Pro], using the object-interaction from the temporal parser as the initial prompt.

"From the given images, identify with which OBJECT the INTERACTION happened. Return that object only. The object 
interaction is : " + object-interaction (from temporal parser)
          
While asking the same image-understanding model to pick the target object from visual options, the prompt is slightly varied:

"From the given images, identify with which single OBJECT the INTERACTION happened. The last image given to you has 
visual options. Pick the right object number with which the INTERACTION is happening.Return object label."
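Putting the three steps together, a minimal sketch of the grounding module might look as follows. The vlm_answer, detect_class, and draw_options callables are hypothetical wrappers around the image VLM and the phrase grounder, and the assumption that visual options are numbered starting from 1 is ours.

from typing import Any, Callable, Sequence

PROMPT_CLASS = ("From the given images, identify with which OBJECT the INTERACTION happened. "
                "Return that object only. The object interaction is : ")
PROMPT_PICK = ("From the given images, identify with which single OBJECT the INTERACTION happened. "
               "The last image given to you has visual options. Pick the right object number with "
               "which the INTERACTION is happening. Return object label.")

def ground_target(
    interaction_frames: Sequence[Any],
    object_interaction: str,
    vlm_answer: Callable[[Sequence[Any], str], str],  # image VLM wrapper (hypothetical)
    detect_class: Callable[[Any, str], list],         # phrase-grounder wrapper (hypothetical)
    draw_options: Callable[[Any, list], Any],         # overlays numbered boxes on a frame (hypothetical)
):
    """Target Detector: (a) class extraction, (b) class-based detection, (c) option selection."""
    # (a) ask the VLM which object class the interaction involved
    object_class = vlm_answer(interaction_frames, PROMPT_CLASS + object_interaction)
    # (b) detect every instance of that class to produce visual options
    boxes = detect_class(interaction_frames[-1], object_class)
    annotated = draw_options(interaction_frames[-1], boxes)
    # (c) ask the VLM to pick the numbered option corresponding to the target
    choice = vlm_answer(list(interaction_frames) + [annotated], PROMPT_PICK).strip()
    return boxes[int(choice) - 1] if choice.isdigit() else boxes[0]  # assumes labels start at 1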
          
In partial observability cases, the target object goes out of view. In order to track the occluding object, the module is re-prompted as:

"Where did " + object_class + " go? Give the object that partially or completely hid the " + object_class.
          

For example, if the target object is a marker that was kept inside a book, and the book was then put in a bag, the prompts would be:

"Where did the marker go? Give the object that partially or completely hid the marker." 

## And once the marker goes out of view and the occluding object is determined to be a book:

"Where did the book go? Give the object that partially or completely hid the book."
          

Alternate Approaches (Baselines)

The proposed model, G2TR, was compared against two baseline approaches, each reflecting alternative methods for addressing the temporal reasoning problem, resulting in different architectures for combining pre-trained models. Only open-set models were included in the evaluation due to their generality.

G2TR: This model leverages a video-understanding VLM [CogVLM2-video] for candidate interval identification, followed by target detection and grounding using a Vision Language Model [GPT-4] and a Phrase Grounder [Grounding DINO 1.5 Pro]. It also integrates a tracker module [SAM 2] to handle environmental changes.


Direct Temporal Visual Grounding (DTVG): This approach uses a video-understanding VLM for both temporal and spatial reasoning, followed by a Phrase Grounder to identify the intended object, e.g., “green cloth on the left.”

Refined Temporal Visual Grounding (RTVG): Similar to DTVG, this method incorporates Visual Question Answering (VQA) by the VLM to iteratively refine the object description, ensuring precise grounding.

Experimental Insights

We compared the performance of the proposed model G2TR with the alternative approaches, Direct Temporal Visual Grounding (DTVG) and Refined Temporal Visual Grounding (RTVG). The proposed approach outperforms these alternatives by 26% in overall average grounding accuracy. It also performs better than the alternatives in each of the scenario categories considered in our dataset bifurcation.

Analysis of our pipeline, G2TR, shows that the main bottleneck is the Event Localizer's difficulty in precisely identifying the time instant of the language-referenced interaction. The next bottleneck occurs in the Target Detector, which can fail by selecting an incorrect object class, by receiving insufficient visual prompts, or by picking the wrong object label when the visual options are ambiguous. The Tracker occasionally begins tracking the wrong object when it appears identical to the correct one. While our approach performs well in complex scenarios, it struggles with rapid human-object interactions and single actions involving multiple objects, such as replacing or stacking. Another limitation of our pipeline is its tendency to reason from linguistic cues rather than visual evidence. For instance, when tasked with identifying the object into which water was poured, it may incorrectly output a cup, even if the water was poured into a different container.


Component-Wise Analysis of Failure Cases (45/155)


Component-wise analysis of the frequency of failure cases generated by each module.

Failure Case Examples

(a) Incorrect time instant (Expected: 5, G2TR: 12)
Instruction: "Robot, pick bottle placed just before orange bottle"

(b) Incorrect object class (Expected: Bottle, G2TR: Cup)
Instruction: "Pick the container in which water was poured"

(c) Visual options ambiguity (object labels overlapping)
Instruction: "Pick the container by which water was poured"

(d) Incorrect object tracking (wrong cup tracked)
Instruction: "Robot, pick the cup under which strawberry is hidden!"

(e) Insufficient grounding (one cup not grounded)
Instruction: "Robot, point to the cup which was just used by the person"

The examples above display instances of failure within the G2TR framework.

Additional Insights

a. Leveraging a combination of specialized temporal and spatial video- and image-understanding models can provide deeper contextual understanding than a general-purpose image-understanding vision-language model alone.
Interactions are inherently temporal, and videos capture the full sequence, preserving context and flow. Converting this to the image domain requires sub-sampling, which is challenging as the appropriate rate depends on the complexity of the action. This can result in losing crucial details, making video a more reliable medium for understanding interactions.

b. Grounding propagation through semantic tracking enhances G2TR's ability to maintain consistent object identification across frames, even after the initial detection, ensuring more robust performance in dynamic environments.
The relative position of the target object might change after the language-referenced interaction took place. It is essential that its position be determined in the present world scene of the robot.

c. Providing visual options for the image-understanding vision-language model enhances its adaptability by allowing it to better handle ambiguity and improve decision-making in complex visual scenarios.
Converting the spatial description of the target object into a linguistic description can be difficult in visually complex scenarios and may confuse the grounding model. It is therefore easier and more effective to first ground the object class and then pick the target object from the resulting visual options.

Conclusion

We present G2TR, a novel approach to grounded temporal reasoning. We factorize the problem into three key components: (i) candidate interval localization in a video based on the required interaction, (ii) fine-grained spatial reasoning within the localized interval to ground the target object, and (iii) tracking the object post-interaction. By leveraging pre-trained vision language models and large language models, G2TR achieves zero-shot generalization over both the object set and the interactions. We also propose a dataset of 155 video-instruction pairs covering spatially complex, multi-hop, partially observable, and multi-interaction temporal reasoning tasks. Evaluation on the dataset shows significant improvement over alternative approaches, highlighting G2TR's potential in robot instruction following. Finally, it is important to note that G2TR currently has two limitations: (i) it can only process videos up to one minute, as constrained by the video-reasoning model, and (ii) it can ground only a single object at a time. We aim to overcome both limitations in future work.



References

1. [CogVLM2-video] Hong, Wenyi; Wang, Weihan; Ding, Ming; Yu, Wenmeng; Lv, Qingsong; Wang, Yan; Cheng, Yean; Huang, Shiyu; Ji, Junhui; Xue, Zhao; et al. CogVLM2: Visual Language Models for Image and Video Understanding. arXiv preprint arXiv:2408.16500, 2024.
2. [GPT-4] Achiam, Josh; Adler, Steven; Agarwal, Sandhini; Ahmad, Lama; Akkaya, Ilge; Aleman, Florencia Leoni; Almeida, Diogo; Altenschmidt, Janko; Altman, Sam; Anadkat, Shyamal; et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
3. [Grounding DINO 1.5 Pro] Ren, Tianhe; Jiang, Qing; Liu, Shilong; Zeng, Zhaoyang; Liu, Wenlong; Gao, Han; Huang, Hongjie; Ma, Zhengyu; Jiang, Xiaoke; Chen, Yihao; Xiong, Yuda; Zhang, Hao; Li, Feng; Tang, Peijun; Yu, Kent; Zhang, Lei. Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv preprint arXiv:2405.10300v2, 2024.
4. [SAM 2] Ravi, Nikhila; Gabeur, Valentin; Hu, Yuan-Ting; Hu, Ronghang; Ryali, Chaitanya; Ma, Tengyu; Khedr, Haitham; Rädle, Roman; Rolland, Chloe; Gustafson, Laura; Mintun, Eric; Pan, Junting; Alwala, Kalyan Vasudev; Carion, Nicolas; Wu, Chao-Yuan; Girshick, Ross; Dollár, Piotr; Feichtenhofer, Christoph. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714, 2024.