G2TR - Generalized Grounded Temporal Reasoning for Robot Instruction Following via Coupled Pre-trained Models

Riya Arora1*, Niveditha Narendranath1*, Aman Tambi1+, Sandeep S. Zachariah1+,
Souvik Chakraborty1, Rohan Paul1
1Affiliated with IIT Delhi; * and + indicate equal contribution

Introduction

Consider a scenario where a human cleans a table and a robot observing the scene is instructed with the task "Robot, remove the cloth using which I wiped the table". Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task according to the human's instruction. Directly grounding utterances that reference past interactions is challenging due to the multi-hop nature of such references and the large space of candidate object groundings in a video stream observing the robot's workspace. Our key insight is to factor the temporal reasoning task as (i) estimating the video interval associated with the referenced event, (ii) performing spatial reasoning over the interaction frames to infer the intended object, and (iii) semantically tracking the object's location up to the current scene to enable future robot interactions. Our approach leverages existing large pre-trained models (which possess inherent generalization capabilities) and combines them appropriately for temporal grounding tasks. Evaluation on a video-language corpus acquired with a robot manipulator, displaying rich temporal interactions in spatially complex scenes, yields an average accuracy of 70.10%, indicating the potential of G2TR in robot instruction following.

Technical Approach Overview

Generalized Grounded Temporal Reasoning (G2TR) Pipeline:
We address the temporal reasoning task by grounding the requisite interaction in the input video to extract a candidate interval, and then pinpointing the intended object through fine-grained spatial reasoning within this interval. Our framework comprises three components: (1) a Temporal Parser (TP), (2) candidate interval estimation via an Event Localizer (EL), and (3) grounding the object of interest via a Target Detector (TD), followed by object re-acquisition via semantic tracking.

(i) Temporal Parser: The parser formulates two key questions: a temporal question that determines "when" the described interaction occurs, and an object-identification question that specifies "what" object is involved in the interaction. Additionally, it extracts the action that the robot needs to perform on the target object (as shown in the figure above). This is accomplished through in-context learning by providing a large language model (LLM) with the input instruction.
(ii) Event Localization: This module performs temporal reasoning over accrued past observations to identify the likely interval of the required interaction. It takes the entire video (containing the requisite object interaction) and the temporal question from the previous module, and returns the specific time instant at which the specified interaction occurred. This is implemented via a video-understanding vision language model.
(iii) Grounding Object of Interest: The purpose of this module is to perform fine-grained spatial reasoning and ground the target object involved in the interaction. It operates in three steps: (a) the input is sent to a Target Identifier (essentially an image-understanding vision language model) that extracts the class of the intended object to be grounded; (b) a class-based detection returns the bounding box coordinates of all objects belonging to that class, thus providing visual options; (c) the Target Identifier (VLM) then identifies the target by picking the intended object from these visual options.
(iv) Grounding Propagation via Semantic Tracking: This additional module reasons over the grounded object's state and location from the time of the past interaction until the current world state, enabling future robot manipulation. Again, a video-understanding vision language model with tracking abilities is leveraged for this purpose. A minimal sketch of how these modules compose is given below.
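In the following Python sketch, all function and argument names are hypothetical placeholders rather than released code; each callable stands in for one of the pre-trained models described above, and the composition mirrors steps (i)-(iv).

from typing import Any, Callable, Sequence

def g2tr_pipeline(
    video_frames: Sequence[Any],
    instruction: str,
    parse_instruction: Callable[[str], dict],                   # (i)   LLM temporal parser
    localize_event: Callable[[Sequence[Any], str], int],        # (ii)  video-VLM event localizer
    ground_target: Callable[[Sequence[Any], int, str], Any],    # (iii) VLM + phrase grounder
    track_to_present: Callable[[Sequence[Any], int, Any], Any], # (iv)  semantic tracker
) -> tuple:
    """Chain the four G2TR modules; each callable wraps one pre-trained model."""
    parsed = parse_instruction(instruction)
    t_event = localize_event(video_frames, parsed["temporal_question"])
    target_box = ground_target(video_frames, t_event, parsed["object-interaction"])
    current_box = track_to_present(video_frames, t_event, target_box)
    # The action (e.g., "Pick") and the target's current grounding are handed to the robot.
    return parsed["action"], current_box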

Evaluation Corpus

To assess the robot's ability to perform grounded temporal reasoning, we form an evaluation corpus of 155 video-instruction pairs. The dataset was collected in a table-top setting with a Franka Emika robot manipulator observing the scene via an eye-in-hand RGB-D camera. A total of 15 objects representing common household and healthcare items such as cups, bottles, medicines, fruits, notebooks, markers, and handkerchiefs were used. A total of 8 participants performed interactions such as pouring, placing, picking, stacking, replacing, wiping, dropping, repositioning, and swapping with 3-6 objects in each scene. The participants also provided natural language instructions for the robot referencing objects and interactions in the preceding activity in the workspace. Each resulting video-instruction pair is between 6 and 30 seconds long. The instructions and interactions exhibit a natural diversity of spatial and temporal reasoning complexity. For detailed analysis, the evaluation corpus is bifurcated as: (i) single/multi-hop temporal reasoning (56 : 99), (ii) simple/complex spatial grounding (98 : 57), (iii) single/multiple interactions (62 : 93), and (iv) partial or full observation of the referred object (36 : 119).



Example instructions across corpus categories (figure):
Reasoning complexity: Single hop: "Robot, pick the cloth that I just dropped!"; Multi hop: "Robot, give me the object placed second!"
Interaction complexity: Single interaction: "Robot, give the cloth that was just placed!"; Multi interaction: "Robot, pick cup in which water was poured by orange bottle!"
Spatial/visual complexity: Spatially simple: "Robot, remove the cloth used for wiping!"; Spatially complex: "Robot, point to the cup which was just placed on the table"
Observability: Completely observable: "Robot, pick the object that was just placed!"; Partially observable: "Robot, where is the medicine?"

Video Demonstrations

Demonstration 1 - A Simple Scenario: Video showing the robot executing the instruction "Pick the object that I just placed."
The robot in the scene is a Franka Emika arm manipulator. There is a table in front of the robot, on which various objects (glasses, a cup, a small bottle) are already placed, and the robot observes this scene with its camera. A person then places a banana on the table, the robot is instructed to pick the object that was just placed, and it executes the task.


Demonstration 2 - Visually complex scenario with multiple interactions: Video showing the robot carrying out the instruction "Robot, remove the cloth that was used to wipe the table"
The robot observes a table with four cloths (two green, one blue, one pink) and a cleaning liquid. After a human rearranges the items and cleans with one cloth, the robot is instructed to remove the used cloth. It then uses the G2TR pipeline to identify and dispose of the correct cloth in the bin.


Demonstration 3 - Scenario involving partial observability: Video showing the robot responding to the question "Robot, where is the marker that I just used?"
The robot, with its camera, observes a table with two books, two markers, and a bag. A human uses a marker, places it in a book, and then puts the book in the bag. The robot is instructed to identify and point to the partially occluded marker. Using the G2TR pipeline, it successfully identifies and points to the bag containing the marker.


Demonstration 4 - Multi-hop reasoning scenario: Video showing the robot executing the instruction "Give me the bottle placed second by the human"
The robot, equipped with a camera, observes a shelf as a human sequentially places three bottles on it. Later, another individual instructs the robot to retrieve the second bottle placed. To execute this task, the robot leverages the G2TR pipeline, which performs multi-hop reasoning to identify the correct bottle and the robot successfully fulfills the command.


Prompting Details

To achieve generalized grounded temporal reasoning, G2TR utilizes the reasoning capabilities of several pre-trained large models. To ensure these models perform effectively and appropriately for the designated tasks, various prompts and prompting strategies were explored, as detailed below.

Temporal Parser: For this parser, an LLM [GPT-4] was employed. The prompt provided a comprehensive background on the robotic setting, along with a few-shot learning approach using in-context examples, enabling it to generate relevant questions for the subsequent modules.


example_parsing = [
    {"instruction": "Pick the bottle that I just placed.",
     "response": {
         "temporal_question": "When is placing by the person happening in the video? "
                              "Give the exact timestamp.",
         "object-interaction": "the bottle that was placed by the person",
         "action": "Pick"}},

    {"instruction": "Remove the object that was first eaten",
     "response": {
         "temporal_question": "When is eating of first object by the person happening "
                              "in the video? Give the exact timestamp.",
         "object-interaction": "object eaten by person",
         "action": "Point"}},

    {"instruction": "Where is the apple?",
     "response": {
         "temporal_question": "When was the apple last seen? Give the exact timestamp.",
         "object-interaction": "human-interaction happening with apple",
         "action": "Point to"}},

    {"instruction": "Remove the cloth that was used by the boy",
     "response": {
         "temporal_question": "When is the using of cloth by the boy happening in the "
                              "video? Give the exact timestamp.",
         "object-interaction": "cloth used by boy",
         "action": "Remove"}},
]

There is a robot that needs to take a human instruction and formulate a temporal question as per the instruction,
identify the object-interaction of interest and what action to take. Given the human instruction, return a
dictionary with 'temporal_question', 'object-interaction', and 'action' as keys. For the 'object-interaction'
key, remove the temporal aspect and clues such as 'at 2nd second' or 'last'. Return the answer in JSON format always.
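As an illustration, the in-context examples and system instruction above could be assembled into a few-shot chat prompt. The sketch below assumes access through the OpenAI chat API; the client code, model name, and message framing are assumptions for illustration, not the exact setup used in G2TR.

import json
from openai import OpenAI  # assumption: the parser LLM is reached via the OpenAI chat API

# The parser instruction quoted above, abbreviated here for brevity.
SYSTEM_PROMPT = (
    "There is a robot that needs to take a human instruction and formulate a temporal "
    "question ... Return the answer in JSON format always."
)

def parse_instruction(instruction: str, examples: list[dict]) -> dict:
    """Temporal Parser: few-shot prompt the LLM and parse its JSON reply."""
    client = OpenAI()
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Provide the in-context examples (e.g., example_parsing above) as user/assistant turns.
    for ex in examples:
        messages.append({"role": "user", "content": ex["instruction"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["response"])})
    messages.append({"role": "user", "content": instruction})
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return json.loads(reply.choices[0].message.content)

For instance, parse_instruction("Remove the cloth used for wiping", example_parsing) would return the dictionary of temporal question, object-interaction, and action consumed by the downstream modules.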
                                              
          

Event Localizer (Candidate Interval Extraction): This module utilizes the temporal reasoning capabilities of the video-understanding VLM [CogVLM2-video], with the temporal question from the temporal parser serving as the prompt.
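Schematically, this module maps the temporal question and the video to a time instant. The sketch below assumes a generic video_vlm_answer wrapper (a hypothetical interface, not the CogVLM2-video API), and extracting the first numeric timestamp from the model's free-form answer is our own simplifying assumption.

import re
from typing import Any, Callable, Sequence

def localize_event(
    frames: Sequence[Any],
    temporal_question: str,
    video_vlm_answer: Callable[[Sequence[Any], str], str],  # hypothetical video-VLM wrapper
    fps: float,
) -> int:
    """Event Localizer: ask the video VLM "when" and convert its answer to a frame index."""
    answer = video_vlm_answer(frames, temporal_question)
    match = re.search(r"\d+(\.\d+)?", answer)   # pull the first number (seconds) out of free text
    seconds = float(match.group()) if match else 0.0
    return min(int(seconds * fps), len(frames) - 1)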

Grounding object of interest: This module employs the spatial reasoning and image understanding features of VLMs [GPT-4] along with the grounding capabilities of Phrase Grounding models [Grounding DINO 1.5 Pro], using the object-interaction from the temporal parser as the initial prompt.

"From the given images, identify with which OBJECT the INTERACTION happened. Return that object only. The object 
interaction is : " + object-interaction (from temporal parser)
          
While asking the same image-understanding model to pick the target object from visual options, the prompt is slightly varied:

"From the given images, identify with which single OBJECT the INTERACTION happened. The last image given to you has 
visual options. Pick the right object number with which the INTERACTION is happening.Return object label."
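Putting the three steps together, a minimal sketch of the grounding module might look as follows. The vlm_answer, detect_class, and draw_options callables are hypothetical wrappers around the image VLM and the phrase grounder, and the assumption that visual options are numbered starting from 1 is ours.

from typing import Any, Callable, Sequence

PROMPT_CLASS = ("From the given images, identify with which OBJECT the INTERACTION happened. "
                "Return that object only. The object interaction is : ")
PROMPT_PICK = ("From the given images, identify with which single OBJECT the INTERACTION happened. "
               "The last image given to you has visual options. Pick the right object number with "
               "which the INTERACTION is happening. Return object label.")

def ground_target(
    interaction_frames: Sequence[Any],
    object_interaction: str,
    vlm_answer: Callable[[Sequence[Any], str], str],  # image VLM wrapper (hypothetical)
    detect_class: Callable[[Any, str], list],         # phrase-grounder wrapper (hypothetical)
    draw_options: Callable[[Any, list], Any],         # overlays numbered boxes on a frame (hypothetical)
):
    """Target Detector: (a) class extraction, (b) class-based detection, (c) option selection."""
    # (a) ask the VLM which object class the interaction involved
    object_class = vlm_answer(interaction_frames, PROMPT_CLASS + object_interaction)
    # (b) detect every instance of that class to produce visual options
    boxes = detect_class(interaction_frames[-1], object_class)
    annotated = draw_options(interaction_frames[-1], boxes)
    # (c) ask the VLM to pick the numbered option corresponding to the target
    choice = vlm_answer(list(interaction_frames) + [annotated], PROMPT_PICK).strip()
    return boxes[int(choice) - 1] if choice.isdigit() else boxes[0]  # assumes labels start at 1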
          
In partial observability cases, the target object goes out of view. In order to track the occluding object, the module is re-prompted as:

"Where did " + object_class + " go? Give the object that partially or completely hid the " + object_class.
          

For example, if the target object is a marker that was kept inside a book, and the book was then put in a bag, the prompts would be:

"Where did the marker go? Give the object that partially or completely hid the marker." 

## And once the marker goes out of view and the occluding object is determined to be a book:

"Where did the book go? Give the object that partially or completely hid the book."
          

Alternate Approaches (Baselines)

The proposed model, G2TR, was compared against two baseline approaches, each reflecting alternative methods for addressing the temporal reasoning problem, resulting in different architectures for combining pre-trained models. Only open-set models were included in the evaluation due to their generality.

G2TR: This model leverages a video-understanding VLM [CogVLM2-video] for candidate interval identification, followed by target detection and grounding using a Vision Language Model [GPT-4] and a Phrase Grounder [Grounding DINO 1.5 Pro]. It also integrates a tracker module [SAM 2] to handle environmental changes.


Direct Temporal Visual Grounding (DTVG): This approach uses a video-understanding VLM for both temporal and spatial reasoning, followed by a Phrase Grounder to identify the intended object, e.g., “green cloth on the left.”

Refined Temporal Visual Grounding (RTVG): Similar to DTVG, this method incorporates Visual Question Answering (VQA) by the VLM to iteratively refine the object description, ensuring precise grounding.

Experimental Insights

We compared the performance of the proposed model G2TR with the alternative approaches, Direct Temporal Visual Grounding (DTVG) and Refined Temporal Visual Grounding (RTVG). The proposed approach outperforms these alternatives by 26% in overall average grounding accuracy. It also performs better than the alternatives in each of the scenario categories considered in our dataset bifurcation.

Analysis of our pipeline, G2TR, shows that the main bottleneck is the Event Localizer's difficulty in precisely identifying the time instant of the language-referenced interaction. The next bottleneck occurs in the Target Detector, which can fail by selecting an incorrect object class, by receiving insufficient visual prompts, or by picking the wrong object label when the visual options are ambiguous. The Tracker occasionally begins tracking the wrong object when it appears identical to the correct one. While our approach performs well in complex scenarios, it struggles with rapid human-object interactions and single actions involving multiple objects, such as replacing or stacking. Another limitation of our pipeline is its tendency to reason from linguistic cues rather than visual evidence. For instance, when tasked with identifying the object into which water was poured, it may incorrectly output a cup, even if the water was poured into a different container.


Component-Wise Analysis of Failure Cases (45/155)


Component-wise analysis of the frequency of failure cases generated by each module.

Failure Case Examples

(a) Incorrect time instant (Expected: 5, G2TR: 12)
Instruction: "Robot, pick bottle placed just before orange bottle"

(b) Incorrect object class (Expected: Bottle, G2TR: Cup)
Instruction: "Pick the container in which water was poured"

(c) Visual options ambiguity (object labels overlapping)
Instruction: "Pick the container by which water was poured"

(d) Incorrect object tracking (wrong cup tracked)
Instruction: "Robot, pick the cup under which strawberry is hidden!"

(e) Insufficient grounding (one cup not grounded)
Instruction: "Robot, point to the cup which was just used by the person"

The examples above display instances of failure within the G2TR framework.

Additional Insights

a. Leveraging a combination of specialized temporal and spatial video- and image-understanding models can provide deeper contextual understanding than a general-purpose image-understanding vision-language model alone.
Interactions are inherently temporal, and videos capture the full sequence, preserving context and flow. Converting this to the image domain requires sub-sampling, which is challenging as the appropriate rate depends on the complexity of the action. This can result in losing crucial details, making video a more reliable medium for understanding interactions.

b. Grounding propagation through semantic tracking enhances G2TR's ability to maintain consistent object identification across frames, even after the initial detection, ensuring more robust performance in dynamic environments.
The relative position of the target object might change after the language-referenced interaction took place. It is essential that its position be determined in the present world scene of the robot.

c. Providing visual options for the image-understanding vision-language model enhances its adaptability by allowing it to better handle ambiguity and improve decision-making in complex visual scenarios.
Converting the spatial description of the target object into a linguistic description can be difficult in visually complex scenarios and may confuse the grounding model. It is therefore easier and more effective to first ground the object class and then pick the target object from the resulting visual options.

Conclusion

We present G2TR, a novel approach to grounded temporal reasoning. We factorize the problem into three key components: (i) candidate interval localization in a video based on the required interaction, (ii) fine-grained spatial reasoning within the localized interval to ground the target object, and (iii) tracking the object post-interaction. By leveraging pre-trained vision language models and large language models, G2TR achieves zero-shot generalization over both the object set and the interactions. We also propose a dataset of 155 video-instruction pairs covering spatially complex, multi-hop, partially observable, and multi-interaction temporal reasoning tasks. Evaluation on the dataset shows significant improvement over alternative approaches, highlighting G2TR's potential in robot instruction following. Finally, it is important to note that G2TR currently has two limitations: (i) it can only process videos up to one minute, as constrained by the video-reasoning model, and (ii) it can ground only a single object at a time. We aim to overcome both limitations in future work.



References

1. [CogVLM2-video] Hong, Wenyi; Wang, Weihan; Ding, Ming; Yu, Wenmeng; Lv, Qingsong; Wang, Yan; Cheng, Yean; Huang, Shiyu; Ji, Junhui; Xue, Zhao; et al. CogVLM2: Visual Language Models for Image and Video Understanding. arXiv preprint arXiv:2408.16500, 2024.
2. [GPT-4] Achiam, Josh; Adler, Steven; Agarwal, Sandhini; Ahmad, Lama; Akkaya, Ilge; Aleman, Florencia Leoni; Almeida, Diogo; Altenschmidt, Janko; Altman, Sam; Anadkat, Shyamal; et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
3. [Grounding DINO 1.5 Pro] Ren, Tianhe; Jiang, Qing; Liu, Shilong; Zeng, Zhaoyang; Liu, Wenlong; Gao, Han; Huang, Hongjie; Ma, Zhengyu; Jiang, Xiaoke; Chen, Yihao; Xiong, Yuda; Zhang, Hao; Li, Feng; Tang, Peijun; Yu, Kent; Zhang, Lei. Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv preprint arXiv:2405.10300v2, 2024.
4. [SAM 2] Ravi, Nikhila; Gabeur, Valentin; Hu, Yuan-Ting; Hu, Ronghang; Ryali, Chaitanya; Ma, Tengyu; Khedr, Haitham; Rädle, Roman; Rolland, Chloe; Gustafson, Laura; Mintun, Eric; Pan, Junting; Alwala, Kalyan Vasudev; Carion, Nicolas; Wu, Chao-Yuan; Girshick, Ross; Dollár, Piotr; Feichtenhofer, Christoph. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714, 2024.