Consider a scenario where a human cleans a table and a robot observing the scene is instructed, “Robot, remove the cloth with which I wiped the table”. Instruction following with temporal reasoning requires the robot to identify the relevant past object interaction, ground the object of interest in the present scene, and execute the task according to the human’s instruction. Directly grounding utterances that reference past interactions is challenging due to the multi-hop nature of such references and the large space of possible object groundings in a video stream observing the robot’s workspace. Our key insight is to factor the temporal reasoning task into (i) estimating the video interval associated with the referenced event, (ii) performing spatial reasoning over the interaction frames to infer the intended object, and (iii) semantically tracking the object’s location up to the current scene to enable subsequent robot interaction. Our approach leverages existing large pre-trained models (which possess inherent generalization capabilities) and combines them appropriately for temporal grounding. Evaluation on a video-language corpus acquired with a robot manipulator, featuring rich temporal interactions in spatially complex scenes, shows an average accuracy of 70.10%, indicating the potential of G2TR in robot instruction following.
To assess the robot’s ability to perform grounded temporal reasoning, we form an evaluation corpus of 155 video-instruction pairs. The dataset was collected in a table-top setting with a Franka Emika robot manipulator observing the scene via an eye-in-hand RGB-D camera. A total of 15 objects representing common household and healthcare items, such as cups, bottles, medicines, fruits, notebooks, markers, and handkerchiefs, were used. Eight participants performed interactions such as pouring, placing, picking, stacking, replacing, wiping, dropping, repositioning, and swapping with 3–6 objects in each scene. The participants also provided natural language instructions for the robot referencing objects and interactions that occurred earlier in the workspace. This resulted in a corpus of 155 video-instruction pairs (each 6–30 seconds long). The instructions and interactions exhibit natural diversity in spatial and temporal reasoning complexity. For detailed analysis, the evaluation corpus is bifurcated into: (i) single/multi-hop temporal reasoning (56:99), (ii) simple/complex spatial grounding (98:57), (iii) single/multiple interactions (62:93), and (iv) partial/full observation of the referred object (36:119).
| Reasoning Complexity | Interaction Complexity |
|---|---|
| Single Hop Instruction: "Robot, pick the cloth that I just dropped!" | Single Interaction Instruction: "Robot, give the cloth that was just placed!" |
| Multi Hop Instruction: "Robot, give me the object placed second!" | Multi Interaction Instruction: "Robot, pick the cup in which water was poured from the orange bottle!" |

| Spatial/Visual Complexity | Observability |
|---|---|
| Spatially Simple Instruction: "Robot, remove the cloth used for wiping!" | Completely Observable Instruction: "Robot, pick the object that was just placed!" |
| Spatially Complex Instruction: "Robot, point to the cup which was just placed on the table!" | Partially Observable Instruction: "Robot, where is the medicine?" |
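To make the bifurcation concrete, below is a minimal sketch of how a single corpus entry could be annotated with the four attribute pairs above. The field names (`hops`, `spatial`, `interactions`, `observability`) and the example entry are illustrative assumptions, not the actual annotation schema used for the corpus.

```python
# Hypothetical annotation record for one video-instruction pair in the corpus.
# Field names and values are illustrative; the actual schema may differ.
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    video_path: str          # 6-30 second recording of the workspace
    instruction: str         # instruction referencing a past interaction
    hops: str                # "single" or "multi" (temporal reasoning)
    spatial: str             # "simple" or "complex" (spatial grounding)
    interactions: str        # "single" or "multiple" (interactions in the video)
    observability: str       # "full" or "partial" (visibility of the referred object)

example = CorpusEntry(
    video_path="videos/wipe_table_03.mp4",
    instruction="Robot, remove the cloth using which I wiped the table",
    hops="single",
    spatial="complex",
    interactions="multiple",
    observability="full",
)
```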
Demonstration 1 - A simple scenario: Video showing the robot executing the instruction "Pick the object that I just placed."
The robot in the scene is a Franka Emika arm manipulator. In front of the robot is a table on which several objects (glasses, a cup, a small bottle) are already placed, and the robot observes the scene with its camera. A person then places a banana on the table, and the instruction to pick the object that was just placed is given, which the robot executes.
Demonstration 2 - Visually complex scenario with multiple interactions: Video showing the robot carrying out the instruction "Robot, remove the cloth that was used to wipe the table"
The robot observes a table with four cloths (two green, one blue, one pink) and a cleaning liquid.
After a human rearranges the items and cleans with one cloth, the robot is instructed to remove the used cloth.
It then uses the G2TR pipeline to identify and dispose of the correct cloth in the bin.
Demonstration 3 - Scenario involving partial observability: Video showing the robot responding to the question "Robot, where is the marker that I just used?"
The robot, with its camera, observes a table with two books, two markers, and a bag.
A human uses a marker, places it in a book, and then puts the book in the bag.
The robot is instructed to identify and point to the partially occluded marker.
Using the G2TR pipeline, it successfully identifies and points to the bag containing the marker.
Demonstration 4 - Multi-hop reasoning scenario: Video showing the robot executing the instruction "Give me the bottle placed second by the human"
The robot, equipped with a camera, observes a shelf as a human sequentially places three bottles on it.
Later, another individual instructs the robot to retrieve the second bottle placed. To execute this task,
the robot leverages the G2TR pipeline,
which performs multi-hop reasoning to identify the correct bottle and the robot successfully fulfills the command.
To achieve generalized grounded temporal reasoning,
G2TR utilizes the reasoning capabilities of several
pre-trained large models. To ensure these models perform
effectively and appropriately for the designated tasks,
various prompts and prompting strategies were explored, as detailed below.
Temporal Parser: For this parser, an LLM
[GPT-4]
was employed.
The prompt provided a comprehensive background on the robotic setting, along with few-shot in-context examples, enabling the model to generate relevant questions for the subsequent modules.
example_parsing = [
    {"instruction": "Pick the bottle that I just placed.",
     "response": {
         "temporal_question": "When is placing by the person happening in the video? "
                              "Give the exact timestamp.",
         "object-interaction": "the bottle that was placed by the person",
         "action": "Pick"}},
    {"instruction": "Remove the object that was first eaten",
     "response": {
         "temporal_question": "When is eating of the first object by the person happening "
                              "in the video? Give the exact timestamp.",
         "object-interaction": "object eaten by person",
         "action": "Point"}},
    {"instruction": "Where is the apple?",
     "response": {
         "temporal_question": "When was the apple last seen? Give the exact timestamp.",
         "object-interaction": "human-interaction happening with apple",
         "action": "Point to"}},
    {"instruction": "Remove the cloth that was used by the boy",
     "response": {
         "temporal_question": "When is the using of cloth by the boy happening in the "
                              "video? Give the exact timestamp.",
         "object-interaction": "cloth used by boy",
         "action": "Remove"}},
]
There is a robot that needs to take a human instruction and formulate a temporal question as per the instruction, identify the object-interaction of interest and what action to take. Given the human instruction, return a dictionary with 'temporal_question', 'object-interaction', and 'action' as keys. For the 'object-interaction' key, remove the temporal aspect and clues such as 'at 2nd second' or 'last'. Return the answer in JSON format always.
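As an illustration, a minimal sketch of how the Temporal Parser could be invoked is shown below. It assumes the OpenAI chat completions API with JSON-mode output; the model identifier and the helper name `parse_instruction` are assumptions for illustration, not the actual implementation.

```python
# Minimal sketch of the Temporal Parser call (assumes the OpenAI Python SDK >= 1.0
# and JSON-mode output). Model name and helper name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # the system prompt quoted above ("There is a robot that ... JSON format always")

def parse_instruction(instruction: str, examples: list[dict]) -> dict:
    """Return {'temporal_question', 'object-interaction', 'action'} for one instruction."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    # Few-shot in-context examples, e.g. the example_parsing list shown earlier.
    for ex in examples:
        messages.append({"role": "user", "content": ex["instruction"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["response"])})
    messages.append({"role": "user", "content": instruction})

    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper reports using GPT-4
        response_format={"type": "json_object"},  # requires "JSON" to appear in the prompt
        messages=messages,
    )
    return json.loads(completion.choices[0].message.content)

# Example:
# parse_instruction("Robot, remove the cloth using which I wiped the table", example_parsing)
```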
"From the given images, identify with which OBJECT the INTERACTION happened. Return that object only. The object
interaction is : " + object-interaction (from temporal parser)
While asking the same image-understanding model to pick the target object from visual options, the prompt is varied slightly:
"From the given images, identify with which single OBJECT the INTERACTION happened. The last image given to you has visual options. Pick the right object number with which the INTERACTION is happening. Return the object label."
In partial observability cases, the target object goes out of view. In order to track the occluding object, the module is re-prompted as:
"Where did " + object_class + " go? Give the object that partially or completely hid the " + object_class.
For example: "Where did the marker go? Give the object that partially or completely hid the marker."
Once the marker goes out of view and the occluding object is determined to be a book, the module is re-prompted for the book:
"Where did the book go? Give the object that partially or completely hid the book."
The proposed model, G2TR, was compared against two baseline approaches,
each reflecting alternative methods for addressing the temporal reasoning
problem, resulting in different architectures for combining pre-trained models.
Only open-set models were included in the evaluation due to their generality.
G2TR: This model leverages a video-understanding VLM [CogVLM2-video] for candidate interval identification, followed by target detection and grounding using a vision-language model [GPT-4] and a phrase grounder [Grounding DINO 1.5 Pro]. It also integrates a tracker module [SAM 2] to handle environmental changes.
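To make this composition concrete, a minimal sketch of how the stages could be chained is shown below, following the factorization into parsing, event localization, target detection, grounding, and tracking. The helper callables (`parser`, `localizer`, `detector`, `grounder`, `tracker`) and the `video.frames(...)` / `interval.end` accessors are illustrative placeholders, not the actual implementation.

```python
# Illustrative composition of the G2TR stages. Each callable stands in for a
# pre-trained model: `localizer` for a video VLM (e.g., CogVLM2-video),
# `detector` for an image VLM (e.g., GPT-4), `grounder` for a phrase grounder
# (e.g., Grounding DINO 1.5 Pro), and `tracker` for a tracking model (e.g., SAM 2).
def g2tr(video, instruction, parser, localizer, detector, grounder, tracker):
    # 1. Parse the instruction into a temporal question, object-interaction, and action.
    parsed = parser(instruction)

    # 2. Localize the video interval in which the referenced interaction occurs.
    interval = localizer(video, parsed["temporal_question"])

    # 3. Spatial reasoning over the interaction frames to name the intended object.
    object_class = detector(video.frames(interval), parsed["object-interaction"])

    # 4. Ground the object class in the interaction frames (boxes / visual options).
    grounding = grounder(video.frames(interval), object_class)

    # 5. Propagate the grounding to the current scene so the robot can act on it.
    current_grounding = tracker(video, grounding, start=interval.end)

    return parsed["action"], current_grounding
```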
We compared the performance of the proposed model G2TR with the alternative approaches, Direct Temporal Visual Grounding (DTVG) and Refined Temporal Visual Grounding (RTVG). The proposed approach outperforms these alternatives by 26% in terms of overall average grounding accuracy. It also performs better than the alternatives in each of the scenarios considered in our dataset bifurcation.
Analysis of our pipeline, G2TR, shows that the main bottleneck is the Event Localizer's difficulty in precisely identifying the time instant of the language-referenced interaction. The next bottleneck occurs in the Target Detector, which can select an incorrect object class, receive insufficient visual prompts, or choose the wrong object label when the visual options are ambiguous. The Tracker occasionally begins tracking the wrong object when it appears identical to the correct one.
While our approach performs well in complex scenarios,
it struggles with rapid human-object interactions and single
actions involving multiple objects, such as replacing or
stacking. Another limitation of our pipeline is its tendency
to perform reasoning based on linguistic cues rather than
visual evidence. For instance, when tasked with identifying
the object into which water was poured, it may incorrectly
output a cup, even if the water was poured into a different
container.
a. Leveraging a combination of specialized temporal and spatial video- and image-understanding models can provide deeper contextual understanding than relying only on a general-purpose image-understanding vision-language model.
Interactions are inherently temporal, and videos capture the
full sequence, preserving context and flow. Converting this to the
image domain requires sub-sampling, which is challenging as the appropriate
rate depends on the complexity of the action. This can result in losing
crucial details, making video a more reliable medium for understanding interactions.
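To illustrate the sub-sampling trade-off, a small sketch using OpenCV is shown below: sampling frames at a fixed rate is simple, but a rate tuned for slow interactions can skip over a fast one (e.g., a quick wipe or drop). The sampling rate and file path are illustrative.

```python
# Uniform frame sub-sampling from a video with OpenCV. A fixed sampling rate
# that works for slow interactions may miss fast ones entirely, which is why
# reasoning over the video directly is preferred for event localization.
import cv2

def subsample_frames(video_path: str, sample_fps: float = 1.0):
    """Return frames sampled roughly `sample_fps` times per second of video."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unknown
    step = max(int(round(native_fps / sample_fps)), 1)

    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# A 1 FPS sample of a 10-second clip yields ~10 frames; a wipe lasting half a
# second may fall entirely between two sampled frames.
# frames = subsample_frames("videos/wipe_table_03.mp4", sample_fps=1.0)
```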
b. Grounding propagation through semantic tracking enhances G2TR's ability to maintain consistent object identification across frames, even after the initial detection, ensuring more robust performance in dynamic environments.
The relative position of the target object may change after the language-referenced interaction takes place, so it is essential that its position be determined in the robot's present view of the scene.
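The sketch below illustrates grounding propagation with a generic tracker interface: the box grounded in the interaction frame initializes a tracker, whose output on the final frame gives the object's present location. The `Tracker` interface is a stand-in for a segmentation/tracking model such as SAM 2; its method names are assumptions, not SAM 2's actual API.

```python
# Grounding propagation: initialize a tracker with the box grounded in the
# interaction frame, then propagate it to the last (current) frame so the robot
# acts on the object's present location. `Tracker` is a hypothetical interface.
from typing import Protocol, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

class Tracker(Protocol):
    def init(self, frame, box: Box) -> None: ...
    def update(self, frame) -> Box: ...

def propagate_grounding(frames: Sequence, start_index: int, box: Box,
                        tracker: Tracker) -> Box:
    """Track `box` from frames[start_index] through the final frame."""
    tracker.init(frames[start_index], box)
    current = box
    for frame in frames[start_index + 1:]:
        current = tracker.update(frame)  # follow the object as the scene changes
    return current  # location of the target object in the present scene
```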
c. Providing visual options to the image-understanding vision-language model enhances its adaptability by allowing it to better handle ambiguity and improve decision-making in complex visual scenarios.
Converting the spatial description of the target object into a linguistic description can be difficult in visually complex scenes and may confuse the grounding model. It is therefore easier and more effective to ground the object class first and then pick the target object from the resulting grounded visual options.
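As an illustration of how such visual options could be rendered, the sketch below draws numbered labels on the boxes returned by the phrase grounder, producing the annotated image that is passed back to the image-understanding model. The box format, colors, and font settings are assumptions.

```python
# Render numbered "visual options" on an image: each detection from the phrase
# grounder is drawn with a numeric label, and the annotated image is given to
# the image-understanding VLM to pick the target object. Box format is
# (x1, y1, x2, y2) in pixels; drawing parameters are illustrative choices.
import cv2

def draw_visual_options(image, boxes):
    """Return a copy of `image` with each box drawn and labeled 1..N."""
    annotated = image.copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        pt1, pt2 = (int(x1), int(y1)), (int(x2), int(y2))
        cv2.rectangle(annotated, pt1, pt2, (0, 255, 0), 2)
        cv2.putText(annotated, str(i), (pt1[0], max(pt1[1] - 5, 15)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    return annotated

# Example: annotate two candidate cloths detected by the grounder
# annotated = draw_visual_options(frame, [(50, 80, 180, 210), (220, 90, 340, 215)])
```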
We present G2TR, a novel approach to grounded temporal reasoning. We factorize the problem into three key components: (i) candidate interval localization in a video based on the referenced interaction, (ii) fine-grained spatial reasoning within the localized interval to ground the target object, and (iii) tracking the object post-interaction. By leveraging pre-trained vision-language models and large language models, G2TR achieves zero-shot generalization over both the object set and the interactions. We also propose a dataset of 155 video-instruction pairs covering spatially complex, multi-hop, partially observable, and multi-interaction temporal reasoning tasks. Evaluation on the dataset shows significant improvement over alternative approaches, highlighting G2TR's potential in robot instruction following. Finally, it is important to note that G2TR currently has two limitations: (i) it can only process videos up to one minute long, as constrained by the video-reasoning model, and (ii) it can ground only a single object at a time. We aim to overcome both limitations in future work.
1. [CogVLM2-video] Hong, Wenyi; Wang, Weihan; Ding, Ming; Yu, Wenmeng; Lv, Qingsong; Wang, Yan; Cheng, Yean; Huang, Shiyu; Ji, Junhui; Xue, Zhao; et al. CogVLM2: Visual Language Models for Image and Video Understanding. arXiv preprint arXiv:2408.16500, 2024.
2. [GPT-4] Achiam, Josh; Adler, Steven; Agarwal, Sandhini; Ahmad, Lama; Akkaya, Ilge; Aleman, Florencia Leoni; Almeida, Diogo; Altenschmidt, Janko; Altman, Sam; Anadkat, Shyamal; et al. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
3. [Grounding DINO 1.5 Pro] Ren, Tianhe; Jiang, Qing; Liu, Shilong; Zeng, Zhaoyang; Liu, Wenlong; Gao, Han; Huang, Hongjie; Ma, Zhengyu; Jiang, Xiaoke; Chen, Yihao; Xiong, Yuda; Zhang, Hao; Li, Feng; Tang, Peijun; Yu, Kent; Zhang, Lei. Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection. arXiv preprint arXiv:2405.10300, 2024.
4. [SAM 2] Ravi, Nikhila; Gabeur, Valentin; Hu, Yuan-Ting; Hu, Ronghang; Ryali, Chaitanya; Ma, Tengyu; Khedr, Haitham; Rädle, Roman; Rolland, Chloe; Gustafson, Laura; Mintun, Eric; Pan, Junting; Alwala, Kalyan Vasudev; Carion, Nicolas; Wu, Chao-Yuan; Girshick, Ross; Dollár, Piotr; Feichtenhofer, Christoph. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714, 2024.