Incorporating Foundation Model Priors in Modeling Novel Objects for Robot Instruction Following in Unstructured Environments

Moksh Malhotra1, Aman Tambi1*, Sandeep S. Zachariah1*, P. V. M. Rao1, Rohan Paul1
1Indian Institute of Technology Delhi; * indicates equal contribution.

Abstract

This paper addresses the challenge of acquiring object models to facilitate task execution in unstructured, unknown environments. Consider commanding a robot to explore unfamiliar terrain and interact with objects: the robot requires a metric-semantic representation of the environment that not only identifies the objects present but also maintains their geometric attributes for future interactions, such as manipulation or relocation. Handling large, unfamiliar objects, like trusses or tree branches, necessitates intricate reasoning during the grasping, transport, and placement phases, posing a significant challenge for a versatile manipulation agent. The paper proposes an approach that integrates prior knowledge from pre-trained models with real-time data to generate detailed object models, essential for sequential manipulation tasks. The method uses pre-trained Vision-and-Language Models (VLMs) to extract object masks from raw point clouds and integrates depth priors from foundation models for improved geometric accuracy. Furthermore, the approach includes mechanisms for building local maps and performing local repairs over sequential action execution. Experimental results demonstrate that the proposed approach acquires higher-quality 3D object models than alternative methods in unstructured scenarios.

Approach Overview

The proposed method sequentially builds a global point cloud from a sequence of posed RGB-D images. A guided filter employing depth priors from foundation models refines the noisy depth images. A semantic extractor detects all objects and generates a segmentation mask for each one; these masks are fused with the global point cloud to extract per-object 3D models.
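As an illustrative, non-authoritative sketch of one frame of this pipeline in Python, assume predict_depth_prior() wraps a pre-trained monocular depth foundation model and segment_objects() wraps the VLM-based semantic extractor; both names are placeholders, and the guided filtering and point-cloud fusion below use OpenCV's ximgproc module and Open3D rather than the exact implementation of the paper.

    import cv2
    import numpy as np
    import open3d as o3d

    def process_frame(rgb, depth_raw, intrinsics, pose, global_pcd):
        # intrinsics: o3d.camera.PinholeCameraIntrinsic; pose: 4x4 camera-to-world.
        # 1. Refine the noisy sensor depth with a guided filter, using a monocular
        #    depth prior from a foundation model as the guide image.
        #    predict_depth_prior is a placeholder (assumption, not the paper's model).
        prior = predict_depth_prior(rgb).astype(np.float32)
        refined = cv2.ximgproc.guidedFilter(
            guide=prior, src=depth_raw.astype(np.float32), radius=9, eps=1e-2)

        # 2. Back-project the refined depth and fuse it into the global point cloud.
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(rgb), o3d.geometry.Image(refined),
            depth_scale=1.0, convert_rgb_to_intensity=False)
        frame_pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsics)
        frame_pcd.transform(pose)
        global_pcd += frame_pcd

        # 3. Lift each 2D object mask from the semantic extractor onto the refined
        #    depth to carve out a per-object 3D model (segment_objects is a placeholder).
        object_models = {}
        for label, mask in segment_objects(rgb).items():
            masked_depth = np.where(mask, refined, 0.0).astype(np.float32)
            obj_rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
                o3d.geometry.Image(rgb), o3d.geometry.Image(masked_depth),
                depth_scale=1.0, convert_rgb_to_intensity=False)
            obj_pcd = o3d.geometry.PointCloud.create_from_rgbd_image(obj_rgbd, intrinsics)
            obj_pcd.transform(pose)
            object_models[label] = obj_pcd
        return global_pcd, object_models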

Results

3D Object Models

Qualitative Comparison

[Image 1: Qualitative comparison of reconstructed 3D object models]

The proposed method, which utilizes depth priors, produces more accurate and complete 3D object models than the unfiltered approach. Specifically, the resulting models exhibit reduced noise and smoother surfaces, and capture structural information that is lacking in the unfiltered method (directly masking the raw point cloud). (Colors are for visualization only.)

Quantitative Analysis

[Image 2: Quantitative comparison using RMSE and MRA]

We quantitatively evaluate our method using two metrics: (i) Root Mean Squared Error (RMSE): RMSE quantifies the root mean squared distance from each point in the reconstructed model to the nearest point in the ground truth. Results indicate that the proposed approach has a lower error with respect to the ground truth than the unfiltered approach. (ii) Model Reconstruction Accuracy (MRA): MRA quantifies the percentage of points in the reconstructed model that lie within a distance d of the nearest point in the ground truth. Results indicate that the proposed method achieves a higher Model Reconstruction Accuracy than the unfiltered approach, meaning that a larger fraction of points in the reconstructed model lie within tolerance d of the ground truth.
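For reference, both metrics can be computed from nearest-neighbour distances between the reconstructed and ground-truth point sets. The following Python sketch is illustrative only; the KD-tree query and the default tolerance d are our choices and are not prescribed by the paper.

    import numpy as np
    from scipy.spatial import cKDTree

    def reconstruction_metrics(recon_pts, gt_pts, d=0.01):
        # Distance from every reconstructed point to its nearest ground-truth point.
        nn_dist, _ = cKDTree(gt_pts).query(recon_pts)
        rmse = np.sqrt(np.mean(nn_dist ** 2))   # Root Mean Squared Error
        mra = 100.0 * np.mean(nn_dist <= d)     # % of points within tolerance d
        return rmse, mra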



Local Scene Update Post Action Execution


Visualization of the plan rollout and scene reconstruction for a scenario involving occlusion. Red arrows indicate the pose update of the object, while blue arrows represent local scene reconstruction. Initially, only the briefcase was visible. After the briefcase was removed and the local region was rebuilt, the hose was detected. The figure also demonstrates how the scene is updated when an object is manipulated, without requiring global scene reconstruction.
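A minimal sketch of such a local update is given below, assuming each manipulation action reports the rigid transform T_action applied to the object and that a fresh local reconstruction of the affected region (new_local_pcd) is available; the axis-aligned bounding-box crop is an illustrative simplification, not the paper's exact procedure.

    import open3d as o3d

    def local_scene_update(global_pcd, object_pcd, T_action, new_local_pcd):
        # Region of the map affected by the action: the object's old footprint.
        stale_box = object_pcd.get_axis_aligned_bounding_box()

        # 1. Pose update (red arrows in the figure): rigidly move the object's
        #    model by the transform applied during manipulation.
        object_pcd.transform(T_action)

        # 2. Local rebuild (blue arrows): drop stale map points inside the vacated
        #    region and insert the freshly reconstructed local cloud, which may
        #    reveal previously occluded objects (e.g. the hose behind the briefcase).
        stale_idx = stale_box.get_point_indices_within_bounding_box(global_pcd.points)
        global_pcd = global_pcd.select_by_index(stale_idx, invert=True)
        global_pcd += new_local_pcd

        return global_pcd, object_pcd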