This paper addresses the challenge of acquiring object models to facilitate task execution in unstructured, unknown environments. Illustrated through the scenario of commanding a robot to explore unfamiliar terrain and interact with objects, the necessity of a metric-semantic representation of the environment becomes apparent. Such a representation not only identifies the objects present but also maintains their geometric attributes for future interactions, such as manipulation or relocation. Handling large, unfamiliar objects, like trusses or tree branches, requires intricate reasoning across the grasping, transport, and placement phases, posing a significant challenge for a versatile manipulation agent. The paper proposes an approach that integrates prior knowledge from pre-trained models with real-time data to generate detailed object models, essential for sequential manipulation tasks. This method uses pre-trained Vision-and-Language Models (VLMs) to extract object masks from raw point clouds and integrates depth priors from foundation models for improved geometric accuracy. Furthermore, the approach includes mechanisms for building local maps and performing local repairs over sequential action execution. Experimental results demonstrate the effectiveness of the proposed approach in acquiring high-quality 3D object models compared to alternative methods in unstructured scenarios.
The proposed method, which utilizes depth priors, produces more accurate and complete 3D object models than the unfiltered approach (directly masking the raw point cloud). Specifically, the resulting models exhibit lower noise and smoother surfaces, and capture structural information that the unfiltered method lacks. (Ignore the colors.)
We quantitatively evaluate our method using two metrics: (i) Root Mean Squared Error (RMSE): RMSE quantifies the root of the mean squared distance from each point in the reconstructed model to the nearest point in the ground truth. Results indicate that the proposed approach has a lower error than the unfiltered approach with respect to the ground truth. (ii) Model Reconstruction Accuracy (MRA): MRA quantifies the percentage of points in the reconstructed model that lie within a distance d of the nearest point in the ground truth. Results indicate that the proposed method achieves a higher MRA than the unfiltered approach, meaning that a larger fraction of the reconstructed model's points fall within tolerance d of the ground truth.
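The two metrics above can be sketched in a few lines of NumPy/SciPy; the function names, the k-d-tree nearest-neighbor lookup, and the tolerance parameter name are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.spatial import cKDTree


def rmse(recon, gt):
    """Root of the mean squared distance from each reconstructed point
    to its nearest ground-truth point (illustrative sketch)."""
    dists, _ = cKDTree(gt).query(recon)  # nearest-neighbor distances
    return float(np.sqrt(np.mean(dists ** 2)))


def mra(recon, gt, d_tol):
    """Percentage of reconstructed points within distance d_tol of the
    nearest ground-truth point (illustrative sketch)."""
    dists, _ = cKDTree(gt).query(recon)
    return float(np.mean(dists <= d_tol) * 100.0)
```

Both metrics are asymmetric (they measure reconstruction-to-ground-truth distances only), so a reconstruction missing part of the object can still score well; completeness is judged qualitatively here.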
Visualization of the plan rollout and scene reconstruction for a scenario involving occlusion. Red arrows indicate the pose update of the object, while blue arrows represent local scene reconstruction. Initially, only the briefcase was visible. Upon removing the briefcase and subsequent local rebuilding, the hose was detected. The figure also demonstrates how the scene is updated when the object is manipulated without requiring global scene reconstruction.