

Spotlight Poster

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Aravind Rajeswaran · Arjun Majumdar · Mikael Henaff · Oleksandr Maksymets · Krishna Murthy Jatavallabhula · Mrinal Kalakrishnan · Sergio Arnaud · Paul McVay · Alexander Sax · Ada Martin · Ruslan Partsey · Franziska Meier · Nicolas Ballas · Michael Rabbat · Mahmoud Assran · Ayush Jain · Vincent-Pierre Berges · Ang Cao · Abha Gejji · Daniel Dugas · Phillip Thomas · Ishita Prasad

Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

We present Locate 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". Locate 3D sets a new state of the art on standard referential grounding benchmarks and shows robust generalization. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D images), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. The input to 3D-JEPA is a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Masked prediction in latent space is then employed as a pretext task to learn contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LX3D, a new dataset for 3D referential grounding that spans multiple capture setups and contains over 130K annotations, enabling both a systematic study of generalization and the training of a stronger model.
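The sketch below is not from the paper; all class and variable names are hypothetical, and the real 3D-JEPA encoder, masking strategy, and loss may differ. It only illustrates the kind of masked latent-prediction pretext task the abstract describes: a point cloud carrying lifted 2D foundation-model features is split into visible and masked points, a context encoder embeds the visible points, a small predictor regresses the latents of the masked points, and an exponential-moving-average target encoder supplies those targets, so no pixel- or point-level reconstruction is required.

```python
# Minimal, assumption-heavy sketch of a JEPA-style masked latent-prediction step
# on a featurized point cloud. Placeholder MLPs stand in for the actual encoders.
import copy
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """Toy per-point encoder over (xyz + lifted 2D features); a stand-in, not the paper's encoder."""
    def __init__(self, feat_dim: int, latent_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, xyz, feats):
        return self.mlp(torch.cat([xyz, feats], dim=-1))  # (N, latent_dim)

feat_dim, latent_dim, num_points = 768, 256, 2048
context_encoder = PointEncoder(feat_dim, latent_dim)
predictor = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                          nn.Linear(latent_dim, latent_dim))
target_encoder = copy.deepcopy(context_encoder)          # EMA copy, never trained by gradient
for p in target_encoder.parameters():
    p.requires_grad_(False)

# One training step on a single (synthetic) featurized point cloud.
xyz = torch.rand(num_points, 3)                          # sensor point positions
feats = torch.randn(num_points, feat_dim)                # placeholder for lifted CLIP/DINO features
mask = torch.rand(num_points) < 0.5                      # points whose latents must be predicted

with torch.no_grad():
    targets = target_encoder(xyz[mask], feats[mask])     # latent targets for masked points

context = context_encoder(xyz[~mask], feats[~mask])      # encode only the visible points
pooled = context.mean(dim=0, keepdim=True)               # crude global summary of visible context
pred = predictor(pooled.expand(targets.shape[0], -1))    # predict masked latents from context

loss = nn.functional.smooth_l1_loss(pred, targets)       # masked prediction in latent space
loss.backward()

# EMA update of the target encoder (momentum 0.99, a common JEPA-style choice).
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(0.99).add_(p_c, alpha=0.01)
```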
