Real-Time 3D Vision-Language Embedding Mapping

This work introduces a method for real-time 3D mapping of image embeddings using only RGB-D streams, without requiring ground-truth camera poses or environment-specific pretraining. Our approach is environment- and task-agnostic, providing a foundation for language-conditioned robotic tasks.

Teaser queries: "first aid kit" and "floor".

Abstract

A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple yet powerful way to integrate the 2D embeddings of a Vision-Language Model into a metric-accurate 3D representation in real time. We combine a local embedding masking strategy, which yields a more distinct embedding distribution, with a confidence-weighted 3D integration that produces more reliable 3D embeddings. The resulting metric-accurate embedding representation is task-agnostic and can represent semantic concepts at a global multi-room level as well as at a local object level. This enables a variety of interactive robotic applications that require the localisation of objects of interest via natural language. We evaluate our approach on a variety of real-world sequences and demonstrate that these strategies achieve more accurate object-of-interest localisation while improving runtime performance to meet our real-time constraints. We further demonstrate the versatility of our approach in interactive handheld, mobile robotics, and manipulation tasks, requiring only raw image data.
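The confidence-weighted 3D integration can be pictured as a per-voxel weighted running average of the incoming 2D embeddings. The following minimal Python sketch illustrates that idea; the function name, the source of the confidence value, and the re-normalisation step are our own assumptions, not the exact update rule of the paper.

import numpy as np

def fuse_embedding(voxel_embedding: np.ndarray,
                   voxel_weight: float,
                   new_embedding: np.ndarray,
                   confidence: float) -> tuple[np.ndarray, float]:
    """Confidence-weighted running average of per-voxel embeddings.

    voxel_embedding : fused embedding currently stored in the voxel, shape (D,)
    voxel_weight    : confidence accumulated in that voxel so far
    new_embedding   : masked local 2D embedding from the current frame, shape (D,)
    confidence      : confidence of the new observation, e.g. derived from mask
                      quality or viewing distance (an assumption, not the paper's rule)
    """
    total = voxel_weight + confidence
    fused = (voxel_weight * voxel_embedding + confidence * new_embedding) / total
    # Re-normalise so cosine similarity against text embeddings stays meaningful.
    fused /= (np.linalg.norm(fused) + 1e-8)
    return fused, total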



Evaluation

We collect RGB-D sequences using an Orbbec Femto Bolt Time-of-Flight camera in three setups: handheld, wrist-mounted on a Franka Emika arm, and mounted on a Segway RMP Lite 220 mobile base. Color and depth images are captured at 1280×720 resolution and 30 Hz.



Query on 2D Image

We compare our method against ConceptFusion and LangSplat using two OpenCLIP variants. For evaluation, we manually annotated frames from the kitchen sequence and computed the mIoU PR AUC across multiple text queries. While LangSplat achieves higher accuracy, it cannot operate in real time due to its offline embedding pipeline. Our method offers a strong trade-off: it improves text-to-embedding alignment over ConceptFusion and achieves real-time inference at 1.53 Hz. The table below reports mIoU PR AUC scores for different text queries on the kitchen sequence. Our method with ViT-B/16 delivers the best average accuracy among the real-time-capable methods; LangSplat is included only as an upper-bound reference and is not real-time capable (*).

mIoU PR AUC per text query on the kitchen sequence (CF: ConceptFusion, LS: LangSplat; *: not real-time capable).

Object           CF (ViT-H/14)   LS* (ViT-B/16)   Ours (ViT-H/14)   Ours (ViT-B/16)
sugar            0.170           0.070            0.043             0.075
milk             0.702           0.954            0.325             0.535
faucet           0.030           0.704            0.138             0.119
towel            0.613           0.875            0.726             0.709
detergent        0.159           0.953            0.291             0.288
chair            0.895           0.570            0.375             0.346
coffee machine   0.403           0.972            0.685             0.779
sink             0.495           0.699            0.529             0.569
socket           0.354           0.763            0.396             0.362
sponge           0.404           0.677            0.235             0.281
paper towel      0.365           0.516            0.459             0.559
table top        0.261           0.537            0.399             0.494
average          0.404           0.691*           0.383             0.426
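For reference, the sketch below shows one plausible reading of a per-query PR AUC: sweep a threshold over the per-pixel similarity map of a text query, compute precision and recall against the manually annotated mask, and integrate the resulting curve. The exact mIoU PR AUC protocol used in the paper may differ; the function and parameter names are illustrative.

import numpy as np

def pr_auc(similarity: np.ndarray, gt_mask: np.ndarray, steps: int = 50) -> float:
    """Area under the precision-recall curve for a single text query.

    similarity : per-pixel cosine similarity to the text embedding, shape (H, W)
    gt_mask    : binary ground-truth segmentation for the query, shape (H, W)
    """
    sims = similarity.ravel()
    gt = gt_mask.ravel().astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(sims.min(), sims.max(), steps):
        pred = sims >= t
        if pred.sum() == 0 or gt.sum() == 0:
            continue
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / pred.sum())
        recalls.append(tp / gt.sum())
    # Sort by recall before integrating (recall shrinks as the threshold rises).
    order = np.argsort(recalls)
    return float(np.trapz(np.array(precisions)[order], np.array(recalls)[order]))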


Text-Based 3D Querying

By combining camera pose estimation, masked local embedding extraction, and their integration into a 3D map, our method produces a unified color and embedding representation in real time. We qualitatively compare query responses on the kitchen sequence against ConceptFusion and LangSplat. Unlike these baselines, which operate offline, our system processes streaming data and selectively skips frames to maintain real-time performance. Our multivalued hash map yields fewer visual artifacts than ConceptFusion's point cloud and outperforms LangSplat in unobserved regions, while also providing metric-accurate geometry essential for robotic applications.
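A text query against such a map reduces to encoding the prompt with the same Vision-Language Model and scoring every stored 3D embedding by cosine similarity. Below is a minimal sketch assuming OpenCLIP ViT-B/16 and a plain dictionary keyed by quantised voxel coordinates as a stand-in for the multivalued hash map; query_map and the colouring example are illustrative, not the actual implementation.

import numpy as np
import torch
import open_clip

# Load the same OpenCLIP variant used for the 2D embeddings (ViT-B/16 here).
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

def query_map(voxel_map: dict, prompt: str):
    """Score every voxel embedding against a natural-language prompt.

    voxel_map maps quantised (x, y, z) voxel indices to fused, unit-length
    embeddings; it is a simplified stand-in for the multivalued hash map.
    Returns the voxel coordinates and their cosine similarities to the prompt.
    """
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer([prompt]))
        text_feat = torch.nn.functional.normalize(text_feat, dim=-1)[0].cpu().numpy()

    coords = np.array(list(voxel_map.keys()), dtype=np.int64)   # (N, 3)
    embeds = np.stack(list(voxel_map.values()))                 # (N, D), unit length
    sims = embeds @ text_feat                                   # cosine similarity
    return coords, sims

# Example: colour voxels by similarity (blue: low, red: high) after min-max scaling.
# coords, sims = query_map(voxel_map, "coffee machine")
# s = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)
# colours = np.stack([s, np.zeros_like(s), 1.0 - s], axis=-1)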

3D environment reconstruction of the kitchen sequence with similarity heatmaps for text queries (blue: low, red: high). The queries coffee machine, milk, sink, and socket are marked with red and black bounding boxes. Thanks to the use of masked local embeddings, our method produces sharper query responses than ConceptFusion and exhibits fewer outliers than LangSplat. Additionally, our dense color model (first row) contains significantly fewer visual artifacts.

Panels, left to right: ConceptFusion, LangSplat, Ours (ViT-B/16).

We showcase qualitative reconstructions and similarity heatmaps from the kitchen, workshop, and table sequences, demonstrating how query specificity affects localization. In the kitchen, a general query like coffee highlights both the coffee machine and the milk carton, capturing broader contextual relevance. In contrast, querying for milk results in a focused response limited to the milk carton. Similarly, in the workshop, a precise query for the Bosch drill produces a sharper activation than the more generic drill query. These examples highlight the open-set nature of our method, enabling flexible interaction across diverse environments and semantic levels.



Interactive Tasks

To move beyond passive perception, robots must engage with their environment through goal-directed actions informed by high-level semantic understanding. Our approach enables this by grounding language queries in a unified 3D representation that is built online from RGB-D input. We explore how such a representation facilitates interactive tasks in both local and extended workspaces, ranging from precision-driven pick-and-place operations to semantically guided navigation across multi-room environments. These demonstrations highlight the system's ability to generalize across scales and platforms while maintaining real-time responsiveness and open-vocabulary flexibility.

Manipulation

In our interactive pick-and-place system, users can specify both the object to grasp and the desired placement location through natural language queries. Our method continuously estimates the camera pose, builds a 3D color map, and fuses local language-conditioned embeddings into the scene in real time. When a query is issued, the system computes similarity between the 3D embedding map and the text, segmenting the scene into source and target point clouds. From these, we extract a grasp pose by projecting points onto a plane aligned with the robot's approach direction and selecting the most compact region suitable for a parallel gripper.
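A hedged sketch of that last step: given the source point cloud segmented by the query, project the points onto the plane perpendicular to an assumed top-down approach direction, run a 2D PCA, and close a parallel gripper along the most compact axis. The helper name, the fixed approach direction, and the gripper-width check are illustrative assumptions rather than the paper's exact procedure.

import numpy as np

def grasp_from_points(points: np.ndarray,
                      approach_dir: np.ndarray = np.array([0.0, 0.0, -1.0]),
                      max_gripper_width: float = 0.08):
    """Estimate a parallel-gripper grasp from a segmented object point cloud.

    points       : (N, 3) points of the queried object in the robot frame
    approach_dir : unit vector along which the gripper approaches (assumed top-down)
    Returns the grasp centre (3,), the closing axis (3,) and the required jaw
    opening in metres, or None if the object is wider than the gripper.
    """
    approach = approach_dir / np.linalg.norm(approach_dir)
    centre = points.mean(axis=0)

    # Build an orthonormal basis (u, v) of the plane perpendicular to the approach.
    tmp = np.array([1.0, 0.0, 0.0])
    if abs(approach @ tmp) > 0.9:
        tmp = np.array([0.0, 1.0, 0.0])
    u = np.cross(approach, tmp)
    u /= np.linalg.norm(u)
    v = np.cross(approach, u)

    # Project the points into that plane and run a 2D PCA; the minor axis is the
    # most compact direction, i.e. the preferred closing axis for a parallel gripper.
    offsets = points - centre
    planar = np.stack([offsets @ u, offsets @ v], axis=1)       # (N, 2)
    _, eigvecs = np.linalg.eigh(np.cov(planar.T))
    minor = eigvecs[:, 0]                                       # smallest variance
    extent = planar @ minor
    width = float(extent.max() - extent.min())

    if width > max_gripper_width:
        return None                                             # too wide to grasp
    closing_axis = minor[0] * u + minor[1] * v
    return centre, closing_axis, width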

Room- and Object-Level Localisation

Our method enables real-time, memory-efficient reconstruction of large-scale environments while maintaining both spatial coverage and semantic precision. In the office-floor scenario, we reconstruct a unified 3D map that supports queries at multiple spatial scales, from multi-room regions down to individual objects. It identifies broad semantic regions, such as floors extending across several rooms, as well as individual objects including a first aid kit and a bookshelf. The system remains interactive throughout the reconstruction, updating the map live as new data streams in.
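Because the same map answers queries at several semantic scales, a navigation or inspection goal can be derived directly from the similarity response. The short sketch below, reusing the hypothetical query_map output from the earlier example, simply takes the centroid of the strongest responses as a metric goal point; the voxel size and top fraction are illustrative values.

import numpy as np

def goal_from_query(coords: np.ndarray, sims: np.ndarray,
                    voxel_size: float = 0.05, top_fraction: float = 0.02):
    """Turn a similarity response into a metric 3D goal point.

    coords : (N, 3) voxel indices returned by the query
    sims   : (N,) cosine similarities for the text prompt
    A broad query such as "floor" activates a large region, while an object-level
    query such as "first aid kit" concentrates on few voxels; either way the
    centroid of the strongest responses yields a usable goal (assuming the map
    origin coincides with the world origin).
    """
    k = max(1, int(top_fraction * len(sims)))
    top = np.argsort(sims)[-k:]                     # indices of the strongest responses
    return coords[top].mean(axis=0) * voxel_size    # voxel indices -> metres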
