Emergent Visual Grounding in
Large Multimodal Models
Without Grounding Supervision

University of Illinois Urbana-Champaign
[Teaser figure]

Our approach unlocks and enhances the grounding ability implicitly learned by LMMs without explicit grounding supervision, leading to visually grounded responses while preserving the general vision-language conversation ability.

Abstract

Current large multimodal models (LMMs) face challenges in visual grounding, which requires the model to relate language components to visual entities. Contrary to the common practice of fine-tuning LMMs with additional grounding supervision, we find that grounding ability can be implicitly learned by LMMs, to some extent, without explicit grounding supervision that sacrifices general conversation ability. To unlock this grounding ability, we first introduce a training-free strategy, "Attend-and-Segment," which analyzes the attention within an off-the-shelf LMM to provide a point prompt to a segmentation model (e.g., SAM) and perform pixel-level segmentation. This strategy instantly enables visual grounding for existing LMMs while keeping their original conversation ability intact. Second, motivated by the vision-language alignment and localized features embedded in diffusion models, we propose DiffLMM, a LLaVA-like LMM that uses a diffusion-based visual encoder instead of the standard CLIP visual encoder. This design enhances the implicit grounding ability without changing the training data. Free from the biases and limited scale of grounding-specific supervision data, our approach enables strong visual grounding while preserving general conversation capabilities. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a grounding mask recall of 46.4 on grounded conversation generation, outperforming the extensively supervised model GLaMM.

🚀 Attend-and-Segment: Unlocking Implicit Grounding

[Figure: Attend-and-Segment]

We introduce Attend-and-Segment, a training-free strategy to unlock the implicit grounding ability of LMMs. By analyzing the attention mechanism, we can determine "where the LMM is looking" and use this information to prompt a segmentation model like SAM to produce pixel-level grounding.
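
As a concrete illustration, the minimal sketch below maps a token's attention over image patches to a single point prompt for SAM. It assumes you have already extracted such an attention vector from the LMM (e.g., by averaging over heads with output_attentions=True in Hugging Face Transformers); the placeholder attention vector, the 24x24 patch grid, the helper attention_to_point, and the checkpoint path are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of Attend-and-Segment, under the assumptions stated above.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

def attention_to_point(attn: torch.Tensor, grid_hw, image_hw) -> np.ndarray:
    """Convert attention over image patches into one (x, y) pixel coordinate."""
    gh, gw = grid_hw                      # patch grid of the visual encoder, e.g., 24x24
    h, w = image_hw                       # original image resolution
    idx = int(torch.argmax(attn).item())  # index of the most-attended patch
    row, col = divmod(idx, gw)
    x = (col + 0.5) * w / gw              # center of the winning patch, in pixels
    y = (row + 0.5) * h / gh
    return np.array([[x, y]], dtype=np.float32)

image = np.array(Image.open("example.jpg").convert("RGB"))   # any RGB test image
attn = torch.rand(24 * 24)  # placeholder: replace with real LMM attention over image patches

# Prompt an off-the-shelf SAM with the point the LMM "looked at".
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image)
point = attention_to_point(attn, grid_hw=(24, 24), image_hw=image.shape[:2])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),   # 1 marks a foreground point
    multimask_output=True,
)
best_mask = masks[int(np.argmax(scores))]  # pixel-level mask grounding the phrase
```

Because the LMM itself is left untouched, its original conversation output can be returned alongside the mask.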

💎 DiffLMM: Diffusion-Based LMM for Enhanced Grounding

[Figure: DiffLMM]

We propose DiffLMM, a novel LMM architecture that leverages a diffusion-based visual encoder. This design provides stronger localized features compared to standard CLIP encoders, enhancing the model's implicit grounding capabilities without requiring additional training data.
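
For intuition, here is a hedged sketch of how spatial features can be pulled from a text-to-image diffusion model and flattened into visual tokens; the Stable Diffusion backbone (runwayml/stable-diffusion-v1-5), the mid-block hook, and the noise timestep are assumptions made for illustration, not DiffLMM's exact configuration.

```python
# A hedged sketch of a diffusion-based visual encoder, under the assumptions above.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.to(device)

features = {}
def grab_mid_block(module, inputs, output):
    features["mid"] = output              # spatial activation, roughly (B, C, h, w)
pipe.unet.mid_block.register_forward_hook(grab_mid_block)

@torch.no_grad()
def diffusion_visual_tokens(pixel_values: torch.Tensor, t: int = 100) -> torch.Tensor:
    """pixel_values: (B, 3, H, W) in [-1, 1]. Returns (B, h*w, C) patch-like tokens."""
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.full((latents.shape[0],), t, device=latents.device, dtype=torch.long)
    noisy = pipe.scheduler.add_noise(latents, noise, timesteps)
    # Empty-prompt text embeddings for the UNet's cross-attention conditioning.
    ids = pipe.tokenizer(
        [""] * latents.shape[0], padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(latents.device)
    text_embeds = pipe.text_encoder(ids)[0]
    pipe.unet(noisy, timesteps, encoder_hidden_states=text_embeds)  # triggers the hook
    feat = features["mid"]
    return feat.flatten(2).transpose(1, 2)  # (B, h*w, C) tokens for a projector

img = Image.open("example.jpg").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1)[None] / 127.5 - 1.0
tokens = diffusion_visual_tokens(x.to(device))
```

As in LLaVA, such tokens would then pass through a learned projector into the language model's embedding space; the rest of the visual instruction tuning recipe is unchanged.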

🏆 Competitive Performance Without Grounding Supervision

Metrics: grounded conversation generation (Mask Recall, mIoU, METEOR) and general VQA (VQAv2, MMBench, MMStar).

Model                  | Mask Recall | mIoU | METEOR | VQAv2 | MMBench | MMStar
GLaMM                  | 40.8        | 65.6 | 15.8   | 24.4  | 36.8    | 12.8
LLaVA-1.5 + a&s (Ours) | 43.5        | 59.7 | 18.2   | 78.5  | 64.3    | 30.3
DiffLMM + a&s (Ours)   | 46.4        | 63.3 | 18.2   | 78.3  | 66.2    | 30.5

Our approach outperforms the extensively supervised GLaMM on grounded conversation generation while maintaining strong performance on general VQA benchmarks, all without any grounding supervision.

📚 BibTeX

@inproceedings{cao2025emergent,
  title={Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision},
  author={Shengcao Cao and Liang-Yan Gui and Yu-Xiong Wang},
  booktitle={ICCV Findings},
  year={2025}
}