Emergent Visual Grounding in
Large Multimodal Models
Without Grounding Supervision

University of Illinois Urbana-Champaign
[Teaser figure]

Our approach unlocks and enhances the grounding ability implicitly learned by LMMs without explicit grounding supervision, leading to visually grounded responses while preserving the general vision-language conversation ability.

Abstract

Current large multimodal models (LMMs) face challenges in visual grounding, which requires the model to relate language components to visual entities. Contrary to the common practice of fine-tuning LMMs with additional grounding supervision, we find that grounding ability can be implicitly learned by LMMs, to some extent, without explicit grounding supervision that sacrifices general conversation ability. To unlock this grounding ability, we first introduce a training-free strategy, "Attend-and-Segment," which analyzes the attention within an off-the-shelf LMM to provide a point prompt to a segmentation model (e.g., SAM) and perform pixel-level segmentation. This strategy instantly enables visual grounding for existing LMMs while keeping their original conversation ability intact. Second, motivated by the vision-language alignment and localized features embedded in diffusion models, we propose DiffLMM, a LLaVA-like LMM that uses a diffusion-based visual encoder instead of the standard CLIP visual encoder. This design enhances the implicit grounding ability without changing the training data. Free from the biases and limited scale of grounding-specific supervision data, our approach enables strong visual grounding while preserving general conversation capabilities. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a grounding mask recall of 46.4 on grounded conversation generation, outperforming the extensively supervised model GLaMM.

🚀 Attend-and-Segment: Unlocking Implicit Grounding

[Figure: Attend-and-Segment]

We introduce Attend-and-Segment, a training-free strategy to unlock the implicit grounding ability of LMMs. By analyzing the attention mechanism, we can determine "where the LMM is looking" and use this information to prompt a segmentation model like SAM to produce pixel-level grounding.
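
As a concrete illustration, the minimal sketch below maps a token's attention over image patches to a single point prompt for SAM. It assumes you have already extracted such an attention vector from the LMM (e.g., by averaging over heads with output_attentions=True in Hugging Face Transformers); the placeholder attention vector, the 24x24 patch grid, the helper attention_to_point, and the checkpoint path are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of Attend-and-Segment, under the assumptions stated above.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

def attention_to_point(attn: torch.Tensor, grid_hw, image_hw) -> np.ndarray:
    """Convert attention over image patches into one (x, y) pixel coordinate."""
    gh, gw = grid_hw                      # patch grid of the visual encoder, e.g., 24x24
    h, w = image_hw                       # original image resolution
    idx = int(torch.argmax(attn).item())  # index of the most-attended patch
    row, col = divmod(idx, gw)
    x = (col + 0.5) * w / gw              # center of the winning patch, in pixels
    y = (row + 0.5) * h / gh
    return np.array([[x, y]], dtype=np.float32)

image = np.array(Image.open("example.jpg").convert("RGB"))   # any RGB test image
attn = torch.rand(24 * 24)  # placeholder: replace with real LMM attention over image patches

# Prompt an off-the-shelf SAM with the point the LMM "looked at".
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)
predictor.set_image(image)
point = attention_to_point(attn, grid_hw=(24, 24), image_hw=image.shape[:2])
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=np.array([1]),   # 1 marks a foreground point
    multimask_output=True,
)
best_mask = masks[int(np.argmax(scores))]  # pixel-level mask grounding the phrase
```

Because the LMM itself is left untouched, its original conversation output can be returned alongside the mask.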

💎 DiffLMM: Diffusion-Based LMM for Enhanced Grounding

[Figure: DiffLMM]

We propose DiffLMM, a novel LMM architecture that leverages a diffusion-based visual encoder. This design provides stronger localized features compared to standard CLIP encoders, enhancing the model's implicit grounding capabilities without requiring additional training data.
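
For intuition, here is a hedged sketch of how spatial features can be pulled from a text-to-image diffusion model and flattened into visual tokens; the Stable Diffusion backbone (runwayml/stable-diffusion-v1-5), the mid-block hook, and the noise timestep are assumptions made for illustration, not DiffLMM's exact configuration.

```python
# A hedged sketch of a diffusion-based visual encoder, under the assumptions above.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.to(device)

features = {}
def grab_mid_block(module, inputs, output):
    features["mid"] = output              # spatial activation, roughly (B, C, h, w)
pipe.unet.mid_block.register_forward_hook(grab_mid_block)

@torch.no_grad()
def diffusion_visual_tokens(pixel_values: torch.Tensor, t: int = 100) -> torch.Tensor:
    """pixel_values: (B, 3, H, W) in [-1, 1]. Returns (B, h*w, C) patch-like tokens."""
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.full((latents.shape[0],), t, device=latents.device, dtype=torch.long)
    noisy = pipe.scheduler.add_noise(latents, noise, timesteps)
    # Empty-prompt text embeddings for the UNet's cross-attention conditioning.
    ids = pipe.tokenizer(
        [""] * latents.shape[0], padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(latents.device)
    text_embeds = pipe.text_encoder(ids)[0]
    pipe.unet(noisy, timesteps, encoder_hidden_states=text_embeds)  # triggers the hook
    feat = features["mid"]
    return feat.flatten(2).transpose(1, 2)  # (B, h*w, C) tokens for a projector

img = Image.open("example.jpg").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1)[None] / 127.5 - 1.0
tokens = diffusion_visual_tokens(x.to(device))
```

As in LLaVA, such tokens would then pass through a learned projector into the language model's embedding space; the rest of the visual instruction tuning recipe is unchanged.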

🏆 Competitive Performance Without Grounding Supervision

Metrics: grounded conversation generation (Mask Recall, mIoU, METEOR) and general VQA (VQAv2, MMBench, MMStar).

Model                  | Mask Recall | mIoU | METEOR | VQAv2 | MMBench | MMStar
GLaMM                  | 40.8        | 65.6 | 15.8   | 24.4  | 36.8    | 12.8
LLaVA-1.5 + a&s (Ours) | 43.5        | 59.7 | 18.2   | 78.5  | 64.3    | 30.3
DiffLMM + a&s (Ours)   | 46.4        | 63.3 | 18.2   | 78.3  | 66.2    | 30.5

Our approach outperforms the extensively supervised GLaMM on grounded conversation generation while maintaining strong performance on general VQA benchmarks, all without any grounding supervision.

📚 BibTeX

@inproceedings{cao2025emergent,
  title={Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision},
  author={Shengcao Cao and Liang-Yan Gui and Yu-Xiong Wang},
  booktitle={ICCV Findings},
  year={2025}
}