A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Lixin Xiu, Xufang Luo, Hideki Nakayama

1The University of Tokyo, 2Microsoft Research
*Corresponding authors: xufluo@microsoft.com, nakayama@ci.i.u-tokyo.ac.jp
ICLR 2026

TL;DR: LVLM predictions do not always reflect true multimodal fusion; reliance on language priors versus cross-modal synergy varies by task and model family.

Introduction

Large vision-language models (LVLMs) achieve remarkable success across multimodal tasks, yet their internal decision-making processes remain opaque. Accuracy alone cannot reveal whether a correct prediction arises from genuine multimodal fusion, from reliance on language priors, or from visual evidence in isolation. Existing interpretability efforts often adopt a "micro-scope" view, analyzing one modality in isolation, or introduce ad hoc metrics that lack firm theoretical grounding. We turn to partial information decomposition (PID) [1], a rigorous information-theoretic framework that decomposes the mutual information between two sources and a target into four non-negative atoms: redundancy R, vision uniqueness U1, language uniqueness U2, and synergy S.

Building on the BATCH estimator [2], we develop a model-agnostic pipeline that estimates PID quantities for LVLMs without changing architectures or retraining. We profile 26 models from 11 families (0.5B–90B parameters) on four benchmarks spanning general reasoning, hallucination evaluation, and domain-specific knowledge. Our analysis covers three complementary dimensions: breadth (cross-model & cross-task), depth (layer-wise information dynamics via logit lens), and time (learning dynamics across LLaVA-1.5's two-stage training).

Our study reveals two task regimes—synergy-driven vs. knowledge-driven—and two stable family-level strategies—fusion-centric vs. language-centric. We uncover a consistent three-phase pattern of information flow across layers and identify visual instruction tuning as the key stage where multimodal fusion is learned. Together, these findings provide a quantitative lens beyond accuracy-only evaluation, offering actionable insights for analyzing and designing the next generation of LVLMs.

PID Decomposition

PID Venn diagram

Classical mutual information I(X1, X2; Y) quantifies the total information that the two sources jointly carry about the target, but it cannot disentangle the complex interactions between multiple inputs such as vision and text. Partial Information Decomposition [1] splits this total into four non-negative atoms:

  • Redundancy R — Information shared by both vision and language
  • Vision Uniqueness U1 — Information only the image provides
  • Language Uniqueness U2 — Information only the text provides
  • Synergy S — Information that emerges only from combining both modalities

PID quantifies whether a prediction is driven by unimodal priors or true multimodal fusion, providing a principled lens for probing LVLM internals.
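To make the decomposition concrete, the sketch below computes the four atoms on a toy discrete distribution. It uses the simple minimum-mutual-information redundancy R = min(I(X1;Y), I(X2;Y)) purely for illustration; the paper's actual estimator is BATCH [2], so this redundancy choice is an assumption of the sketch, not the method used in the study.

```python
import itertools
import math

def mi(pxy):
    """I(X; Y) in bits from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

def marginal(joint3, src_idx):
    """Joint distribution of (selected sources, Y) from {(x1, x2, y): p}."""
    out = {}
    for (x1, x2, y), p in joint3.items():
        key = (tuple((x1, x2)[i] for i in src_idx), y)
        out[key] = out.get(key, 0.0) + p
    return out

# Toy joint: X1, X2 are independent uniform bits and Y = X1 XOR X2,
# the textbook example of purely synergistic information.
joint = {(x1, x2, x1 ^ x2): 0.25
         for x1, x2 in itertools.product((0, 1), repeat=2)}

i1 = mi(marginal(joint, (0,)))       # I(X1; Y)
i2 = mi(marginal(joint, (1,)))       # I(X2; Y)
i12 = mi(marginal(joint, (0, 1)))    # I(X1, X2; Y)

R = min(i1, i2)           # illustrative redundancy (not the paper's BATCH)
U1, U2 = i1 - R, i2 - R
S = i12 - R - U1 - U2     # XOR: all information is synergy (S = 1 bit)
```

Note how the four atoms always sum back to the total mutual information I(X1, X2; Y), which is what makes the decomposition an accounting of where a prediction's information comes from.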

Framework

Building on the BATCH estimator [2], we develop a model-agnostic framework to estimate PID quantities for LVLMs without changing model architectures or retraining.

Overview of the PID framework for LVLMs

(1) Given an image-text pair, we extract the image and text embeddings as two features, run a standard multimodal forward pass, and collect two unimodal predictions by masking one modality at a time. PID values are then estimated with the BATCH estimator. (2) Three analysis dimensions: (a) cross-model and cross-task comparison across 11 families and 26 models on 4 datasets, (b) layer-wise information dynamics via the logit lens, and (c) learning dynamics over two-stage training.
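Step (1) can be sketched as three forward passes per example: one multimodal, and two with one modality replaced by calibrated noise. Everything below is a minimal illustration under our own assumptions: the `forward(image_emb, text_emb)` interface, the noise calibration, and the toy linear head are stand-ins, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def noise_mask(emb, rng):
    """Replace an embedding with Gaussian noise of matched scale,
    approximating a calibrated-noise unimodal probe (calibration
    scheme here is an assumption)."""
    return rng.normal(0.0, emb.std(), size=emb.shape)

def collect_probes(forward, image_emb, text_emb, rng):
    """The three forward passes whose answer distributions feed the PID estimator."""
    return {
        "multimodal":    forward(image_emb, text_emb),
        "vision_only":   forward(image_emb, noise_mask(text_emb, rng)),
        "language_only": forward(noise_mask(image_emb, rng), text_emb),
    }

# Demo with a toy linear answer head standing in for a real LVLM.
rng = np.random.default_rng(0)
W_img, W_txt = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
toy_forward = lambda img, txt: softmax(img @ W_img + txt @ W_txt)

probes = collect_probes(toy_forward, rng.normal(size=4), rng.normal(size=4), rng)
```

The three resulting distributions over the answer space (multimodal, vision-only, language-only) are exactly the inputs a PID estimator needs to attribute the prediction to R, U1, U2, and S.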

Findings

Two Task Regimes of Information Use

Correct LVLM predictions do not always arise from the same underlying information strategy.

Stable Family-Level Information Strategies

Model families separate into two stable strategies, consistent across task regimes.

How Fusion Emerges

Layer-wise dynamics and learning trajectories reveal where and when multimodal fusion arises.

Layer-Wise: A Three-Phase Pattern

By applying the logit lens to project hidden states at each transformer layer, we uncover a consistent three-phase pattern:

1. Silent Phase (early layers): all PID atoms ≈ 0.
2. Language Build-up (mid-to-late layers): U2 rises.
3. Synergistic Fusion (final layers): S peaks.

Layer-wise PID dynamics (U2 and S)

Layer-wise dynamics of U2 and S for representative models. Top: MMBench (synergy-driven). Bottom: PMC-VQA (knowledge-driven).
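The logit-lens probe behind these per-layer curves projects each layer's hidden state through the model's final normalization and unembedding matrix to obtain an early next-token distribution. The sketch below uses random stand-ins for the hidden states, final LayerNorm, and unembedding matrix; a real pipeline would read these from the model's weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, ln_f, W_U):
    """Project each layer's hidden state through the final LayerNorm
    and the unembedding matrix, yielding one next-token distribution
    per layer."""
    return [softmax(ln_f(h) @ W_U) for h in hidden_states]

# Toy demo: 32 layers, hidden size 16, vocab 100 (random stand-ins).
rng = np.random.default_rng(0)
hs = [rng.normal(size=16) for _ in range(32)]
W_U = rng.normal(size=(16, 100))
ln_f = lambda h: (h - h.mean()) / h.std()  # simplified LayerNorm, no learned params

per_layer = logit_lens(hs, ln_f, W_U)
```

Repeating the PID estimation on these per-layer distributions is what reveals where U2 builds up and where S finally peaks.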

Training Dynamics: Visual Instruction Tuning Unlocks Fusion

We trace PID through LLaVA-1.5's two-stage training. Multimodal fusion S is learned primarily during visual instruction tuning rather than alignment pretraining:

  • Stage 1 (Alignment Pretraining): Both S and U2 remain low and stable. Only the projector is trained.
  • Stage 2 (Visual Instruction Tuning): Both components increase markedly. The 7B model shows stronger gains in S, while the 13B model shows comparatively stronger growth in U2.

Learning dynamics of LLaVA-1.5

Evolution of S and U2 during two-stage training of LLaVA-1.5 (7B, 13B) on MMBench and PMC-VQA.

Conclusion & Discussion

Higher accuracy does not necessarily imply stronger multimodal fusion. PID complements accuracy-only evaluation by revealing whether success comes from true cross-modal interaction or from language-side priors. We suggest using S and U2 as diagnostic signals for model analysis and scaling, and designing benchmarks that demand high synergy rather than answers recoverable from language priors alone.

This study has several limitations. PID estimation assumes a discrete target space, so our framework does not cover fully open-ended generation tasks. Additionally, our unimodal probes are approximate: masking a modality with calibrated noise stabilizes estimation, but U1, U2, and S are measured under this probe rather than under truly natural unimodal inputs.

Looking ahead, (U1, U2, S) can serve as diagnostic signals during scaling and instruction tuning, and potentially as auxiliary objectives to balance fusion and language priors. PID-based analyses can also guide the construction of benchmarks that explicitly require high synergy S or isolate language priors U2.

Qualitative Case Studies

Illustrating how different models process the same question

Case study: Geography question

Case 1: A geography question. Models with high synergy S correctly integrate map visuals with language context to identify "Massachusetts."

Case study: Spatial relation question

Case 2: A spatial relation question. The PID decomposition reveals how each model balances vision uniqueness, language priors, and synergistic fusion.

BibTeX

@inproceedings{xiu2026comprehensive,
  title={A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models},
  author={Xiu, Lixin and Luo, Xufang and Nakayama, Hideki},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

References

  1. Paul L. Williams and Randall D. Beer. Nonnegative decomposition of multivariate information. arXiv 2010.
  2. Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, et al. Quantifying & modeling multimodal interactions: An information decomposition framework. NeurIPS 2023.