A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

Lixin Xiu, Xufang Luo, Hideki Nakayama

1The University of Tokyo, 2Microsoft Research
*Corresponding authors: xufluo@microsoft.com, nakayama@ci.i.u-tokyo.ac.jp
ICLR 2026

TL;DR: LVLM predictions do not always reflect true multimodal fusion; reliance on language priors versus cross-modal synergy varies by task and model family.

Introduction

Large vision-language models (LVLMs) achieve remarkable success across multimodal tasks, yet their internal decision-making processes remain opaque. Accuracy alone cannot reveal whether a correct prediction arises from genuine multimodal fusion, from reliance on language priors, or from visual evidence in isolation. Existing interpretability efforts often adopt a "micro-scope" view, analyzing one modality in isolation, or introduce ad hoc metrics that lack firm theoretical grounding. We turn to partial information decomposition (PID) [1], a rigorous information-theoretic framework that decomposes the mutual information between two sources and a target into four non-negative atoms: redundancy R, vision uniqueness U1, language uniqueness U2, and synergy S.

Building on the BATCH estimator [2], we develop a model-agnostic pipeline that estimates PID quantities for LVLMs without changing architectures or retraining. We profile 26 models from 11 families (0.5B–90B parameters) on four benchmarks spanning general reasoning, hallucination evaluation, and domain-specific knowledge. Our analysis covers three complementary dimensions: breadth (cross-model & cross-task), depth (layer-wise information dynamics via logit lens), and time (learning dynamics across LLaVA-1.5's two-stage training).

Our study reveals two task regimes—synergy-driven vs. knowledge-driven—and two stable family-level strategies—fusion-centric vs. language-centric. We uncover a consistent three-phase pattern of information flow across layers and identify visual instruction tuning as the key stage where multimodal fusion is learned. Together, these findings provide a quantitative lens beyond accuracy-only evaluation, offering actionable insights for analyzing and designing the next generation of LVLMs.

PID Decomposition

PID Venn diagram

Classical mutual information I(X1, X2; Y) quantifies the total information that the two sources jointly carry about the target, but it cannot disentangle the complex interactions between multiple inputs such as vision and text. Partial Information Decomposition [1] splits this total into four non-negative atoms:

  • Redundancy R — Information shared by both vision and language
  • Vision Uniqueness U1 — Information only the image provides
  • Language Uniqueness U2 — Information only the text provides
  • Synergy S — Information that emerges only from combining both modalities

PID quantifies whether a prediction is driven by unimodal priors or true multimodal fusion, providing a principled lens for probing LVLM internals.
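To make the decomposition concrete, the sketch below computes the four atoms on a toy discrete distribution. It uses the simple minimum-mutual-information redundancy R = min(I(X1;Y), I(X2;Y)) purely for illustration; the paper's actual estimator is BATCH [2], so this redundancy choice is an assumption of the sketch, not the method used in the study.

```python
import itertools
import math

def mi(pxy):
    """I(X; Y) in bits from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

def marginal(joint3, src_idx):
    """Joint distribution of (selected sources, Y) from {(x1, x2, y): p}."""
    out = {}
    for (x1, x2, y), p in joint3.items():
        key = (tuple((x1, x2)[i] for i in src_idx), y)
        out[key] = out.get(key, 0.0) + p
    return out

# Toy joint: X1, X2 are independent uniform bits and Y = X1 XOR X2,
# the textbook example of purely synergistic information.
joint = {(x1, x2, x1 ^ x2): 0.25
         for x1, x2 in itertools.product((0, 1), repeat=2)}

i1 = mi(marginal(joint, (0,)))       # I(X1; Y)
i2 = mi(marginal(joint, (1,)))       # I(X2; Y)
i12 = mi(marginal(joint, (0, 1)))    # I(X1, X2; Y)

R = min(i1, i2)           # illustrative redundancy (not the paper's BATCH)
U1, U2 = i1 - R, i2 - R
S = i12 - R - U1 - U2     # XOR: all information is synergy (S = 1 bit)
```

Note how the four atoms always sum back to the total mutual information I(X1, X2; Y), which is what makes the decomposition an accounting of where a prediction's information comes from.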

Framework

Building on the BATCH estimator [2], we develop a model-agnostic framework to estimate PID quantities for LVLMs without changing model architectures or retraining.

Overview of the PID framework for LVLMs

(1) Given an image-text pair, we extract the image and text embeddings as two features, run a standard multimodal forward pass, and collect two unimodal predictions by masking one modality at a time. PID values are then estimated with the BATCH estimator. (2) Three analysis dimensions: (a) cross-model and cross-task comparison across 11 families and 26 models on 4 datasets, (b) layer-wise information dynamics via the logit lens, and (c) learning dynamics over two-stage training.
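Step (1) can be sketched as three forward passes per example: one multimodal, and two with one modality replaced by calibrated noise. Everything below is a minimal illustration under our own assumptions: the `forward(image_emb, text_emb)` interface, the noise calibration, and the toy linear head are stand-ins, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def noise_mask(emb, rng):
    """Replace an embedding with Gaussian noise of matched scale,
    approximating a calibrated-noise unimodal probe (calibration
    scheme here is an assumption)."""
    return rng.normal(0.0, emb.std(), size=emb.shape)

def collect_probes(forward, image_emb, text_emb, rng):
    """The three forward passes whose answer distributions feed the PID estimator."""
    return {
        "multimodal":    forward(image_emb, text_emb),
        "vision_only":   forward(image_emb, noise_mask(text_emb, rng)),
        "language_only": forward(noise_mask(image_emb, rng), text_emb),
    }

# Demo with a toy linear answer head standing in for a real LVLM.
rng = np.random.default_rng(0)
W_img, W_txt = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
toy_forward = lambda img, txt: softmax(img @ W_img + txt @ W_txt)

probes = collect_probes(toy_forward, rng.normal(size=4), rng.normal(size=4), rng)
```

The three resulting distributions over the answer space (multimodal, vision-only, language-only) are exactly the inputs a PID estimator needs to attribute the prediction to R, U1, U2, and S.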

Findings

Two Task Regimes of Information Use

Correct LVLM predictions do not always arise from the same underlying information strategy.

Stable Family-Level Information Strategies

Model families separate into two stable strategies, consistent across task regimes.

How Fusion Emerges

Layer-wise dynamics and learning trajectories reveal where and when multimodal fusion arises.

Layer-Wise: A Three-Phase Pattern

By applying the logit lens to project hidden states at each transformer layer, we uncover a consistent three-phase pattern:

1. Silent Phase (early layers): all PID atoms ≈ 0.
2. Language Build-up (mid-to-late layers): U2 rises.
3. Synergistic Fusion (final layers): S peaks.

Layer-wise PID dynamics (U2 and S)

Layer-wise dynamics of U2 and S for representative models. Top: MMBench (synergy-driven). Bottom: PMC-VQA (knowledge-driven).
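The logit-lens probe behind these per-layer curves projects each layer's hidden state through the model's final normalization and unembedding matrix to obtain an early next-token distribution. The sketch below uses random stand-ins for the hidden states, final LayerNorm, and unembedding matrix; a real pipeline would read these from the model's weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, ln_f, W_U):
    """Project each layer's hidden state through the final LayerNorm
    and the unembedding matrix, yielding one next-token distribution
    per layer."""
    return [softmax(ln_f(h) @ W_U) for h in hidden_states]

# Toy demo: 32 layers, hidden size 16, vocab 100 (random stand-ins).
rng = np.random.default_rng(0)
hs = [rng.normal(size=16) for _ in range(32)]
W_U = rng.normal(size=(16, 100))
ln_f = lambda h: (h - h.mean()) / h.std()  # simplified LayerNorm, no learned params

per_layer = logit_lens(hs, ln_f, W_U)
```

Repeating the PID estimation on these per-layer distributions is what reveals where U2 builds up and where S finally peaks.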

Training Dynamics: Visual Instruction Tuning Unlocks Fusion

We trace PID through LLaVA-1.5's two-stage training. Multimodal fusion S is learned primarily during visual instruction tuning rather than alignment pretraining:

  • Stage 1 (Alignment Pretraining): Both S and U2 remain low and stable. Only the projector is trained.
  • Stage 2 (Visual Instruction Tuning): Both components increase markedly. The 7B model shows stronger gains in S, while the 13B model shows comparatively stronger growth in U2.

Learning dynamics of LLaVA-1.5

Evolution of S and U2 during two-stage training of LLaVA-1.5 (7B, 13B) on MMBench and PMC-VQA.

Conclusion & Discussion

Higher accuracy does not necessarily imply stronger multimodal fusion. PID complements accuracy-only evaluation by revealing whether success comes from true cross-modal interaction or from language-side priors. We suggest using S and U2 as diagnostic signals for model analysis and scaling, and designing benchmarks that demand high synergy rather than answers recoverable from language priors alone.

This study has several limitations. PID estimation assumes a discrete target space, so our framework does not cover fully open-ended generation tasks. Additionally, our unimodal probes are approximate: masking a modality with calibrated noise stabilizes estimation, but U1, U2, and S are measured under this probe rather than under truly natural unimodal inputs.

Looking ahead, (U1, U2, S) can serve as diagnostic signals during scaling and instruction tuning, and potentially as auxiliary objectives to balance fusion and language priors. PID-based analyses can also guide the construction of benchmarks that explicitly require high synergy S or isolate language priors U2.

Qualitative Case Studies

Illustrating how different models process the same question

Case study: Geography question

Case 1: A geography question. Models with high synergy S correctly integrate map visuals with language context to identify "Massachusetts."

Case study: Spatial relation question

Case 2: A spatial relation question. The PID decomposition reveals how each model balances vision uniqueness, language priors, and synergistic fusion.

BibTeX

@inproceedings{xiu2026comprehensive,
  title={A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models},
  author={Xiu, Lixin and Luo, Xufang and Nakayama, Hideki},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

References

  1. Paul L. Williams and Randall D. Beer. Nonnegative decomposition of multivariate information. arXiv 2010.
  2. Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, et al. Quantifying & modeling multimodal interactions: An information decomposition framework. NeurIPS 2023.