Large vision-language models (LVLMs) achieve remarkable success across multimodal tasks, yet their internal decision-making processes remain opaque. Accuracy alone cannot reveal whether a correct prediction arises from genuine multimodal fusion, from reliance on language priors, or from visual evidence alone. Existing interpretability efforts often adopt a "micro-scope" focus, analyzing one modality in isolation, or introduce ad hoc metrics that lack firm theoretical grounding. We instead turn to partial information decomposition (PID) [1], a rigorous information-theoretic framework that decomposes the mutual information between two sources and a target into four non-negative atoms: redundancy R, unique vision information U1, unique language information U2, and synergy S.
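For reference, these atoms satisfy the standard consistency equations of Williams and Beer's PID [1]; writing X_V and X_L for the visual and language inputs and Y for the target (symbols ours, introduced for this statement):

    I(X_V, X_L; Y) = R + U_1 + U_2 + S,
    I(X_V; Y) = R + U_1,        I(X_L; Y) = R + U_2.

Consequently, once the redundancy R is estimated, the remaining atoms follow from the three classical mutual-information terms, e.g. S = I(X_V, X_L; Y) - I(X_V; Y) - I(X_L; Y) + R.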
Building on the BATCH estimator [2], we develop a model-agnostic pipeline that estimates PID quantities for LVLMs without modifying architectures or retraining. We profile 26 models from 11 families (0.5B–90B parameters) on four benchmarks spanning general reasoning, hallucination evaluation, and domain-specific knowledge. Our analysis covers three complementary dimensions: breadth (cross-model and cross-task comparison), depth (layer-wise information dynamics via the logit lens), and time (learning dynamics across LLaVA-1.5's two-stage training).
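To make the pipeline concrete, the sketch below shows one plausible shape for such an estimator, assuming per-example pooled features from each modality are first quantized into discrete codes and that the redundancy atom comes from a separate routine; the `redundancy_fn` placeholder stands in for the BATCH convex program [2], and all names are illustrative rather than the paper's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans


def discretize(feats, k=20, seed=0):
    """Quantize continuous features into k discrete codes, as in
    histogram-based PID estimators that first cluster each modality."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats)


def mutual_info(joint):
    """I(A; B) in bits from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())


def pid_atoms(xv, xl, y, redundancy_fn):
    """Decompose I(X_V, X_L; Y) into (R, U_1, U_2, S).

    xv, xl : discretized vision / language codes, int arrays of shape (n,)
    y      : discrete target labels, int array of shape (n,)
    redundancy_fn : estimator for the redundancy atom R applied to the
                    joint table, e.g. a wrapper around the BATCH program.
    """
    n = len(y)
    kv, kl, ky = xv.max() + 1, xl.max() + 1, y.max() + 1
    # Empirical joint distribution p(x_v, x_l, y).
    p = np.zeros((kv, kl, ky))
    np.add.at(p, (xv, xl, y), 1.0 / n)
    # Classical mutual-information terms.
    i_v = mutual_info(p.sum(axis=1))            # I(X_V; Y)
    i_l = mutual_info(p.sum(axis=0))            # I(X_L; Y)
    i_vl = mutual_info(p.reshape(kv * kl, ky))  # I(X_V, X_L; Y)
    r = redundancy_fn(p)                        # R, estimator-specific
    # The consistency equations pin down the remaining atoms.
    return r, i_v - r, i_l - r, i_vl - i_v - i_l + r
```

With discrete codes in hand, only the redundancy atom needs an estimator-specific routine; U_1, U_2, and S then follow from R and the three classical mutual-information terms via the consistency equations above.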
Our study reveals two task regimes (synergy-driven vs. knowledge-driven) and two stable family-level strategies (fusion-centric vs. language-centric). We uncover a consistent three-phase pattern of information flow across layers and identify visual instruction tuning as the key stage where multimodal fusion is learned. Together, these findings provide a quantitative lens beyond accuracy-only evaluation, offering actionable insights for analyzing and designing the next generation of LVLMs.