Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-source alternatives and gain moderate advantages from multimodal inputs over text-only inputs, whereas open-source models suffer significant performance degradation with multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.
The automatic understanding of long and complex documents with multimodal components remains a challenging yet crucial task. Despite recent advances in large vision-language models (LVLMs) and retrieval-augmented generation (RAG) techniques, existing benchmarks primarily focus on unimodal or short-context scenarios, lacking a comprehensive evaluation framework for long-context, multimodal document understanding.
To address this gap, we introduce MMDocRAG, a large-scale multimodal dataset comprising 4,055 expertly annotated question-answer pairs based on 222 lengthy documents spanning 10 diverse domains. Each document averages 67 pages and approximately 33,000 words, and contains rich multimodal structures including text, tables, charts, and images. The questions are carefully curated or newly created by expert annotators, and each is supported by cross-page, cross-modal evidence chains. MMDocRAG also integrates 48,618 text quotes and 32,071 image quotes, with a balanced mixture of gold and hard negative samples to promote fine-grained quote selection. Notably, the dataset supports interleaved multimodal answer generation, enabling models to seamlessly integrate textual and visual evidence in their outputs. This design offers a realistic and comprehensive resource for advancing multimodal document understanding in long-context settings.
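For concreteness, one can picture each benchmark example as a question paired with a candidate quote pool and an interleaved answer. The sketch below is only illustrative: the field and class names are hypothetical and do not reflect the dataset's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Quote:
    quote_id: str
    modality: str            # "text" or "image"
    content: str             # raw text, OCR text, or a VLM-generated description
    is_gold: bool            # gold evidence vs. hard negative

@dataclass
class MMDocRAGExample:
    question_id: str
    question: str
    question_type: str       # one of the eight predefined question types
    doc_name: str
    candidate_quotes: List[Quote] = field(default_factory=list)  # mixed gold and hard-negative quotes
    answer: str = ""         # interleaved answer; image quotes are referenced inline, e.g. "[img:quote_id]"
```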
The annotation pipeline of MMDocRAG includes four stages.
(1) Document Parsing and Evidence Selection: We process 313 lengthy documents from the MMDocIR corpus using MinerU, segmenting them into semantically coherent quotes based on layout detection. Each quote is stored in text, OCR-text, and VLM-text formats, forming a multimodal evidence pool.
(2) Multimodal Answer Generation: We refine 1,658 existing QA pairs and generate new QA pairs through VLM-based annotation, ensuring each question-answer pair is grounded in multimodal evidence and supports interleaved text-image generation. Questions span eight predefined types and are carefully revised for clarity, specificity, and multimodal richness.
(3) Gold Quotes Citation: To enhance factual grounding and answer traceability, we automatically insert citations of gold text quotes into multimodal answers using dense retrieval and LLM selection, followed by expert verification to ensure citation accuracy and coherence.
(4) Negative Quotes Augmentation: To increase retrieval difficulty, we augment candidate sets with hard negative quotes—irrelevant yet highly similar text and image segments—carefully mixed with gold quotes. Two candidate set versions (15 or 20 quotes) are constructed per question for fine-grained evaluation of quote selection capabilities.
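To make the negative-quote augmentation step concrete, the following is a minimal sketch of how a candidate set could be assembled, assuming the gold quotes and a similarity-ranked pool of non-gold quotes are already available. The function and parameter names are hypothetical and do not correspond to the benchmark's actual tooling.

```python
import random
from typing import Dict, List

def build_candidate_set(gold_quotes: List[Dict],
                        negative_pool: List[Dict],
                        set_size: int = 15,
                        seed: int = 0) -> List[Dict]:
    """Mix gold quotes with the hardest negatives to reach the target set size.

    `negative_pool` is assumed to be sorted by similarity to the question,
    so its leading entries are the hardest (most confusable) negatives.
    """
    num_negatives = max(0, set_size - len(gold_quotes))
    hard_negatives = negative_pool[:num_negatives]
    candidates = gold_quotes + hard_negatives
    random.Random(seed).shuffle(candidates)   # hide the positions of gold quotes
    return candidates

# Two candidate-set versions per question, mirroring the benchmark's 15- and 20-quote settings:
# candidates_15 = build_candidate_set(gold, negatives, set_size=15)
# candidates_20 = build_candidate_set(gold, negatives, set_size=20)
```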
We conduct a large-scale evaluation of 60 state-of-the-art models (LLMs and VLMs) on the MMDocRAG benchmark, under settings with 15 and 20 quotes as context for multimodal generation. Our evaluation covers quote selection accuracy, answer generation quality, input modality (pure text vs. interleaved multimodal), and text source (OCR vs. VLM-generated). GPT-4.1 achieves the highest F1 (70.2) and answer quality score (4.14), outperforming other proprietary and open-source models. Proprietary VLMs generally outperform their LLM counterparts when using multimodal inputs, but incur higher computational costs. In contrast, smaller VLMs underperform across all metrics. Notably, Qwen LLMs significantly outperform their VLM equivalents, suggesting weaknesses in visual understanding.
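The quote-selection F1 reported above can be read as a set-level F1 over predicted versus gold quote IDs; the snippet below is a minimal reference computation under that assumption, and the benchmark's official scoring script may differ in detail.

```python
from typing import Set

def quote_selection_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-level F1 between predicted and gold quote IDs."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model cites quotes {q1, q3, q7}; gold evidence is {q1, q3, q5}.
# precision = 2/3, recall = 2/3, F1 ≈ 0.667
print(round(quote_selection_f1({"q1", "q3", "q7"}, {"q1", "q3", "q5"}), 3))
```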
We showcase representative examples and analyses to highlight how models handle multimodal retrieval, reasoning, and generation.