Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs outperform open-source alternatives and gain moderate advantages from multimodal inputs over text-only inputs, whereas open-source models suffer significant performance degradation with multimodal inputs. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.
The automatic understanding of long and complex documents with multimodal components remains a challenging yet crucial task. Despite recent advances in large vision-language models (LVLMs) and retrieval-augmented generation (RAG) techniques, existing benchmarks primarily focus on unimodal or short-context scenarios, lacking a comprehensive evaluation framework for long-context, multimodal document understanding.
To address this gap, we introduce MMDocRAG, a large-scale multimodal dataset comprising 4,055 expertly annotated question-answer pairs based on 222 lengthy documents spanning 10 diverse domains. Each document averages 67 pages and approximately 33,000 words, and contains rich multimodal structures including text, tables, charts, and images. The questions are carefully curated or newly created by expert annotators, and each is supported by cross-page, cross-modal evidence chains. MMDocRAG also integrates 48,618 text quotes and 32,071 image quotes, with a balanced mixture of gold and hard negative samples to promote fine-grained quote selection. Notably, the dataset supports interleaved multimodal answer generation, enabling models to seamlessly integrate textual and visual evidence in their outputs. This design offers a realistic and comprehensive resource for advancing multimodal document understanding in long-context settings.
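For concreteness, one can picture each benchmark example as a question paired with a candidate quote pool and an interleaved answer. The sketch below is only illustrative: the field and class names are hypothetical and do not reflect the dataset's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Quote:
    quote_id: str
    modality: str            # "text" or "image"
    content: str             # raw text, OCR text, or a VLM-generated description
    is_gold: bool            # gold evidence vs. hard negative

@dataclass
class MMDocRAGExample:
    question_id: str
    question: str
    question_type: str       # one of the eight predefined question types
    doc_name: str
    candidate_quotes: List[Quote] = field(default_factory=list)  # mixed gold and hard-negative quotes
    answer: str = ""         # interleaved answer; image quotes are referenced inline, e.g. "[img:quote_id]"
```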
The annotation pipeline of MMDocRAG includes four stages.
(1) Document Parsing and Evidence Selection: We process 313 lengthy documents from the MMDocIR corpus using MinerU, segmenting them into semantically coherent quotes based on layout detection. Each quote is stored in text, OCR-text, and VLM-text formats, forming a multimodal evidence pool.
(2) Multimodal Answer Generation: We refine 1,658 existing QA pairs and generate new QA pairs through VLM-based annotation, ensuring each question-answer pair is grounded in multimodal evidence and supports interleaved text-image generation. Questions span eight predefined types and are carefully revised for clarity, specificity, and multimodal richness.
(3) Gold Quotes Citation: To enhance factual grounding and answer traceability, we automatically insert citations of gold text quotes into multimodal answers using dense retrieval and LLM selection, followed by expert verification to ensure citation accuracy and coherence.
(4) Negative Quotes Augmentation: To increase retrieval difficulty, we augment candidate sets with hard negative quotes—irrelevant yet highly similar text and image segments—carefully mixed with gold quotes. Two candidate set versions (15 or 20 quotes) are constructed per question for fine-grained evaluation of quote selection capabilities.
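To make the negative-quote augmentation step concrete, the following is a minimal sketch of how a candidate set could be assembled, assuming the gold quotes and a similarity-ranked pool of non-gold quotes are already available. The function and parameter names are hypothetical and do not correspond to the benchmark's actual tooling.

```python
import random
from typing import Dict, List

def build_candidate_set(gold_quotes: List[Dict],
                        negative_pool: List[Dict],
                        set_size: int = 15,
                        seed: int = 0) -> List[Dict]:
    """Mix gold quotes with the hardest negatives to reach the target set size.

    `negative_pool` is assumed to be sorted by similarity to the question,
    so its leading entries are the hardest (most confusable) negatives.
    """
    num_negatives = max(0, set_size - len(gold_quotes))
    hard_negatives = negative_pool[:num_negatives]
    candidates = gold_quotes + hard_negatives
    random.Random(seed).shuffle(candidates)   # hide the positions of gold quotes
    return candidates

# Two candidate-set versions per question, mirroring the benchmark's 15- and 20-quote settings:
# candidates_15 = build_candidate_set(gold, negatives, set_size=15)
# candidates_20 = build_candidate_set(gold, negatives, set_size=20)
```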
We conduct a large-scale evaluation of 60 state-of-the-art models (LLMs and VLMs) on the MMDocRAG benchmark, under settings with 15 and 20 quotes as context for multimodal generation. Our evaluation covers quote selection accuracy, answer generation quality, input modality (pure text vs. interleaved multimodal), and text source (OCR vs. VLM-generated). GPT-4.1 achieves the highest F1 (70.2) and answer quality score (4.14), outperforming other proprietary and open-source models. Proprietary VLMs generally outperform their LLM counterparts when using multimodal inputs, but incur higher computational costs. In contrast, smaller VLMs underperform across all metrics. Notably, Qwen LLMs significantly outperform their VLM equivalents, suggesting weaknesses in visual understanding.
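The quote-selection F1 reported above can be read as a set-level F1 over predicted versus gold quote IDs; the snippet below is a minimal reference computation under that assumption, and the benchmark's official scoring script may differ in detail.

```python
from typing import Set

def quote_selection_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-level F1 between predicted and gold quote IDs."""
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model cites quotes {q1, q3, q7}; gold evidence is {q1, q3, q5}.
# precision = 2/3, recall = 2/3, F1 ≈ 0.667
print(round(quote_selection_f1({"q1", "q3", "q7"}, {"q1", "q3", "q5"}), 3))
```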
We showcase representative examples and analyses to highlight how models handle multimodal retrieval, reasoning, and generation.