MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Huawei Noah's Ark Lab

MMDocIR Construction

The annotation pipeline of MMDocIR comprises four stages.

(1) Data Collection: We collect 364 long documents and 2,193 QA pairs from MMLongBench-Doc and DocBench, selecting datasets that provide accessible original documents, diverse domains (e.g., academic, legal, financial), and rich multi-modal content such as text, figures, tables, and layouts. The average document length exceeds 65 pages, ensuring the benchmark reflects real-world document complexity.

(2) Question Filtering & Adaptation: To align with retrieval objectives, we filter out questions that are unsuitable for document-based retrieval, including summarization-style queries, statistical aggregations, and questions requiring external knowledge. The remaining questions are revised so that they target concrete, retrievable content within the documents.

(3) Multi-level Annotation: We annotate each question with two types of evidence labels.
  • Page-level Labels: Annotators identify the exact pages that contain the information necessary to answer the question. In long multi-page documents, locating the correct evidence pages requires meticulous reading and verification.
  • Layout-level Labels: Using the MinerU parser, we extract bounding boxes for five types of layout elements (text, image, table, title, equation). Annotators then select the specific layouts that provide evidence for each question. Where MinerU fails to detect relevant content, manual annotation ensures precision. This yields 2,638 layout-level labels for 1,658 questions, capturing fine-grained evidence at the block level.

(4) Quality Assurance: We implement a robust cross-validation process across two independent annotator groups. A 400-question overlap set is used for mutual validation, and an additional 50% of the annotations undergo random cross-checking. The final annotation consistency reaches 95.2 F1 for page-level labels and 87.1 F1 for layout-level labels, ensuring both reliability and accuracy.
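To make the multi-level annotation concrete, the following is a minimal sketch of how a single annotated question could be represented. The field names (doc_id, evidence_pages, evidence_layouts, bbox) and the example values are illustrative assumptions, not the released MMDocIR schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Layout categories extracted by the MinerU parser, per the construction pipeline.
LAYOUT_TYPES = {"text", "image", "table", "title", "equation"}

@dataclass
class LayoutEvidence:
    """One layout block supporting the answer (field names are assumptions)."""
    page: int                                  # page index within the document
    layout_type: str                           # one of LAYOUT_TYPES
    bbox: Tuple[float, float, float, float]    # (x0, y0, x1, y1) bounding box

@dataclass
class AnnotatedQuestion:
    """A single question with page-level and layout-level evidence labels."""
    doc_id: str
    question: str
    evidence_pages: List[int] = field(default_factory=list)              # page-level labels
    evidence_layouts: List[LayoutEvidence] = field(default_factory=list) # layout-level labels

# Example record (contents are made up purely for illustration).
example = AnnotatedQuestion(
    doc_id="annual_report_2023",
    question="What revenue does Table 3 report for Q2?",
    evidence_pages=[41],
    evidence_layouts=[
        LayoutEvidence(page=41, layout_type="table", bbox=(72.0, 210.5, 523.4, 388.0))
    ],
)
```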

MMDocIR Evaluation

We conduct comprehensive evaluations of page-level and layout-level retrieval using 11 retrievers (6 text-based and 5 visual-based). Each retriever is adapted to both tasks through dual-modality inputs (e.g., OCR-text, VLM-text, or page screenshots); a minimal Recall@k evaluation sketch is shown after the list below. Our findings reveal:

  • Visual Superiority: Visual retrievers consistently outperform text-based retrievers in both page-level and layout-level tasks, confirming the value of preserving visual cues through page screenshots or layout images.
  • Impact of MMDocIR Training: Visual retrievers fine-tuned on the MMDocIR training set (e.g., our Col-Phi3 and DPR-Phi3) significantly outperform off-the-shelf models, validating the dataset’s utility for training robust multimodal retrievers.
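To make the page-level protocol concrete, here is a minimal sketch of Recall@k with a dense (single-vector) retriever. The helper names (score_pages, recall_at_k), the cosine-similarity scoring, and the random embeddings standing in for a real encoder are all assumptions for illustration, not the benchmark's official evaluation code.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant_pages: set, k: int) -> float:
    """Fraction of labeled evidence pages that appear among the top-k ranked pages.

    scores: (num_pages,) similarity between one query and every page of a document.
    relevant_pages: page indices annotated as evidence for this question.
    """
    top_k = np.argsort(-scores)[:k]                 # highest-scoring k pages
    hits = len(relevant_pages & set(top_k.tolist()))
    return hits / max(len(relevant_pages), 1)

def score_pages(query_emb: np.ndarray, page_embs: np.ndarray) -> np.ndarray:
    """Dense scoring: cosine similarity between the query and each page embedding.
    Page embeddings could come from OCR-text, VLM-text, or a screenshot encoder."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    return p @ q

# Toy usage with random vectors in place of a real encoder.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
page_embs = rng.normal(size=(65, 768))              # e.g., a 65-page document
scores = score_pages(query_emb, page_embs)
print(recall_at_k(scores, relevant_pages={3, 17}, k=5))
```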
Further fine-grained analysis reveals that:
  • VLM-text Advantage over OCR-text: Using GPT-4o-generated image descriptions (VLM-text) leads to better performance than standard OCR outputs, especially in capturing visual-semantic nuances missed by traditional text extractors.
  • Token-level vs. Dense Embeddings: Token-level retrievers (e.g., ColBERT, ColPali) show marginal gains in top-k recall, especially Recall@1, over dense embedding models, but incur significant storage overhead (a roughly 10x larger index size).
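The trade-off in the last point can be seen in how the two families score a query against a page. Below is a hedged sketch, not the exact ColBERT or ColPali implementation: single-vector dot-product scoring versus token-level late interaction (MaxSim), which stores one vector per page token and therefore inflates the index roughly in proportion to sequence length. The dimensions and token counts are illustrative assumptions chosen only to show why the overhead lands around an order of magnitude.

```python
import numpy as np

def dense_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """Dense retrieval: one vector per query and per page, a single dot product."""
    return float(query_vec @ page_vec)

def late_interaction_score(query_toks: np.ndarray, page_toks: np.ndarray) -> float:
    """MaxSim-style late interaction: each query token matches its best page token,
    and the per-token maxima are summed. Requires storing every page token vector."""
    sim = query_toks @ page_toks.T            # (num_query_tokens, num_page_tokens)
    return float(sim.max(axis=1).sum())       # best match per query token, then sum

# Toy usage with random token matrices; mean-pooled vectors stand in for dense embeddings.
rng = np.random.default_rng(0)
q_toks = rng.normal(size=(16, 128))
p_toks = rng.normal(size=(100, 128))
print(late_interaction_score(q_toks, p_toks))
print(dense_score(q_toks.mean(axis=0), p_toks.mean(axis=0)))

# Rough per-page index-size comparison (illustrative numbers only):
dense_dim, token_dim, tokens_per_page = 1024, 128, 100
ratio = (tokens_per_page * token_dim) / dense_dim
print(f"token-level index is ~{ratio:.0f}x larger per page")   # ~12x, on the order of ~10x
```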