Multi-modal document retrieval focuses on identifying and retrieving diverse content types such as figures, tables, charts, and layout structures from long documents. Despite its importance, existing benchmarks fall short in offering comprehensive and fine-grained evaluation. To address this, we introduce MMDocIR, a large-scale benchmark that supports both page-level and layout-level retrieval. The page-level task aims to identify the most relevant pages for a given query, while the layout-level task targets finer units like paragraphs, tables, equations, or figures. MMDocIR consists of 1,685 expert-annotated and over 173,000 bootstrapped question-answer pairs grounded in multimodal content, making it a valuable resource for both training and evaluation. Each QA pair is associated with document layouts, bounding boxes, and modality tags. The benchmark spans a diverse set of document types and supports retrieval across text, image, and mixed modalities. MMDocIR provides a foundation for advancing research in fine-grained, layout-aware, and multimodal document retrieval.
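To make the two retrieval granularities concrete, the sketch below shows what a single annotation record could look like, pairing one question with page-level and layout-level evidence. The field names (`doc_id`, `evidence_pages`, `evidence_layouts`, `bbox`, `question_modalities`) are illustrative assumptions, not the released schema.

```python
# Hypothetical MMDocIR-style annotation record (field names are assumptions,
# not the official release format). One QA pair carries both page-level and
# layout-level evidence labels plus modality tags.
example_record = {
    "doc_id": "financial_report_001",
    "question": "What was the total revenue reported in the Q3 table?",
    "evidence_pages": [12, 13],                # page-level labels
    "evidence_layouts": [                      # layout-level labels
        {"page": 12, "type": "table", "bbox": [88, 410, 520, 655]},
        {"page": 13, "type": "text",  "bbox": [70, 120, 530, 260]},
    ],
    "question_modalities": ["table", "text"],  # modalities needed for reasoning
}

# Page-level retrieval ranks the pages of the document and checks whether the
# evidence_pages appear near the top; layout-level retrieval instead ranks the
# parsed layout blocks (paragraphs, tables, figures, equations).
```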
MMDocIR is a large-scale benchmark designed to advance research in multi-modal document retrieval. It comprises both an evaluation set and a training set, with extensive coverage of document types, content modalities, and task complexity. The evaluation set contains 313 long documents with an average length of 65.1 pages, spanning ten distinct domains such as research reports, tutorials, legal documents, and financial statements. These documents offer a rich diversity of modalities: 60.4% text, 18.8% images, 16.7% tables, and 4.1% layout or meta content. A total of 1,685 expert-annotated questions are paired with 2,107 page-level and 2,638 layout-level labels. The questions require various modalities for reasoning: 44.7% involve text, 37.4% tables, 21.7% images, and 11.5% layout/meta elements (the categories overlap because some questions are cross-modal). Notably, the dataset introduces significant challenges, with 313 multi-page questions, 254 cross-modal cases, and 637 questions requiring reasoning over multiple layout elements.
The training set consists of 6,878 documents and 73,843 QA pairs, collected from seven diverse DocVQA-related datasets: MP-DocVQA, SlideVQA, TAT-DQA, ArXivQA, SciQAG, DUDE, and CUAD. These documents cover domains such as academic research, industry reports, legal contracts, and scientific publications, with average lengths ranging from 15 to 147 pages. This large and heterogeneous corpus enables robust model training and supports generalization across domains. Together, the MMDocIR dataset provides a comprehensive resource for benchmarking and developing multi-modal document retrieval systems that require fine-grained understanding across page, layout, and modality levels.
The annotation pipeline of MMDocIR consists of four stages.

(1) Data Collection: We collect 364 long documents and 2,193 QA pairs from MMLongBench-Doc and DocBench, selecting sources that provide accessible original documents, diverse domains (e.g., academic, legal, financial), and rich multi-modal content such as text, figures, tables, and layouts. The average document length exceeds 65 pages, ensuring the benchmark reflects real-world document complexity.

(2) Question Filtering & Adaptation: To align with retrieval objectives, we filter out questions that are unsuitable for document-based retrieval, including summarization-style queries, statistical aggregations, and questions requiring external knowledge. The remaining questions are revised so that they target concrete, retrievable content within the documents.

(3) Multi-level Annotation: We annotate each question with two types of evidence labels. Page-level Labels: annotators identify the exact pages that contain the information necessary to answer the question, which in long, multi-page documents requires meticulous reading and verification. Layout-level Labels: using the MinerU parser, we extract bounding boxes for five types of layout elements (text, image, table, title, equation), and annotators select the specific layout blocks that provide evidence for each question. Where MinerU fails to detect relevant content, the missing regions are annotated manually. This results in 2,638 layout-level labels for 1,685 questions, capturing fine-grained evidence at the block level.

(4) Quality Assurance: We implement a cross-validation process across two independent annotator groups. A 400-question overlap set is used for mutual validation, and an additional 50% of the annotations undergo random cross-checking. The final annotation consistency reaches 95.2 F1 for page-level labels and 87.1 F1 for layout-level labels, ensuring both reliability and accuracy.
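As a rough illustration of the quality-assurance step, the sketch below shows one way the cross-annotator consistency could be scored as per-question F1 over the 400-question overlap set, treating each group's page (or layout) IDs as a label set. This is an assumption about the scoring procedure, not the authors' actual script, and the helper names are hypothetical.

```python
# Minimal sketch of cross-annotator agreement scoring (an assumed reading of
# the reported page-/layout-level F1, not released code). Each annotator group
# supplies a set of evidence IDs (page numbers or layout-block IDs) per question.

def label_f1(labels_a: set, labels_b: set) -> float:
    """F1 between two label sets; symmetric when one group is taken as reference."""
    if not labels_a and not labels_b:
        return 1.0
    overlap = len(labels_a & labels_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(labels_a)
    recall = overlap / len(labels_b)
    return 2 * precision * recall / (precision + recall)

def mean_agreement(group_a: dict, group_b: dict) -> float:
    """Average per-question F1 over the questions annotated by both groups."""
    shared = group_a.keys() & group_b.keys()
    return sum(label_f1(group_a[q], group_b[q]) for q in shared) / len(shared)

# Toy example: page-level labels from two annotator groups for two questions.
group_a = {"q1": {3, 4}, "q2": {10}}
group_b = {"q1": {3, 4}, "q2": {10, 11}}
print(round(mean_agreement(group_a, group_b), 3))  # 0.833: mean of 1.0 and 0.667
```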
We conduct comprehensive evaluations on both page-level and layout-level retrieval using 11 retrievers (6 text-based and 5 visual-based). Each retriever is adapted to both tasks through different document representations (e.g., OCR-extracted text, VLM-generated text, or page screenshots). Our findings reveal: