MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Huawei Noah's Ark Lab

MMDocIR Construction

The annotation pipeline of MMDocIR comprises four stages.

(1) Data Collection: We collect 364 long documents and 2,193 QA pairs from MMLongBench-Doc and DocBench, selecting datasets that provide accessible original documents, diverse domains (e.g., academic, legal, financial), and rich multi-modal content such as text, figures, tables, and layouts. The average document length exceeds 65 pages, ensuring the benchmark reflects real-world document complexity.

(2) Question Filtering & Adaptation: To align with retrieval objectives, we filter out questions that are unsuitable for document-based retrieval, including summarization-style queries, statistical aggregations, and questions requiring external knowledge. The remaining questions are revised so that they target concrete, retrievable content within the documents.

(3) Multi-level Annotation: We annotate each question with two types of evidence labels.
  • Page-level Labels: Annotators identify the exact pages that contain the information necessary to answer the question. In long multi-page documents, locating the correct evidence pages requires meticulous reading and verification.
  • Layout-level Labels: Using the MinerU parser, we extract bounding boxes for five types of layout elements (text, image, table, title, equation). Annotators then select the specific layouts that provide evidence for each question. Where MinerU fails to detect relevant content, manual annotation ensures precision. This yields 2,638 layout-level labels for 1,658 questions, capturing fine-grained evidence at the block level.

(4) Quality Assurance: We implement a robust cross-validation process across two independent annotator groups. A 400-question overlap set is used for mutual validation, and an additional 50% of the annotations undergo random cross-checking. The final annotation consistency reaches 95.2 F1 for page-level labels and 87.1 F1 for layout-level labels, ensuring both reliability and accuracy.
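To make the multi-level annotation concrete, the following is a minimal sketch of how a single annotated question could be represented. The field names (doc_id, evidence_pages, evidence_layouts, bbox) and the example values are illustrative assumptions, not the released MMDocIR schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Layout categories extracted by the MinerU parser, per the construction pipeline.
LAYOUT_TYPES = {"text", "image", "table", "title", "equation"}

@dataclass
class LayoutEvidence:
    """One layout block supporting the answer (field names are assumptions)."""
    page: int                                  # page index within the document
    layout_type: str                           # one of LAYOUT_TYPES
    bbox: Tuple[float, float, float, float]    # (x0, y0, x1, y1) bounding box

@dataclass
class AnnotatedQuestion:
    """A single question with page-level and layout-level evidence labels."""
    doc_id: str
    question: str
    evidence_pages: List[int] = field(default_factory=list)              # page-level labels
    evidence_layouts: List[LayoutEvidence] = field(default_factory=list) # layout-level labels

# Example record (contents are made up purely for illustration).
example = AnnotatedQuestion(
    doc_id="annual_report_2023",
    question="What revenue does Table 3 report for Q2?",
    evidence_pages=[41],
    evidence_layouts=[
        LayoutEvidence(page=41, layout_type="table", bbox=(72.0, 210.5, 523.4, 388.0))
    ],
)
```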

MMDocIR Evaluation

We conduct comprehensive evaluations of page-level and layout-level retrieval using 11 retrievers (6 text-based and 5 visual-based). Each retriever is adapted to both tasks through dual-modality inputs (e.g., OCR-text, VLM-text, or page screenshots); a minimal Recall@k evaluation sketch is shown after the list below. Our findings reveal:

  • Visual Superiority: Visual retrievers consistently outperform text-based retrievers in both page-level and layout-level tasks, confirming the value of preserving visual cues through page screenshots or layout images.
  • Impact of MMDocIR Training: Visual retrievers fine-tuned on the MMDocIR training set (e.g., our Col-Phi3 and DPR-Phi3) significantly outperform off-the-shelf models, validating the dataset’s utility for training robust multimodal retrievers.
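To make the page-level protocol concrete, here is a minimal sketch of Recall@k with a dense (single-vector) retriever. The helper names (score_pages, recall_at_k), the cosine-similarity scoring, and the random embeddings standing in for a real encoder are all assumptions for illustration, not the benchmark's official evaluation code.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, relevant_pages: set, k: int) -> float:
    """Fraction of labeled evidence pages that appear among the top-k ranked pages.

    scores: (num_pages,) similarity between one query and every page of a document.
    relevant_pages: page indices annotated as evidence for this question.
    """
    top_k = np.argsort(-scores)[:k]                 # highest-scoring k pages
    hits = len(relevant_pages & set(top_k.tolist()))
    return hits / max(len(relevant_pages), 1)

def score_pages(query_emb: np.ndarray, page_embs: np.ndarray) -> np.ndarray:
    """Dense scoring: cosine similarity between the query and each page embedding.
    Page embeddings could come from OCR-text, VLM-text, or a screenshot encoder."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    return p @ q

# Toy usage with random vectors in place of a real encoder.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
page_embs = rng.normal(size=(65, 768))              # e.g., a 65-page document
scores = score_pages(query_emb, page_embs)
print(recall_at_k(scores, relevant_pages={3, 17}, k=5))
```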
Further fine-grained analysis reveals that:
  • VLM-text Advantage over OCR-text: Using GPT-4o-generated image descriptions (VLM-text) leads to better performance than standard OCR outputs, especially in capturing visual-semantic nuances missed by traditional text extractors.
  • Token-level vs. Dense Embeddings: Token-level retrievers (e.g., ColBERT, ColPali) show marginal gains in top-k recall, especially Recall@1, over dense embedding models, but incur significant storage overhead (a roughly 10x larger index size).
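The trade-off in the last point can be seen in how the two families score a query against a page. Below is a hedged sketch, not the exact ColBERT or ColPali implementation: single-vector dot-product scoring versus token-level late interaction (MaxSim), which stores one vector per page token and therefore inflates the index roughly in proportion to sequence length. The dimensions and token counts are illustrative assumptions chosen only to show why the overhead lands around an order of magnitude.

```python
import numpy as np

def dense_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """Dense retrieval: one vector per query and per page, a single dot product."""
    return float(query_vec @ page_vec)

def late_interaction_score(query_toks: np.ndarray, page_toks: np.ndarray) -> float:
    """MaxSim-style late interaction: each query token matches its best page token,
    and the per-token maxima are summed. Requires storing every page token vector."""
    sim = query_toks @ page_toks.T            # (num_query_tokens, num_page_tokens)
    return float(sim.max(axis=1).sum())       # best match per query token, then sum

# Toy usage with random token matrices; mean-pooled vectors stand in for dense embeddings.
rng = np.random.default_rng(0)
q_toks = rng.normal(size=(16, 128))
p_toks = rng.normal(size=(100, 128))
print(late_interaction_score(q_toks, p_toks))
print(dense_score(q_toks.mean(axis=0), p_toks.mean(axis=0)))

# Rough per-page index-size comparison (illustrative numbers only):
dense_dim, token_dim, tokens_per_page = 1024, 128, 100
ratio = (tokens_per_page * token_dim) / dense_dim
print(f"token-level index is ~{ratio:.0f}x larger per page")   # ~12x, on the order of ~10x
```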