Revisiting RVL-CDIP
Fixing label errors and train-test overlap in a document classification benchmark
Collaboration: Stefan Larson (Vanderbilt University)
Status: Under ACL Rolling Review for EACL 2026
Period: 2023 - Present
Problem Statement
RVL-CDIP is a widely used document classification benchmark with 400,000 images across 16 classes. However, the dataset suffers from:
- Significant label errors affecting model evaluation
- Train-test overlap compromising benchmark integrity
- Lack of a cleaned version for fair model comparison
Our Solution
1. Label Error Detection & Correction
Approach: CLIP-based outlier detection
- Generated CLIP embeddings for all documents
- Computed class centroids in embedding space
- Identified outliers based on distance from centroids
- Manual verification and relabeling of suspicious samples
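The centroid-distance idea above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the CLIP embeddings are already computed and stored as a NumPy array, and the function name flag_outliers, the cosine distance, and the z_thresh cutoff rule are all assumptions made for the example.

```python
import numpy as np

def flag_outliers(embeddings, labels, z_thresh=2.0):
    """Return sorted indices of samples unusually far from their class centroid.

    embeddings: (n_samples, dim) array of precomputed image embeddings.
    labels:     (n_samples,) array of class ids.
    z_thresh:   hypothetical cutoff, in standard deviations above the
                mean distance within each class.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise so that cosine distance is 1 - dot product.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    flagged = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        centroid = emb[idx].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        dist = 1.0 - emb[idx] @ centroid
        # Flag samples whose distance exceeds the per-class cutoff.
        cutoff = dist.mean() + z_thresh * dist.std()
        flagged.extend(idx[dist > cutoff].tolist())
    return sorted(flagged)
```

The flagged indices would only be candidates for the manual verification step; a large distance from the centroid does not by itself prove a label is wrong.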
2. Train-Test Duplicate Detection
Multi-stage pipeline:
Stage 1 - Feature Matching:
- Employed a pre-trained SuperGlue model for feature-based similarity assessment
- Matched keypoints between document pairs
- Identified potential duplicates based on match confidence
Stage 2 - Efficient Similarity Search:
- Applied MinHash and Locality Sensitive Hashing (LSH) for efficient grouping
- Scaled to 400K documents without exhaustive pairwise comparison
- Generated candidate duplicate groups
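To illustrate how MinHash with LSH banding avoids exhaustive pairwise comparison, here is a small standard-library sketch. It assumes each document is represented as a non-empty set of tokens (for example, OCR word shingles); that representation, the NUM_PERM and BANDS values, and the helper names are assumptions for the example, not details taken from the project.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # number of LSH bands; rows per band = NUM_PERM // BANDS

def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens):
    """MinHash signature: for each seed, the minimum hash over the token set."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(NUM_PERM))

def candidate_pairs(docs):
    """docs maps doc_id -> set of tokens. Returns a set of candidate id pairs."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, tokens in docs.items():
        sig = minhash_signature(tokens)
        # Documents sharing an identical band land in the same bucket.
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only documents that share at least one bucket are compared later, so the cost grows with the number of collisions rather than with the square of the dataset size. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.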
Stage 3 - Refined Clustering:
- Used DBSCAN on candidate groups for accurate deduplication
- Separated true duplicates from near-duplicates
- Maintained document diversity while removing overlaps
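The refinement step might look roughly like the sketch below, which runs scikit-learn's DBSCAN on the feature vectors of a single candidate group and keeps one representative per cluster. The cosine metric, the eps value, the keep-first policy, and the function name dedupe_group are illustrative assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dedupe_group(doc_ids, features, eps=0.05):
    """Cluster one candidate group and return (kept_ids, removed_ids).

    doc_ids:  list of document identifiers in the group.
    features: (n_docs, dim) feature vectors, in the same order as doc_ids.
    eps:      hypothetical cosine-distance radius for treating two
              documents as duplicates.
    """
    X = np.asarray(features, dtype=float)
    # min_samples=2 lets a pair of close documents form a cluster.
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(X)
    kept, removed, seen = [], [], set()
    for doc_id, label in zip(doc_ids, labels):
        if label == -1:
            kept.append(doc_id)      # noise point: no duplicate found
        elif label not in seen:
            seen.add(label)
            kept.append(doc_id)      # first member represents the cluster
        else:
            removed.append(doc_id)   # later members are treated as duplicates
    return kept, removed
```

A small eps keeps only very close matches together, which is one way to avoid merging documents that are merely similar in layout.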
3. Model Evaluation on Cleaned Data
Designed comprehensive evaluation scripts for state-of-the-art models:
- DiT (Document Image Transformer)
- Donut (OCR-free document understanding)
- LayoutLM (multimodal document model)
Compared performance on the original vs. cleaned dataset to quantify the impact of data quality issues.
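As a rough illustration of this comparison, the sketch below scores one set of predictions twice: once against the original test labels, and once against corrected labels with overlapping test documents removed. The dictionary-based inputs and the names accuracy and compare are hypothetical and do not reflect the project's evaluation scripts.

```python
def accuracy(preds, labels, ids=None):
    """Fraction of the given ids (default: all labelled ids) predicted correctly."""
    ids = list(labels) if ids is None else list(ids)
    correct = sum(preds[i] == labels[i] for i in ids)
    return correct / len(ids)

def compare(preds, original_labels, corrected_labels, overlap_ids):
    """Accuracy on the original test set vs. a cleaned version of it.

    corrected_labels: doc_id -> fixed label, for relabelled documents only.
    overlap_ids:      test doc_ids that also appear in the training set.
    """
    clean_ids = [i for i in original_labels if i not in overlap_ids]
    # Corrected labels override the originals where a fix exists.
    merged = {**original_labels, **corrected_labels}
    return {
        "original": accuracy(preds, original_labels),
        "cleaned": accuracy(preds, merged, clean_ids),
    }
```

Reporting both numbers side by side makes it visible how much of a model's score depends on mislabelled or duplicated test documents.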
Technical Stack
Similarity & Embeddings: OpenAI CLIP, SuperGlue
Hashing: MinHash, LSH
Clustering: DBSCAN
Models Evaluated: DiT, Donut, LayoutLM
Frameworks: PyTorch, Hugging Face Transformers, OpenCV
Impact
- Cleaned dataset available for the research community
- Quantified the impact of data quality on model performance
- Established good practices for large-scale document dataset curation