Automated PII De-Identification for Document Datasets

Large-scale pipeline for detecting and removing sensitive personal information from 400,000+ documents with synthetic data replacement

Collaboration: Stefan Larson, Vanderbilt University | 2023 - 2024
Published: EMNLP 2024 (Main Conference)
Scale: 400,000+ documents across 5 datasets

Overview

Developed a comprehensive automated pipeline for detecting and de-identifying sensitive Personally Identifiable Information (PII) in large-scale document image datasets derived from IIT-CDIP. The system combines OCR, regex pattern matching, Named Entity Recognition, and computer vision techniques to process over 400,000 documents while preserving dataset utility through intelligent pseudonymization.

Problem Statement

Public document understanding datasets like RVL-CDIP, DocVQA, FUNSD, and Tobacco3482/800 contain alarming amounts of sensitive PII including:

  • 2,428 US Social Security Numbers across all datasets
  • 13,101 birth dates and 6,284 birth places
  • 4,192 home addresses and 1,948 home phone numbers
  • Citizenship, marital status, religious affiliation, and health information

The presence of such data violates contemporary research ethics guidelines and exposes individuals to identity theft risks, necessitating immediate action.

System Architecture

1. Large-Scale OCR Processing

Amazon Textract Deployment:

  • Processed 417,738 document images across 5 datasets (RVL-CDIP, DocVQA, FUNSD, Tobacco3482, Tobacco800)
  • Extracted text with precise bounding box coordinates for spatial localization
  • Achieved robust performance on diverse document types: resumes, forms, invoices, letters

Infrastructure:

  • Batch processing pipeline for efficient large-scale operations
  • Error handling for degraded or low-quality scanned documents
  • Preservation of layout information for downstream processing
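A minimal sketch of the OCR step, assuming boto3 and synchronous per-image Textract calls (the batching and error handling described above are omitted; the region and helper name are illustrative):

```python
# Sketch only: synchronous per-image Textract OCR via boto3; the production
# pipeline batched requests and handled failures on degraded scans.
import boto3

textract = boto3.client("textract", region_name="us-east-1")  # region is illustrative

def ocr_words(image_path):
    """Return (text, bounding_box) pairs for every WORD block in one page image."""
    with open(image_path, "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})
    words = []
    for block in response["Blocks"]:
        if block["BlockType"] == "WORD":
            # BoundingBox values (Left/Top/Width/Height) are normalized to [0, 1].
            words.append((block["Text"], block["Geometry"]["BoundingBox"]))
    return words
```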

2. Multi-Modal PII Detection Pipeline

Stage 1 - Regex Pattern Matching:

  • Detected structured PII with high precision:
    • SSNs: XXX-XX-XXXX format variations
    • Phone numbers: Multiple format patterns
    • Dates: MM/DD/YYYY and variations
    • Email addresses and structured identifiers
  • 97% document-level recall on SSN detection (Presidio analyzer)
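A minimal sketch of this stage, combining a hand-rolled SSN regex with Presidio's analyzer; the pattern and entity list below are illustrative rather than the full production set:

```python
# Sketch only: one SSN regex plus Presidio's AnalyzerEngine; the real pipeline
# used a larger pattern set (phones, dates, e-mail, structured identifiers).
import re
from presidio_analyzer import AnalyzerEngine

SSN_RE = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")  # XXX-XX-XXXX and close variants

analyzer = AnalyzerEngine()  # loads a spaCy model under the hood

def find_structured_pii(text):
    hits = [("US_SSN_REGEX", m.start(), m.end()) for m in SSN_RE.finditer(text)]
    results = analyzer.analyze(
        text=text,
        entities=["US_SSN", "PHONE_NUMBER", "EMAIL_ADDRESS", "DATE_TIME"],
        language="en",
    )
    hits.extend((r.entity_type, r.start, r.end) for r in results)
    return hits
```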

Stage 2 - Named Entity Recognition:

  • Integrated spaCy and Transformer-based NER models
  • Detected unstructured PII:
    • Person names, locations, organizations
    • Citizenship and nationality references
    • Religious affiliations and demographic attributes
  • Enhanced with keyword-based contextual searches
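A minimal sketch of the NER stage with spaCy; the model name and the label-to-PII mapping are assumptions:

```python
# Sketch only: spaCy NER over OCR text, keeping labels that map to unstructured
# PII. Model name and label set are assumptions.
import spacy

nlp = spacy.load("en_core_web_trf")  # transformer pipeline; en_core_web_sm also works
PII_LABELS = {"PERSON", "GPE", "LOC", "NORP", "ORG"}  # NORP: nationalities, religious/political groups

def find_unstructured_pii(text):
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char)
            for ent in doc.ents if ent.label_ in PII_LABELS]
```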

Stage 3 - Manual Verification:

  • Team of 9 annotators with expert oversight
  • Fleiss’ Kappa: 0.918 inter-annotator agreement
  • Document-by-document inspection for smaller datasets
  • Strategic sampling for RVL-CDIP’s 400K documents

3. De-Identification Strategies

Approach 1 - Basic Redaction:

  • Black/white pixel overlays on sensitive regions
  • Simple but reduces document utility (8-12% classifier accuracy drop)
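A minimal sketch of this baseline, assuming Pillow and Textract-style normalized bounding boxes:

```python
# Sketch only: the naive baseline paints a solid box over a detected PII region.
from PIL import ImageDraw

def redact(image, bbox, fill="black"):
    """bbox: dict with normalized Left/Top/Width/Height, as returned by Textract."""
    draw = ImageDraw.Draw(image)
    w, h = image.size
    x0, y0 = bbox["Left"] * w, bbox["Top"] * h
    x1, y1 = x0 + bbox["Width"] * w, y0 + bbox["Height"] * h
    draw.rectangle([x0, y0, x1, y1], fill=fill)
    return image
```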

Approach 2 - Pseudonymization with Synthetic Data (Implemented):

Data Generation:

  • Faker library for realistic synthetic replacements:
    • Valid SSN format patterns (e.g., 123-45-6789)
    • Contextually appropriate names and addresses
    • Realistic dates maintaining temporal consistency
  • Gazetteer sampling for categorical data (nationalities, religions)
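A minimal sketch of the replacement-value generator, assuming the Faker library and a toy gazetteer (the type names and gazetteer contents are illustrative):

```python
# Sketch only: synthetic replacement values from Faker plus a toy gazetteer.
# Type names and gazetteer contents are illustrative.
import random
from faker import Faker

fake = Faker("en_US")
NATIONALITIES = ["American", "Canadian", "Mexican", "German", "Japanese"]  # sample gazetteer

def synthetic_value(pii_type):
    if pii_type == "US_SSN":
        return fake.ssn()                      # e.g. "123-45-6789"
    if pii_type == "PERSON":
        return fake.name()
    if pii_type == "ADDRESS":
        return fake.street_address()
    if pii_type == "DATE":
        return fake.date(pattern="%m/%d/%Y")   # keeps the common document date format
    if pii_type == "NATIONALITY":
        return random.choice(NATIONALITIES)
    return fake.word()
```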

Visual Integration:

  • Pillow library for text rendering with multiple font types
  • Augraphy for document-specific augmentations:
    • InkMottling for print degradation effects
    • LowInkRandomLines for scanning artifacts
  • Albumentations for rotation and noise augmentation
  • Font-aware rendering preserving original document aesthetics
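A minimal sketch of rendering a synthetic value back onto the page with Pillow; the font path and size heuristic are assumptions, and the Augraphy/Albumentations degradation passes are omitted:

```python
# Sketch only: render the synthetic value into the cleaned region with Pillow.
# Font path and size heuristic are assumptions; Augraphy/Albumentations noise
# is applied afterwards in the real pipeline.
from PIL import ImageDraw, ImageFont

def render_replacement(image, bbox, text, font_path="fonts/Courier.ttf"):
    draw = ImageDraw.Draw(image)
    w, h = image.size
    x0, y0 = bbox["Left"] * w, bbox["Top"] * h
    box_h = bbox["Height"] * h
    font = ImageFont.truetype(font_path, size=max(int(box_h * 0.9), 8))
    draw.text((x0, y0), text, font=font, fill="black")
    return image
```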

Intelligent Inpainting:

  • OpenCV inpainting for seamless background reconstruction
  • Telea and Navier-Stokes methods for texture preservation
  • Maintained visual coherence around redacted regions
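A minimal sketch of the inpainting step with OpenCV, assuming a uint8 BGR page image and a pixel-space box to reconstruct:

```python
# Sketch only: reconstruct the background under removed text with OpenCV
# inpainting before rendering the replacement value.
import cv2
import numpy as np

def inpaint_region(image_bgr, x0, y0, x1, y1, method=cv2.INPAINT_TELEA):
    """image_bgr: uint8 BGR array; (x0, y0, x1, y1): pixel box to reconstruct."""
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    mask[y0:y1, x0:x1] = 255                    # non-zero mask pixels get inpainted
    # cv2.INPAINT_NS (Navier-Stokes) is the alternative flag; 3 is the inpaint radius
    return cv2.inpaint(image_bgr, mask, 3, method)
```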

4. Font Detection Module (In Development)

  • Custom CNN-based font type classifier
  • Matches original document font characteristics
  • Ensures layout consistency:
    • Font size matching
    • Character spacing preservation
    • Text alignment maintenance
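A hypothetical sketch of such a classifier in PyTorch; the architecture, crop size, and number of font classes are assumptions, since the module is still in development:

```python
# Hypothetical sketch: a small PyTorch CNN over grayscale word crops.
# Architecture, crop size, and number of font classes are assumptions.
import torch.nn as nn

class FontClassifier(nn.Module):
    def __init__(self, num_fonts=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 16)),      # tolerate variable-width crops
        )
        self.classifier = nn.Linear(64 * 4 * 16, num_fonts)

    def forward(self, x):                       # x: (batch, 1, H, W) word crops
        return self.classifier(self.features(x).flatten(1))
```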

Key Findings

PII Distribution Analysis

Dataset     | Size    | SSNs  | Birth Dates | Home Addresses | Documents with PII
RVL-CDIP    | 400,000 | 2,342 | 12,800      | 3,908          | 15,956 (4.0%)
DocVQA      | 12,767  | 70    | 232         | 276            | 360 (2.8%)
Tobacco3482 | 3,482   | 9     | 62          | 6              | 66 (1.9%)
Tobacco800  | 1,290   | 5     | 7           | 1              | 12 (0.9%)
FUNSD       | 199     | 2     | 0           | 1              | 2 (1.0%)

Overall: 16,396 documents (3.9%) contain sensitive PII

Detection Performance

Automated Tool Comparison (Document-level recall on SSNs):

Tool       | Recall | Notes
Presidio   | 0.97   | Best performance
Google DLP | 0.93   | Strong performance
Azure      | 0.70   | Limited context window
Amazon     | 0.77   | Missing context keywords

Limitation: 7 out of 11 PII types not supported by existing tools, necessitating manual annotation.

Impact on Model Performance

Document Similarity (CLIP ViT-B/32 embeddings):

  • Black redactions: largest shift from the originals (mean cosine similarity 0.964)
  • White redactions: moderate shift (mean cosine similarity 0.989)
  • Pseudonymization: closest to the originals (mean cosine similarity 0.997)
  • Pseudonymized documents were 61.6% more similar to the originals than white-redacted ones (see the sketch below)
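A minimal sketch of the similarity check, assuming the Hugging Face transformers CLIP ViT-B/32 checkpoint:

```python
# Sketch only: cosine similarity between CLIP image embeddings of the original
# and de-identified page, assuming the standard ViT-B/32 checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def page_similarity(original_path, deidentified_path):
    images = [Image.open(p).convert("RGB") for p in (original_path, deidentified_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0).item()
```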

Classification Performance (DiT-base on RVL-CDIP):

  • Zero label flips: All 445 test documents maintained correct class predictions
  • Minimal confidence impact: Mean confidence difference 0.0024 (pseudonymization)
  • Preserved utility: <2% accuracy degradation vs. 8-12% for naive redaction

Key Insight: Pseudonymization with synthetic data preserves semantic meaning while removing privacy risks.
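A minimal sketch of the label-flip check; the checkpoint name assumes the publicly released DiT-base fine-tuned on RVL-CDIP, and the file paths are illustrative:

```python
# Sketch only: label-flip check with a DiT document classifier; the checkpoint
# name assumes the public RVL-CDIP fine-tune of DiT-base.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "microsoft/dit-base-finetuned-rvlcdip"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt)

def predict(path):
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1)[0]
    label_id = int(probs.argmax())
    return model.config.id2label[label_id], float(probs[label_id])

label_before, conf_before = predict("original.png")      # illustrative paths
label_after, conf_after = predict("deidentified.png")
flipped = label_before != label_after                     # a "label flip"
```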

Technical Implementation

OCR & Text Extraction: Amazon Textract
Pattern Matching: Python regex, Presidio
NER: spaCy, Hugging Face Transformers
Synthetic Data: Faker library, custom gazetteers
Image Processing: OpenCV (inpainting), Pillow (text rendering)
Augmentation: Augraphy (document-specific), Albumentations
Font Detection: Custom CNN (PyTorch)
Classification Evaluation: DiT-base (Microsoft), CLIP ViT-B/32

Broader Impact

Privacy Protection

  • Removed 2,428 SSNs from public datasets
  • Minimized exposure of 16,000+ documents with sensitive PII
  • Aligned datasets with NeurIPS and ACL ethics guidelines

Dataset Utility Preservation

  • Maintained document semantics for ML research
  • Enabled continued use for:
    • Document classification benchmarking
    • Information extraction tasks
    • Visual question answering
    • Multimodal model training

Research Contribution

  • First comprehensive audit of PII in IIT-CDIP-derived datasets
  • Novel pseudonymization approach for document images
  • Public release of de-identified datasets for research community
  • Best practices for large-scale document dataset curation

Ethical Considerations

  • Immediate action: Coordinated with Hugging Face to disable dataset previews
  • Responsible disclosure: Contacted dataset hosts before publication
  • Limited annotator exposure: Small team with expert oversight to minimize PII spread
  • Manual verification: Ensured high-quality detection despite the computational cost

Future Directions

  • Font style transfer methods for improved visual fidelity (e.g., GANs, style transfer networks)
  • Automated font detection at scale using deep learning
  • Extension to born-digital documents (PDFs, Word documents)
  • Privacy-preserving embeddings for training without exposing raw data
  • Differential privacy integration for formal privacy guarantees

Publication

This work was published at EMNLP 2024 (Main Conference):

(Larson et al., 2024)

References

  1. Stefan Larson, Nicole Cornehl Lima, Santiago Pedroza Diaz, and 9 more authors. De-Identification of Sensitive Personal Data in Datasets Derived from IIT-CDIP. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.