Blueprint: Automating Invoice Intake with OCR and Validation

This blueprint documents the architecture and implementation patterns for an invoice intake automation system. It's based on systems we've built for finance teams processing 500-5,000 invoices monthly.

The goal: reduce manual processing time by 70-80% while maintaining accuracy above 99% and providing full audit trails.

System Overview

The system has five main components:

  1. Intake Layer: Email monitoring, document extraction, deduplication
  2. Extraction Layer: OCR processing, field identification, confidence scoring
  3. Validation Layer: Data verification, three-way matching, exception flagging
  4. Processing Layer: GL coding, approval routing, workflow management
  5. Output Layer: ERP formatting, posting, audit logging

Component 1: Intake Layer

Email Monitoring

Most invoices arrive via email. The system monitors designated inboxes (e.g., invoices@company.com, ap@company.com).

Implementation details:

  • Use Microsoft Graph API or Gmail API for mailbox access
  • Poll every 2-5 minutes (or use webhooks if available)
  • Filter for messages likely to contain invoices (subject lines, sender domains)
  • Handle both attachments and embedded invoice content
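The filtering step above can be sketched as a simple predicate. The subject keywords, trusted sender domains, and attachment extensions below are illustrative assumptions, not a fixed list; tune them against your own vendor mix.

```python
import re

# Illustrative heuristics -- adjust keywords and domains for your vendors.
INVOICE_SUBJECT = re.compile(r"\b(invoice|bill|statement)\b", re.IGNORECASE)
TRUSTED_DOMAINS = {"billing.vendor-a.example", "accounts.vendor-b.example"}

def looks_like_invoice(subject: str, sender: str, attachment_names: list[str]) -> bool:
    """Return True if a message is worth sending to the extraction layer."""
    if INVOICE_SUBJECT.search(subject):
        return True
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in TRUSTED_DOMAINS:
        return True
    # PDF/image attachments from unknown senders still warrant a look
    return any(name.lower().endswith((".pdf", ".png", ".jpg", ".tiff"))
               for name in attachment_names)
```

A loose filter is deliberate here: false positives cost one extraction pass, while false negatives lose an invoice.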

Document Extraction

For each relevant email:

  • Extract all PDF and image attachments
  • Convert other formats (Word, Excel) to PDF
  • Handle multi-page documents (may contain multiple invoices)
  • Generate a unique document ID for tracking

Deduplication

Prevent processing the same invoice twice:

  • Hash document content (not just filename)
  • Check against recently processed documents
  • Flag potential duplicates for human review rather than auto-rejecting
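A minimal sketch of content-based deduplication: hash the raw bytes so a renamed re-send still collides, and flag rather than reject. In practice `recent_hashes` would be backed by a database table rather than an in-memory set.

```python
import hashlib

def content_fingerprint(document_bytes: bytes) -> str:
    """Hash the raw document bytes so renamed re-sends still collide."""
    return hashlib.sha256(document_bytes).hexdigest()

def check_duplicate(document_bytes: bytes, recent_hashes: set[str]) -> tuple[str, bool]:
    """Return (fingerprint, is_potential_duplicate).

    Duplicates are flagged, not auto-rejected -- a human decides whether
    the second copy is a true re-send or a legitimately similar document.
    """
    fingerprint = content_fingerprint(document_bytes)
    return fingerprint, fingerprint in recent_hashes
```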

Component 2: Extraction Layer

OCR Processing

We use a combination of:

  • Primary OCR: Azure Document Intelligence or Google Document AI
  • Fallback OCR: Tesseract for edge cases or cost optimization
  • LLM enhancement: GPT-4 for extracting structured data from raw OCR text

Field Extraction

Target fields with confidence scores:

| Field | Extraction Method | Confidence Factors |
| --- | --- | --- |
| Vendor Name | Header detection + vendor master matching | Match quality, position consistency |
| Invoice Number | Pattern matching (alphanumeric near "Invoice #") | Format match, uniqueness |
| Invoice Date | Date parsing near "Date" or "Invoice Date" labels | Date validity, recency |
| Due Date | Date parsing or calculation from payment terms | Consistency with invoice date |
| Total Amount | Currency pattern near "Total" label | Match with line item sum |
| Tax Amount | Currency pattern near tax labels | Percentage consistency |
| Line Items | Table detection and parsing | Row consistency, math verification |
| PO Reference | Pattern matching for PO numbers | PO existence in system |

Confidence Scoring

Each field gets a confidence score (0-100):

  • 90-100: High confidence, can auto-process
  • 70-89: Medium confidence, may need verification
  • Below 70: Low confidence, requires human review

Overall document confidence is the minimum of critical field confidences.
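The scoring bands and the minimum-of-critical-fields rule above translate directly into a routing function. The choice of which fields count as critical is an assumption to adapt per deployment.

```python
# Assumed critical fields -- adjust per deployment.
CRITICAL_FIELDS = {"vendor_name", "invoice_number", "invoice_date", "total_amount"}

def route_by_confidence(field_scores: dict[str, int]) -> str:
    """Document confidence is the minimum critical-field confidence.

    A missing critical field scores 0, forcing human review.
    """
    doc_confidence = min(field_scores.get(f, 0) for f in CRITICAL_FIELDS)
    if doc_confidence >= 90:
        return "auto_process"
    if doc_confidence >= 70:
        return "verify"
    return "human_review"
```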

Component 3: Validation Layer

Data Validation Rules

Vendor validation:

  • Vendor name matches a record in vendor master
  • If no match, flag for vendor creation or manual matching
  • Check vendor status (active, on hold, blocked)

Date validation:

  • Invoice date is not in the future
  • Invoice date is reasonably recent (e.g., invoices more than 90 days old are flagged)
  • Due date is after invoice date

Amount validation:

  • Line items sum to subtotal
  • Subtotal + tax = total
  • Tax rate is within expected range for vendor/region
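The amount rules can be checked with a few lines of arithmetic. `Decimal` avoids float drift; the one-cent tolerance is an assumption to absorb per-line rounding.

```python
from decimal import Decimal

def validate_amounts(line_items: list[Decimal], subtotal: Decimal,
                     tax: Decimal, total: Decimal,
                     tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return exception codes for any failed amount rule."""
    exceptions = []
    # Rule: line items sum to subtotal
    if abs(sum(line_items, Decimal("0")) - subtotal) > tolerance:
        exceptions.append("LINE_ITEMS_SUM_MISMATCH")
    # Rule: subtotal + tax = total
    if abs(subtotal + tax - total) > tolerance:
        exceptions.append("TOTAL_MISMATCH")
    return exceptions
```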

Three-Way Matching

For invoices with PO references:

  1. PO Match: Does the referenced PO exist and is it open?
  2. Goods Receipt Match: Has receiving confirmed delivery?
  3. Price Match: Does invoice amount match PO within tolerance?

Tolerance configuration:

{
  "exact_match_threshold": 10000,
  "tolerance_bands": [
    { "min": 0, "max": 1000, "tolerance_percent": 5 },
    { "min": 1000, "max": 10000, "tolerance_percent": 2 },
    { "min": 10000, "max": null, "tolerance_percent": 0 }
  ]
}
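Applying the configuration above to the price-match step looks like this: find the band the PO amount falls in, then compare within that band's tolerance.

```python
# Mirrors the tolerance_bands configuration above.
TOLERANCE_BANDS = [
    {"min": 0, "max": 1000, "tolerance_percent": 5},
    {"min": 1000, "max": 10000, "tolerance_percent": 2},
    {"min": 10000, "max": None, "tolerance_percent": 0},  # exact match required
]

def price_match(po_amount: float, invoice_amount: float) -> bool:
    """Check the invoice amount against the PO within the configured band."""
    for band in TOLERANCE_BANDS:
        upper = band["max"]
        if po_amount >= band["min"] and (upper is None or po_amount < upper):
            allowed = po_amount * band["tolerance_percent"] / 100
            return abs(invoice_amount - po_amount) <= allowed
    return False
```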

Exception Handling

Each validation failure generates an exception with:

  • Exception code (e.g., "VENDOR_NOT_FOUND", "PO_AMOUNT_MISMATCH")
  • Severity (blocking, warning, info)
  • Details (expected vs. actual values)
  • Suggested resolution
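The exception shape above maps naturally onto a small record type. The sample values in `po_amount_exception` are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationException:
    """One validation failure, shaped as described above."""
    code: str                 # e.g. "VENDOR_NOT_FOUND", "PO_AMOUNT_MISMATCH"
    severity: str             # "blocking" | "warning" | "info"
    details: dict = field(default_factory=dict)
    suggested_resolution: str = ""

def po_amount_exception(expected: float, actual: float) -> ValidationException:
    return ValidationException(
        code="PO_AMOUNT_MISMATCH",
        severity="blocking",
        details={"expected": expected, "actual": actual},
        suggested_resolution="Confirm pricing with vendor or adjust the PO",
    )
```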

Component 4: Processing Layer

GL Coding

Automatic GL code assignment using:

  1. Explicit rules: "Vendor X always uses account 6100"
  2. Learned patterns: Historical coding for similar invoices
  3. LLM classification: Analyzing description text

Coding output includes:

  • Suggested GL account
  • Cost center
  • Tax code
  • Confidence score
  • Reasoning (for audit trail)
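The three-tier cascade can be sketched as a fallthrough. The vendor name, account number, and confidence values here are hypothetical, and the final tier stubs out the LLM call.

```python
# Cascade: explicit rule -> learned pattern -> LLM fallback.
# Vendor names and account codes below are illustrative only.
EXPLICIT_RULES = {"Acme Office Supply": ("6100", 95)}

def suggest_gl_account(vendor: str, description: str,
                       history: dict[str, str]) -> dict:
    """Return a coding suggestion with confidence and audit reasoning."""
    if vendor in EXPLICIT_RULES:
        account, confidence = EXPLICIT_RULES[vendor]
        return {"account": account, "confidence": confidence,
                "reasoning": f"explicit rule for {vendor}"}
    if vendor in history:
        return {"account": history[vendor], "confidence": 80,
                "reasoning": "historical coding for this vendor"}
    # In production this tier would call an LLM classifier on the description.
    return {"account": None, "confidence": 0,
            "reasoning": "no rule or history; route to LLM / human"}
```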

Approval Workflow

Dynamic routing based on:

  • Amount thresholds
  • Cost center ownership
  • Expense category
  • Budget status

Workflow states:

PENDING_EXTRACTION → PENDING_VALIDATION → PENDING_CODING → 
PENDING_APPROVAL → APPROVED → PENDING_POST → POSTED

Each state transition is logged with timestamp, actor, and action taken.
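A transition table enforces the happy path above and produces the log entry in one place. Rejection and hold states are omitted here for brevity; a real workflow would add them.

```python
from datetime import datetime, timezone

# Legal transitions for the workflow states above; anything else raises.
TRANSITIONS = {
    "PENDING_EXTRACTION": {"PENDING_VALIDATION"},
    "PENDING_VALIDATION": {"PENDING_CODING"},
    "PENDING_CODING": {"PENDING_APPROVAL"},
    "PENDING_APPROVAL": {"APPROVED"},
    "APPROVED": {"PENDING_POST"},
    "PENDING_POST": {"POSTED"},
    "POSTED": set(),
}

def transition(state: str, new_state: str, actor: str, log: list[dict]) -> str:
    """Move to new_state, logging timestamp, actor, and action taken."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": f"{state} -> {new_state}",
    })
    return new_state
```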

Component 5: Output Layer

ERP Formatting

Transform validated invoice data into ERP-specific format:

  • Field mapping to ERP schema
  • Code translation (internal codes to ERP codes)
  • Format validation (string lengths, required fields)

Posting

Integration options:

  • API posting: Direct ERP API calls
  • File export: Generate import files for batch processing
  • RPA: UI automation for legacy systems

All postings include:

  • Idempotency check (prevent duplicate postings)
  • Confirmation receipt
  • Error handling with retry logic
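The idempotency check can be as simple as keying on the document ID before the ERP call. `erp_post` stands in for whichever integration option is used (API, file export, or RPA) and is assumed to return a confirmation dict; retry logic would wrap that call.

```python
def post_invoice(document_id: str, payload: dict,
                 posted_ids: set[str], erp_post) -> dict:
    """Post once per document ID; skip silently on a repeat attempt.

    In production posted_ids would be a database check, not a set.
    """
    if document_id in posted_ids:
        return {"status": "skipped", "reason": "already_posted"}
    confirmation = erp_post(payload)  # retry/backoff would wrap this call
    posted_ids.add(document_id)
    return {"status": "posted", "confirmation": confirmation}
```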

Audit Logging

Every action is logged:

{
  "document_id": "INV-2024-00123",
  "timestamp": "2024-01-20T10:30:00Z",
  "action": "FIELD_EXTRACTED",
  "field": "total_amount",
  "value": "1234.56",
  "confidence": 95,
  "method": "ocr_primary",
  "actor": "system"
}

Performance Metrics

Target metrics for a mature implementation:

| Metric | Target | Acceptable |
| --- | --- | --- |
| Straight-through processing rate | 75% | 60% |
| Extraction accuracy | 98% | 95% |
| Average processing time | < 2 min | < 5 min |
| Exception resolution time | < 4 hours | < 24 hours |
| Post-posting error rate | < 0.5% | < 1% |

Technology Stack

Recommended stack:

  • Email integration: Microsoft Graph API / Gmail API
  • OCR: Azure Document Intelligence
  • LLM: GPT-4 or Claude for extraction enhancement
  • Database: PostgreSQL with JSONB for flexible schema
  • Queue: Redis or RabbitMQ for async processing
  • API: Next.js API routes or FastAPI
  • Admin UI: Next.js with React

Implementation Timeline

Typical implementation phases:

  1. Week 1-2: Blueprint and setup (requirements, architecture, dev environment)
  2. Week 3-4: Intake and extraction (email monitoring, OCR integration)
  3. Week 5-6: Validation and processing (matching rules, GL coding)
  4. Week 7-8: ERP integration and testing (output formatting, posting)
  5. Week 9-10: Pilot and refinement (production testing, threshold tuning)

Timeline varies based on ERP complexity and volume requirements.

Ready to build your invoice automation system? Get a free development plan →