This blueprint documents the architecture and implementation patterns for an invoice intake automation system. It's based on systems we've built for finance teams processing 500-5,000 invoices monthly.
The goal: reduce manual processing time by 70-80% while maintaining accuracy above 99% and providing full audit trails.
System Overview
The system has five main components:
- Intake Layer: Email monitoring, document extraction, deduplication
- Extraction Layer: OCR processing, field identification, confidence scoring
- Validation Layer: Data verification, three-way matching, exception flagging
- Processing Layer: GL coding, approval routing, workflow management
- Output Layer: ERP formatting, posting, audit logging
Component 1: Intake Layer
Email Monitoring
Most invoices arrive via email. The system monitors designated inboxes (e.g., invoices@company.com, ap@company.com).
Implementation details:
- Use Microsoft Graph API or Gmail API for mailbox access
- Poll every 2-5 minutes (or use webhooks if available)
- Filter for messages likely to contain invoices (subject lines, sender domains)
- Handle both attachments and embedded invoice content
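The filtering step can be sketched as a small predicate applied to each fetched message. This is a minimal illustration, not the production filter: the `Message` shape, the subject keywords, and the vendor domains are all assumptions made for the example, and fetching messages through Graph or Gmail is left out.

```python
import re
from dataclasses import dataclass, field

# Hypothetical filter settings; tune these to your own inbox traffic.
SUBJECT_PATTERN = re.compile(r"\b(invoice|inv|bill|statement)\b", re.IGNORECASE)
KNOWN_VENDOR_DOMAINS = {"acme-supplies.example", "globex.example"}
DOCUMENT_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff")


@dataclass
class Message:
    """Simplified stand-in for a message returned by the mailbox API."""
    sender: str
    subject: str
    attachment_names: list = field(default_factory=list)


def looks_like_invoice(msg: Message) -> bool:
    """Keep a message if the subject mentions an invoice, or if a known
    vendor sent a PDF/image attachment."""
    domain = msg.sender.rsplit("@", 1)[-1].lower()
    has_document = any(
        name.lower().endswith(DOCUMENT_EXTENSIONS) for name in msg.attachment_names
    )
    subject_hit = bool(SUBJECT_PATTERN.search(msg.subject))
    # A subject hit alone is enough, so invoices embedded in the body survive.
    return subject_hit or (has_document and domain in KNOWN_VENDOR_DOMAINS)
```

Erring toward keeping borderline messages is usually the safer choice here, since later layers can still reject a non-invoice but cannot recover a message that was filtered out.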
Document Extraction
For each relevant email:
- Extract all PDF and image attachments
- Convert other formats (Word, Excel) to PDF
- Handle multi-page documents (may contain multiple invoices)
- Generate a unique document ID for tracking
Deduplication
Prevent processing the same invoice twice:
- Hash document content (not just filename)
- Check against recently processed documents
- Flag potential duplicates for human review rather than auto-rejecting
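A minimal sketch of content-based deduplication, assuming an in-memory store and a 90-day lookback window (both are illustrative choices; a real system would keep hashes in the database). The `DuplicateChecker` class and its method names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

LOOKBACK = timedelta(days=90)  # assumed definition of "recently processed"


class DuplicateChecker:
    def __init__(self):
        # Maps content hash -> (document ID, time first seen).
        self._seen = {}

    def check(self, document_id: str, content: bytes,
              now: Optional[datetime] = None) -> Optional[str]:
        """Return the ID of an earlier document with identical bytes, else None.

        A non-None result means "flag for human review", not "reject".
        """
        now = now or datetime.now(timezone.utc)
        digest = hashlib.sha256(content).hexdigest()
        earlier = self._seen.get(digest)
        if earlier is not None and now - earlier[1] <= LOOKBACK:
            return earlier[0]
        self._seen[digest] = (document_id, now)
        return None
```

Hashing raw bytes only catches exact re-sends. A rescanned or re-exported copy of the same invoice produces different bytes, so a second check on extracted fields (for example vendor plus invoice number) is a sensible complement.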
Component 2: Extraction Layer
OCR Processing
We use a combination of:
- Primary OCR: Azure Document Intelligence or Google Document AI
- Fallback OCR: Tesseract for edge cases or cost optimization
- LLM enhancement: GPT-4 or Claude for extracting structured data from raw OCR text
Field Extraction
Target fields with confidence scores:
| Field | Extraction Method | Confidence Factors |
|---|---|---|
| Vendor Name | Header detection + vendor master matching | Match quality, position consistency |
| Invoice Number | Pattern matching (alphanumeric near "Invoice #") | Format match, uniqueness |
| Invoice Date | Date parsing near "Date" or "Invoice Date" labels | Date validity, recency |
| Due Date | Date parsing or calculation from payment terms | Consistency with invoice date |
| Total Amount | Currency pattern near "Total" label | Match with line item sum |
| Tax Amount | Currency pattern near tax labels | Percentage consistency |
| Line Items | Table detection and parsing | Row consistency, math verification |
| PO Reference | Pattern matching for PO numbers | PO existence in system |
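As one concrete instance of the pattern-matching rows above, here is a rough sketch of invoice number extraction from raw OCR text. The regular expression is an assumption about common label formats ("Invoice #", "Invoice No.", "Invoice Number"); real vendor layouts will need more patterns.

```python
import re
from typing import Optional

# Label, optional colon, then an alphanumeric token containing at least one digit.
INVOICE_NUMBER = re.compile(
    r"invoice\s*(?:#|no\b\.?|number\b)\s*:?\s*"
    r"((?=[A-Z0-9/-]*\d)[A-Z0-9][A-Z0-9/-]*)",
    re.IGNORECASE,
)


def extract_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the first token that follows an invoice-number label, if any."""
    match = INVOICE_NUMBER.search(ocr_text)
    return match.group(1) if match else None
```

For example, `extract_invoice_number("Invoice #: INV-2024-00123")` yields `"INV-2024-00123"`, while a line such as `"Invoice Date: 2024-01-20"` yields `None` because no number label is present.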
Confidence Scoring
Each field gets a confidence score (0-100):
- 90-100: High confidence, can auto-process
- 70-89: Medium confidence, may need verification
- Below 70: Low confidence, requires human review
Overall document confidence is the minimum of critical field confidences.
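The banding and the minimum rule can be expressed in a few lines. Which fields count as critical is an assumption here; the set below is only an example.

```python
# Assumed set of critical fields; adjust to your own requirements.
CRITICAL_FIELDS = ("vendor_name", "invoice_number", "invoice_date", "total_amount")


def band(score: int) -> str:
    """Map a 0-100 score to the bands described above."""
    if score >= 90:
        return "high"
    if score >= 70:
        return "medium"
    return "low"


def document_confidence(field_scores: dict) -> int:
    """Overall confidence is the weakest critical field.

    A critical field that was not extracted at all counts as 0.
    """
    return min(field_scores.get(name, 0) for name in CRITICAL_FIELDS)
```

Taking the minimum rather than an average means a single doubtful critical field, such as the total, is enough to pull the whole document out of auto-processing.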
Component 3: Validation Layer
Data Validation Rules
Vendor validation:
- Vendor name matches a record in vendor master
- If no match, flag for vendor creation or manual matching
- Check vendor status (active, on hold, blocked)
Date validation:
- Invoice date is not in the future
- Invoice date is not too far in the past (e.g., flag invoices more than 90 days old)
- Due date is on or after invoice date (the two are equal for due-on-receipt terms)
Amount validation:
- Line items sum to subtotal
- Subtotal + tax = total
- Tax rate is within expected range for vendor/region
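The date and amount rules might look like the following sketch. The exception codes, the one-cent rounding allowance, and the 90-day limit as a hard constant are assumptions for illustration. Amounts use `Decimal` to avoid floating-point rounding errors.

```python
from datetime import date, timedelta
from decimal import Decimal

MAX_AGE = timedelta(days=90)
ROUNDING_SLACK = Decimal("0.01")  # assumed allowance for per-line rounding


def validate_dates(invoice_date: date, due_date: date, today: date) -> list:
    """Return a list of (hypothetical) exception codes for date problems."""
    issues = []
    if invoice_date > today:
        issues.append("INVOICE_DATE_IN_FUTURE")
    elif today - invoice_date > MAX_AGE:
        issues.append("INVOICE_DATE_STALE")
    # Assumption: a due date equal to the invoice date (due on receipt) is allowed.
    if due_date < invoice_date:
        issues.append("DUE_DATE_BEFORE_INVOICE_DATE")
    return issues


def validate_amounts(line_totals: list, subtotal: Decimal,
                     tax: Decimal, total: Decimal) -> list:
    """Check that line items sum to the subtotal and subtotal + tax = total."""
    issues = []
    if abs(sum(line_totals, Decimal("0")) - subtotal) > ROUNDING_SLACK:
        issues.append("LINE_ITEMS_SUBTOTAL_MISMATCH")
    if abs(subtotal + tax - total) > ROUNDING_SLACK:
        issues.append("TOTAL_MISMATCH")
    return issues
```

Returning a list of codes instead of raising on the first failure lets a reviewer see every problem with an invoice at once.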
Three-Way Matching
For invoices with PO references:
- PO Match: Does the referenced PO exist and is it open?
- Goods Receipt Match: Has receiving confirmed delivery?
- Price Match: Does invoice amount match PO within tolerance?
Tolerance configuration:
{
  "exact_match_threshold": 10000,
  "tolerance_bands": [
    { "min": 0, "max": 1000, "tolerance_percent": 5 },
    { "min": 1000, "max": 10000, "tolerance_percent": 2 },
    { "min": 10000, "max": null, "tolerance_percent": 0 }
  ]
}
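A sketch of how the price match could apply these bands. Two details are assumptions, because the configuration does not state them: the band is chosen by the PO amount (not the invoice amount), and each band includes its `min` and excludes its `max`, so a PO of exactly 1,000 falls in the 2% band.

```python
from decimal import Decimal

TOLERANCE_BANDS = [
    {"min": 0, "max": 1000, "tolerance_percent": 5},
    {"min": 1000, "max": 10000, "tolerance_percent": 2},
    {"min": 10000, "max": None, "tolerance_percent": 0},
]


def tolerance_percent(po_amount: Decimal) -> Decimal:
    """Look up the band for a PO amount (min inclusive, max exclusive)."""
    for b in TOLERANCE_BANDS:
        if po_amount >= b["min"] and (b["max"] is None or po_amount < b["max"]):
            return Decimal(b["tolerance_percent"])
    raise ValueError(f"no tolerance band covers {po_amount}")


def price_matches(invoice_amount: Decimal, po_amount: Decimal) -> bool:
    """True if the invoice is within the allowed deviation from the PO."""
    allowed = po_amount * tolerance_percent(po_amount) / Decimal(100)
    return abs(invoice_amount - po_amount) <= allowed
```

With these bands, a 520.00 invoice against a 500.00 PO passes (the 5% band allows 25.00), while any deviation on a PO of 10,000 or more fails, which matches the intent of `exact_match_threshold`.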
Exception Handling
Each validation failure generates an exception with:
- Exception code (e.g., "VENDOR_NOT_FOUND", "PO_AMOUNT_MISMATCH")
- Severity (blocking, warning, info)
- Details (expected vs. actual values)
- Suggested resolution
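One possible shape for the exception record, sketched as a dataclass. The class and field names are illustrative assumptions. Only the four attributes listed above come from the design.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    BLOCKING = "blocking"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class ValidationException:
    code: str                  # e.g. "VENDOR_NOT_FOUND"
    severity: Severity
    expected: str              # details: what the rule expected
    actual: str                # details: what the invoice contained
    suggested_resolution: str


def blocks_processing(exceptions: list) -> bool:
    """An invoice is held back only if at least one exception is blocking."""
    return any(e.severity is Severity.BLOCKING for e in exceptions)
```

Keeping expected and actual values as separate fields makes it straightforward to render a side-by-side comparison in a review queue.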
Component 4: Processing Layer
GL Coding
Automatic GL code assignment using:
- Explicit rules: "Vendor X always uses account 6100"
- Learned patterns: Historical coding for similar invoices
- LLM classification: Analyzing description text
Coding output includes:
- Suggested GL account
- Cost center
- Tax code
- Confidence score
- Reasoning (for audit trail)
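The first two coding strategies can be chained as in this sketch: explicit rules win, then the most frequent historical account is suggested. The vendor ID, the confidence values (100 for a rule, share of history otherwise), and the function name are assumptions. The LLM step is not shown and is represented by returning `None`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule table: vendor ID -> GL account.
VENDOR_RULES = {"V-1001": "6100"}


@dataclass
class CodingSuggestion:
    gl_account: str
    confidence: int
    reasoning: str  # stored for the audit trail


def suggest_gl_account(vendor_id: str, history: list) -> Optional[CodingSuggestion]:
    """Try explicit rules first, then the vendor's historical coding."""
    if vendor_id in VENDOR_RULES:
        return CodingSuggestion(
            VENDOR_RULES[vendor_id], 100, f"explicit rule for vendor {vendor_id}"
        )
    if history:
        account, count = Counter(history).most_common(1)[0]
        share = round(100 * count / len(history))
        return CodingSuggestion(
            account, share, f"{count} of {len(history)} past invoices used {account}"
        )
    return None  # fall through to LLM classification or manual coding
```

Ordering the strategies from most to least deterministic keeps the reasoning string simple and makes it clear in the audit trail why an account was chosen.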
Approval Workflow
Dynamic routing based on:
- Amount thresholds
- Cost center ownership
- Expense category
- Budget status
Workflow states:
PENDING_EXTRACTION → PENDING_VALIDATION → PENDING_CODING →
PENDING_APPROVAL → APPROVED → PENDING_POST → POSTED
Each state transition is logged with timestamp, actor, and action taken.
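A minimal sketch of the state sequence with transition logging. It models only the forward path listed above. Rejections, exception states, and persistence are left out, and the `InvoiceWorkflow` class is a hypothetical name.

```python
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    PENDING_EXTRACTION = "PENDING_EXTRACTION"
    PENDING_VALIDATION = "PENDING_VALIDATION"
    PENDING_CODING = "PENDING_CODING"
    PENDING_APPROVAL = "PENDING_APPROVAL"
    APPROVED = "APPROVED"
    PENDING_POST = "PENDING_POST"
    POSTED = "POSTED"


ORDER = list(State)  # Enum iteration follows definition order


class InvoiceWorkflow:
    def __init__(self, document_id: str):
        self.document_id = document_id
        self.state = State.PENDING_EXTRACTION
        self.log = []

    def advance(self, actor: str, action: str) -> State:
        """Move to the next state and record who did what, and when."""
        index = ORDER.index(self.state)
        if index == len(ORDER) - 1:
            raise ValueError(f"{self.document_id} is already POSTED")
        new_state = ORDER[index + 1]
        self.log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "from": self.state.value,
            "to": new_state.value,
        })
        self.state = new_state
        return new_state
```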
Component 5: Output Layer
ERP Formatting
Transform validated invoice data into ERP-specific format:
- Field mapping to ERP schema
- Code translation (internal codes to ERP codes)
- Format validation (string lengths, required fields)
Posting
Integration options:
- API posting: Direct ERP API calls
- File export: Generate import files for batch processing
- RPA: UI automation for legacy systems
All postings include:
- Idempotency check (prevent duplicate postings)
- Confirmation receipt
- Error handling with retry logic
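The three posting safeguards can be combined as in the sketch below. The `send` callable stands in for whichever integration is used, `posted` stands in for a persistent store of confirmations, and `TransientPostingError` is a hypothetical exception type. None of these names come from a real ERP SDK.

```python
import hashlib
import time


class TransientPostingError(Exception):
    """Hypothetical error for failures worth retrying (timeouts, 5xx)."""


def idempotency_key(vendor_id: str, invoice_number: str, total: str) -> str:
    """Derive a stable key from the fields that identify an invoice."""
    raw = f"{vendor_id}|{invoice_number}|{total}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def post_once(invoice: dict, send, posted: dict,
              max_attempts: int = 3, base_delay: float = 1.0) -> str:
    """Post an invoice unless it was already posted; retry transient errors."""
    key = idempotency_key(
        invoice["vendor_id"], invoice["invoice_number"], invoice["total"]
    )
    if key in posted:  # idempotency check
        return posted[key]
    for attempt in range(1, max_attempts + 1):
        try:
            confirmation = send(invoice)
            posted[key] = confirmation  # keep the confirmation receipt
            return confirmation
        except TransientPostingError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

A local check like this cannot detect a post that succeeded in the ERP but whose response was lost. Where the ERP API accepts an idempotency key or enforces unique invoice numbers per vendor, passing the key through closes that gap.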
Audit Logging
Every action is logged:
{
  "document_id": "INV-2024-00123",
  "timestamp": "2024-01-20T10:30:00Z",
  "action": "FIELD_EXTRACTED",
  "field": "total_amount",
  "value": "1234.56",
  "confidence": 95,
  "method": "ocr_primary",
  "actor": "system"
}
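A small helper that produces entries in this shape might look as follows. The function name and signature are assumptions. Writing the resulting line to an append-only store is left to the caller.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(document_id: str, action: str, actor: str = "system",
                 timestamp: Optional[datetime] = None, **details) -> str:
    """Serialize one audit entry as a JSON line.

    `timestamp` should be timezone-aware; it is normalized to UTC.
    Extra keyword arguments (field, value, confidence, ...) are copied as-is.
    """
    ts = (timestamp or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "document_id": document_id,
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "action": action,
        **details,
        "actor": actor,
    }
    return json.dumps(record)
```

One JSON object per line keeps the log easy to append to and to query later, for instance through a PostgreSQL JSONB column.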
Performance Metrics
Target metrics for a mature implementation:
| Metric | Target | Acceptable |
|---|---|---|
| Straight-through processing rate | 75% | 60% |
| Extraction accuracy | 98% | 95% |
| Average processing time | < 2 min | < 5 min |
| Exception resolution time | < 4 hours | < 24 hours |
| Post-posting error rate | < 0.5% | < 1% |
Technology Stack
Recommended stack:
- Email integration: Microsoft Graph API / Gmail API
- OCR: Azure Document Intelligence
- LLM: GPT-4 or Claude for extraction enhancement
- Database: PostgreSQL with JSONB for flexible schema
- Queue: Redis or RabbitMQ for async processing
- API: Next.js API routes or FastAPI
- Admin UI: Next.js with React
Implementation Timeline
Typical implementation phases:
- Weeks 1-2: Blueprint and setup (requirements, architecture, dev environment)
- Weeks 3-4: Intake and extraction (email monitoring, OCR integration)
- Weeks 5-6: Validation and processing (matching rules, GL coding)
- Weeks 7-8: ERP integration and testing (output formatting, posting)
- Weeks 9-10: Pilot and refinement (production testing, threshold tuning)
Timeline varies based on ERP complexity and volume requirements.
