Executive Summary: Private Document Processing API
Product Definition
What: A self-hosted document OCR + extraction API that processes images/PDFs through a local vision model (qwen3-vl:8b) and returns structured JSON. Zero cloud vision processing — documents never leave your infrastructure. Designed for privacy-conscious businesses (legal, healthcare-adjacent, financial services, EU GDPR compliance).
Current State: 486 lines of working code (document_sorter.py) that:
- Receives images via Telegram DM → saves to /tmp/dropbox/ → base64 encodes → sends to Gaming PC Ollama qwen3-vl:8b
- Extracts: {vendor, date, category, amount} with strict taxonomy
- Builds filename: YYYY-MM-DD_Vendor_Category_$Amount.ext
- Uploads to Google Drive (currently blocked — Google account suspended)
- Cleans up temp files in finally block
- Uses keep_alive: 0 to immediately unload vision model and free VRAM
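The shape of the current pipeline can be sketched as follows. This is a minimal illustration, not the actual document_sorter.py code: the endpoint URL, prompt text, and helper names are assumptions; the `keep_alive: 0` and 120-second-timeout behavior match the description above.

```python
import base64
import json
import urllib.request

# Illustrative endpoint; the real setup reaches the Gaming PC over Tailscale.
OLLAMA_URL = "http://gaming-pc:11434/api/generate"

def build_payload(image_b64: str) -> dict:
    """Request body for Ollama's /api/generate with a vision model."""
    return {
        "model": "qwen3-vl:8b",
        "prompt": "Return JSON with keys vendor, date, category, amount.",
        "images": [image_b64],
        "format": "json",   # ask Ollama to constrain output to valid JSON
        "stream": False,
        "keep_alive": 0,    # unload the model immediately after, freeing VRAM
    }

def extract_fields(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        payload = build_payload(base64.b64encode(f.read()).decode())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:   # 120s timeout
        return json.loads(json.loads(resp.read())["response"])

def build_filename(fields: dict, ext: str) -> str:
    """YYYY-MM-DD_Vendor_Category_$Amount.ext"""
    return (f"{fields['date']}_{fields['vendor']}_"
            f"{fields['category']}_${float(fields['amount']):.2f}{ext}")
```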
Gap to Revenue: The code works for personal use. Selling it as an API service requires: an always-on GPU, multi-tenant auth, a web API (not Telegram), billing, PDF support, and output storage (not Google Drive).
Market Analysis
Cloud OCR Pricing (competitive benchmarks)
| Provider | Basic OCR | Forms/Tables | Custom Extraction | Notes |
|---|---|---|---|---|
| AWS Textract | $1.50/1K pages | $65/1K pages | N/A | ~43× the basic OCR rate for structured data |
| Google Document AI | $1.50/1K pages | $10/1K pages | $30/1K pages | $300 free credit |
| Azure Document Intelligence | $1.50/1K pages | $10/1K pages | $30/1K pages | 65% discount at commitment tier |
| Mistral OCR 3 | $2/1K pages | $2/1K pages | $2/1K pages | Batch: $1/1K pages (50% off) |
The privacy moat: None of these providers can say "your documents never touch a cloud server." For legal firms, healthcare-adjacent services, financial advisors, and EU businesses under GDPR — this is compliance, not preference.
Target Market Segments
| Segment | Monthly Volume | Willingness to Pay | Key Pain Point |
|---|---|---|---|
| Small law firms (1-5 attorneys) | 500-2,000 pages | $49-99/mo | Cloud OCR = malpractice risk |
| Financial advisors (RIA) | 1,000-5,000 pages | $99-299/mo | Client data in AWS = compliance nightmare |
| Healthcare-adjacent (billing, admin) | 2,000-10,000 pages | $199-499/mo | HIPAA adjacent, can't risk breaches |
| EU consultancies | 500-3,000 pages | $79-149/mo | GDPR, data sovereignty |
| Self-hosting enthusiasts | 100-1,000 pages | $29-49/mo | Ideological, not price-sensitive |
Market validation:
- Paperless-ngx has 50K+ GitHub stars — proven demand for self-hosted document management
- CleanRoll.ai raised funding for commercial real estate (CRE) rent roll extraction — validates the OCR-as-a-service niche
- Mistral OCR 3 launched December 2025 at $2/1K pages ($1/1K in batch) — proves downward pressure on cloud OCR pricing
Technical Architecture
Current Implementation (v0.1 - Personal Use)
Telegram DM (image) → /tmp/dropbox/ → base64 → HTTP POST to Gaming PC Ollama
↓
qwen3-vl:8b inference (120s timeout)
↓
JSON extraction → filename builder
↓
Google Drive upload (BLOCKED)
↓
👍 Telegram reaction
Hardware dependency: Gaming PC (3080 Ti, 12GB VRAM) must be ON and connected via Tailscale. Windows + Ollama + keep_alive: 0.
Required Architecture (v1.0 - Revenue Service)
Client POST /api/v1/extract (image/PDF + api_key)
↓
FastAPI auth layer (rate limit, tenant isolation)
↓
PDF → image conversion (if needed) via pdf2image
↓
Local Ollama OR always-on GPU endpoint
↓
qwen3-vl:8b inference (15-30s per page)
↓
JSON extraction → structured output
↓
MinIO S3-compatible storage (self-hosted, not AWS)
↓
Webhook callback OR polling endpoint for results
↓
Stripe metered billing (per-page + monthly base)
Critical change: Replace Google Drive with MinIO (self-hosted S3-compatible object storage) or local filesystem with CDN. Never touch AWS/GCP/Azure for file storage.
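The storage swap is mostly a matter of object-key layout and tenant isolation. A sketch under stated assumptions: the `results/{tenant}/{job}.json` key scheme and the local-filesystem stand-in are illustrative; in production the write would go through a MinIO client's put-object call against the self-hosted bucket.

```python
import json
from pathlib import Path

def result_key(tenant_id: str, job_id: str) -> str:
    """S3-style object key: one prefix per tenant keeps listing and
    deletion (e.g. GDPR erasure requests) scoped to a single customer."""
    return f"results/{tenant_id}/{job_id}.json"

def store_result(root: Path, tenant_id: str, job_id: str, fields: dict) -> Path:
    """Local-filesystem stand-in for a MinIO put-object call."""
    path = root / result_key(tenant_id, job_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(fields))
    return path
```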
Gap Analysis: Current → Revenue
| Component | Status | Effort | Blocker |
|---|---|---|---|
| Vision model inference | ✅ Works (Gaming PC) | — | Must be always-on |
| Multi-page PDF support | ❌ Not built | 3-4 days | pdf2image + page iteration |
| API authentication | ❌ Not built | 2-3 days | JWT or API key auth, tenant isolation |
| Rate limiting / quotas | ❌ Not built | 2-3 days | Redis or in-memory tracking |
| Async job queue | ❌ Not built | 4-5 days | Celery + Redis or FastAPI background tasks |
| Result storage (MinIO) | ❌ Not built | 2-3 days | Self-hosted S3-compatible storage |
| Webhook callbacks | ❌ Not built | 1-2 days | POST to client endpoint with results |
| Billing (Stripe) | ❌ Not built | 3-4 days | Metered billing, usage tracking |
| Dashboard / status page | ❌ Not built | 5-7 days | Web UI for job status, usage, API keys |
| PDF preprocessing | ❌ Not built | 3-4 days | Deskew, denoise, OCR optimization |
| Error handling / retries | ⚠️ Partial | 2-3 days | Dead letter queue, client alerts |
Total engineering effort: 4-6 weeks for MVP (one person, nights/weekends).
Hardware Investment Required
Current Setup Bottlenecks
| Issue | Current State | Impact |
|---|---|---|
| Gaming PC not always-on | 3080 Ti sleeps, wakes on demand | OCR unavailable 60%+ of the day |
| Windows power management | Sleep mode, updates | Unpredictable downtime |
| Tailscale dependency | Windows → Beelink → Internet | Two points of failure |
| No UPS | Power outage = data loss | Unacceptable for paid service |
Recommended Hardware Upgrades
Option A: Dedicated GPU Server (Recommended)
| Component | Cost | Purpose |
|---|---|---|
| Used RTX 3060 12GB (eBay) | $180-220 | Dedicated inference GPU, 24/7 operation |
| Low-power x86 SFF PC (used Dell/HP) | $150-250 | Host for GPU, headless Ubuntu |
| 650W PSU (if not included) | $50-80 | Power for GPU |
| PCIe riser cable (if SFF) | $25-40 | Physical fit |
| 1TB NVMe SSD | $80-100 | Model storage + job queue |
| Total | $485-690 | Always-on inference server |
Power consumption: ~150W under load, ~30W idle = ~$15-25/mo electricity.
Alternative: Used RTX 2060 Super 8GB ($120-150) — enough for qwen3-vl:8b, cheaper entry.
Option B: NVIDIA Jetson Orin Nano
| Component | Cost | Notes |
|---|---|---|
| Jetson Orin Nano 8GB Dev Kit | $499 | ARM, lower power (~25W max) |
| 256GB NVMe | $40 | Storage |
| Total | $539 | Lower power, ARM ecosystem |
Tradeoffs:
- Pros: Lower power (~$5/mo), smaller footprint, purpose-built for edge AI
- Cons: ARM architecture (some Python wheels don't exist), slower inference than desktop GPU, 8GB RAM limits concurrent jobs
Recommendation: Option A (used RTX 3060 + SFF PC). More flexible, faster inference, easier troubleshooting.
Option C: Upgrade Beelink (Not Recommended)
Intel N150 has no PCIe slot for GPU. External GPU via Thunderbolt/USB4: $300 enclosure + $200 GPU = $500, more complex, lower bandwidth. Skip.
Cost Model: Self-Hosted vs Cloud
Monthly Operating Costs (Self-Hosted)
| Cost | Amount | Notes |
|---|---|---|
| Electricity (150W × 24h × 30d ≈ 108 kWh) | $20-30 | ~$16 at $0.15/kWh; range covers higher rates and cooling overhead |
| Internet (already paid) | $0 | Home connection sufficient |
| Domain + Cloudflare (already paid) | $0 | Existing setup |
| Hardware depreciation ($600 / 36mo) | $17 | 3-year lifespan |
| Total monthly COGS | $37-47 | Per-tenant marginal cost ≈ $0 |
Pricing Strategy
Target: Undercut cloud providers by 50% while offering privacy premium.
| Tier | Price | Includes | Cloud Equivalent |
|---|---|---|---|
| Starter | $29/mo | 1,000 pages, 1 user, email support | Cloud structured extraction: $10-65 |
| Professional | $79/mo | 5,000 pages, 3 users, webhooks, SLA | Cloud structured extraction: $50-325 |
| Business | $199/mo | 20,000 pages, 10 users, API access, priority | Cloud structured extraction: $200-1,300 |
| Enterprise | $499+/mo | Unlimited pages, custom models, dedicated infra | AWS: custom quote |
Break-even: At $79/mo × 10 customers = $790/mo revenue. COGS $47/mo. Gross margin 94%.
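The break-even arithmetic above, spelled out:

```python
# Break-even check for the Professional tier (numbers from the tables above).
customers, price, cogs = 10, 79, 47    # $79/mo tier, $47/mo worst-case COGS
revenue = customers * price            # $790/mo
gross_profit = revenue - cogs          # $743/mo
margin = gross_profit / revenue        # ~0.94, the quoted 94% gross margin
```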
Dev/Test Cycles
Phase 1: Hardware (Week 1)
- Acquire used RTX 3060 + SFF PC
- Install Ubuntu 22.04 LTS, Ollama, qwen3-vl:8b
- Verify inference speed: target <30s per page
- Configure Tailscale static IP or Cloudflare Tunnel
Phase 2: Core API (Weeks 2-3)
- FastAPI scaffolding with auth (API keys)
- Image upload endpoint (sync) → returns job_id
- Async job processing with Celery + Redis
- MinIO setup for result storage
- Webhook callback system
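The auth-and-job-handoff logic in this phase is framework-independent. A minimal sketch, assuming hashed key storage; all names here are illustrative and the FastAPI/Celery wiring is omitted:

```python
import hashlib
import hmac
import secrets
import uuid
from typing import Dict, Optional

# API keys are stored only as SHA-256 hashes, so a leak of the key table
# reveals nothing usable. (Illustrative scheme, not the shipped code.)
_KEYS: Dict[str, str] = {}   # key_hash -> tenant_id

def issue_key(tenant_id: str) -> str:
    key = secrets.token_urlsafe(32)
    _KEYS[hashlib.sha256(key.encode()).hexdigest()] = tenant_id
    return key                       # shown to the customer exactly once

def authenticate(api_key: str) -> Optional[str]:
    key_hash = hashlib.sha256(api_key.encode()).hexdigest()
    for stored_hash, tenant in _KEYS.items():
        if hmac.compare_digest(stored_hash, key_hash):   # constant-time compare
            return tenant
    return None

def submit_job(api_key: str) -> dict:
    """What a POST /api/v1/extract handler would return: a job_id to poll."""
    tenant = authenticate(api_key)
    if tenant is None:
        return {"error": "invalid api key", "status": 401}
    return {"job_id": uuid.uuid4().hex, "tenant": tenant, "status": "queued"}
```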
Phase 3: PDF Support (Week 4)
- pdf2image integration for multi-page PDFs
- Page-by-page processing with progress tracking
- Zip output for multi-page docs
- Bulk upload endpoint
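The page-by-page loop might look like the following sketch. The converter is injectable so the pipeline can be tested without poppler installed; by default it would use pdf2image's `convert_from_path`, and `extract` stands in for the per-page vision-model call.

```python
from typing import Callable, List, Optional

def process_pdf(pdf_path: str, extract: Callable,
                convert: Optional[Callable] = None) -> List[dict]:
    """Split a PDF into page images and run extraction on each page."""
    if convert is None:
        # Lazy import: pdf2image requires the poppler binaries at runtime.
        from pdf2image import convert_from_path
        convert = convert_from_path
    results = []
    for page_num, image in enumerate(convert(pdf_path), start=1):
        fields = extract(image)          # per-page vision-model call
        fields["page"] = page_num        # ordering / progress metadata
        results.append(fields)
    return results
```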
Phase 4: Billing (Week 5)
- Stripe metered billing integration
- Usage tracking per API key
- Automatic overage handling
- Invoice generation
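Usage tracking underneath the Stripe integration reduces to a per-key page ledger. An in-memory sketch (class and field names are assumptions; in production each call would also report the pages to Stripe's metered-billing API):

```python
from collections import defaultdict

class UsageTracker:
    """Per-API-key page counter feeding metered billing."""
    def __init__(self, included: dict):
        self.included = included         # pages included per key, e.g. {"key1": 1000}
        self.used = defaultdict(int)

    def record_pages(self, api_key: str, pages: int) -> int:
        """Record processed pages; return current overage for this key."""
        self.used[api_key] += pages
        return self.overage(api_key)

    def overage(self, api_key: str) -> int:
        return max(0, self.used[api_key] - self.included.get(api_key, 0))
```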
Phase 5: Dashboard (Weeks 6-7)
- Minimal web UI: job status, usage charts, API key management
- Status page with uptime metrics
- Error logs view (sanitized)
Phase 6: Security Hardening (Week 8)
- Rate limiting (prevent abuse)
- Input validation (prevent injection)
- Audit logging (who processed what when)
- TLS termination via Cloudflare Tunnel
- Fail2ban for SSH/API brute force
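The rate-limiting item above can start as a simple fixed-window counter. An in-memory sketch; the Redis-backed variant from the gap analysis would keep the same counters in Redis with a TTL per window:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Per-key fixed-window rate limiter (in-memory sketch)."""
    def __init__(self, limit: int, window_s: int = 60, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock                    # injectable for testing
        self.counts = defaultdict(int)        # (key, window_index) -> count

    def allow(self, api_key: str) -> bool:
        window = int(self.clock() // self.window_s)
        bucket = (api_key, window)
        if self.counts[bucket] >= self.limit:
            return False                      # over quota for this window
        self.counts[bucket] += 1
        return True
```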
Testbench Demo: Go/No-Go Criteria
Test 1: Inference Performance
Setup: Single page receipt, qwen3-vl:8b, RTX 3060 12GB
Target:
- Cold start (model not loaded): <60 seconds
- Warm inference: <20 seconds per page
- Concurrent requests (3): <90 seconds each
Go if: Average <30s per page under load
Test 2: Accuracy Benchmark
Dataset: 100 documents (mix of receipts, invoices, contracts)
Target:
- Vendor name extraction: >90% accuracy
- Date extraction: >95% accuracy (correct format)
- Amount extraction: >95% accuracy (within $0.01)
- Category classification: >85% accuracy
Go if: Overall field extraction >90% without human correction
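Scoring this benchmark can be a per-field comparison against hand-labeled ground truth. A sketch assuming the extraction schema above, with the $0.01 tolerance applied to amounts:

```python
def _match(field, pred, true):
    if field == "amount":                    # within $0.01, per Test 2
        try:
            return abs(float(pred) - float(true)) <= 0.01
        except (TypeError, ValueError):
            return False
    return pred == true

def field_accuracy(predictions, ground_truth,
                   fields=("vendor", "date", "category", "amount")):
    """Fraction of correct extractions per field, given parallel lists of
    predicted and hand-labeled documents."""
    scores = {}
    for field in fields:
        correct = sum(1 for p, t in zip(predictions, ground_truth)
                      if _match(field, p.get(field), t.get(field)))
        scores[field] = correct / len(ground_truth)
    return scores
```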
Test 3: Uptime & Reliability
Duration: 7 days continuous operation
Target:
- Uptime: >99% (excluding planned maintenance)
- Zero memory leaks (Ollama stays responsive)
- Graceful degradation under load (queue management)
- Automatic recovery from GPU OOM
Go if: Zero unplanned outages, <5min recovery time
Test 4: End-to-End Latency
Scenario: Client POST → processing → webhook callback
Target:
- P50 latency: <45 seconds
- P95 latency: <120 seconds
- P99 latency: <300 seconds (large PDFs)
Go if: P95 <120s for single-page documents
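Computing these percentiles from logged request timings needs nothing beyond the standard library. A sketch, where `latencies_s` is an assumed list of end-to-end latencies in seconds:

```python
import statistics

def latency_percentiles(latencies_s):
    """P50/P95/P99 from per-request latencies (seconds).
    quantiles(..., n=k) returns k-1 cut points; index k-2 is the top one."""
    return {
        "p50": statistics.median(latencies_s),
        "p95": statistics.quantiles(latencies_s, n=20)[18],
        "p99": statistics.quantiles(latencies_s, n=100)[98],
    }
```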
Test 5: Cost Validation
Measurement: 30-day electricity + bandwidth
Target:
- Electricity: <$40/mo
- Bandwidth: <100GB/mo (no overage)
- Hardware: no thermal throttling, <80°C GPU
Go if: Monthly COGS <$50
Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Hardware failure (GPU/SSD) | Medium | High | Hot spare, automated backups, 2-day replacement |
| Ollama/qwen3-vl model update breaks API | Low | High | Pin model version, staged rollout, rollback plan |
| Customer data breach (local) | Low | Critical | Encryption at rest, no remote access, audit logs |
| Power outage | Medium | Medium | UPS (CyberPower 1500VA, $150), graceful shutdown |
| Internet outage | Low | High | 4G failover (optional), queue-and-retry |
| Legal liability (OCR error) | Medium | Medium | Terms-of-service disclaimer, contractual liability cap |
| Stripe account freeze | Low | High | Multi-processor backup (LemonSqueezy) |
| Beelink SSD death (cascading failure) | Medium | High | Daily backups to Gaming PC + cloud (encrypted) |
Recommendation
Go/No-Go Decision: CONDITIONAL GO
Proceed if:
1. ✅ Can invest $500-700 in dedicated GPU hardware within 30 days
2. ✅ Willing to spend 6-8 weeks part-time on MVP
3. ✅ First 3 paying customers identified (even if just "would you pay for this?" conversations)
4. ✅ Accept 6-month payback period on hardware
Defer if:
- ❌ Cannot guarantee always-on GPU (Gaming PC unreliable)
- ❌ Not willing to build web UI (Telegram-only won't scale to B2B)
- ❌ No LLC/liability protection (healthcare/legal adjacent customers)
Phased Approach
Phase 0 (Now):
- Validate demand: 5 conversations with law firms/financial advisors
- Price test: "Would you pay $79/mo for unlimited private OCR?"
- Build waitlist
Phase 1 (Month 1):
- Buy hardware, set up dedicated inference server
- Build async API + MinIO storage
- Dogfood with personal documents
Phase 2 (Month 2):
- Stripe integration, billing
- Onboard 3 beta customers at $29/mo (discounted)
- Iterate on extraction accuracy
Phase 3 (Month 3):
- Dashboard web UI
- Public launch at $79/mo
- Target: 10 customers ($790/mo) → break even
Conservative projection: 10 customers by Month 6 = $790/mo revenue, $47/mo COGS, $743/mo gross profit. Annual: ~$8,900 gross profit on ~$700 hardware investment.
Appendix: Competitive Moat Analysis
Why customers choose self-hosted over cloud:
| Customer Type | Cloud Fear | Our Pitch |
|---|---|---|
| Law firm | Malpractice if client data leaked | "Documents never leave your server. Zero cloud touch." |
| Financial advisor | SEC audit, client trust | "Audit trail shows local processing only." |
| Healthcare admin | HIPAA violation ($1.5M fine) | "No BAA needed — no third-party processing." |
| EU consultancy | GDPR Article 44 (data transfers) | "Data sovereignty guaranteed. EU server option available." |
| Privacy enthusiast | Surveillance capitalism | "Open source, self-hosted, auditable code." |
Differentiation from Paperless-ngx:
- Paperless: document management (storage, tagging, search)
- Us: document processing API (OCR, extraction, structured output)
- Complementary: customers use both
Differentiation from cloud OCR:
- Cloud: 99.9% uptime, infinite scale, higher cost, privacy risk
- Us: 99% uptime, limited scale, lower cost, zero privacy risk
The moat isn't features; it's an architecture in which no third party ever touches the documents, which cloud providers cannot replicate by definition.
Document version: 2026-04-19
Status: Draft for review
Next step: Matt's go/no-go decision, then Phase 0 validation