# Executive Summary: Private Document Processing API

## Product Definition

**What:** A self-hosted document OCR + extraction API that processes images/PDFs through a local vision model (qwen3-vl:8b) and returns structured JSON. Zero cloud vision processing — documents never leave your infrastructure. Designed for privacy-conscious businesses (legal, healthcare-adjacent, financial services, EU businesses under GDPR).

**Current State:** 486 lines of working code (`document_sorter.py`) that:

- Receives images via Telegram DM → saves to `/tmp/dropbox/` → base64 encodes → sends to Gaming PC Ollama qwen3-vl:8b
- Extracts `{vendor, date, category, amount}` with a strict taxonomy
- Builds filename: `YYYY-MM-DD_Vendor_Category_$Amount.ext`
- Uploads to Google Drive (currently blocked — Google account suspended)
- Cleans up temp files in a `finally` block
- Uses `keep_alive: 0` to immediately unload the vision model and free VRAM

**Gap to Revenue:** The code works for personal use. Selling it as an API service requires: an always-on GPU, multi-tenant auth, a web API (not Telegram), billing, PDF support, and output storage (not Google Drive).

---

## Market Analysis

### Cloud OCR Pricing (competitive benchmarks)

| Provider | Basic OCR | Forms/Tables | Custom Extraction | Notes |
|---|---|---|---|---|
| **AWS Textract** | $1.50/1K pages | $65/1K pages | N/A | ~43× the basic-OCR price for structured data |
| **Google Document AI** | $1.50/1K pages | $10/1K pages | $30/1K pages | $300 free credit |
| **Azure Document Intelligence** | $1.50/1K pages | $10/1K pages | $30/1K pages | 65% discount at commitment tier |
| **Mistral OCR 3** | $2/1K pages | $2/1K pages | $2/1K pages | Batch: $1/1K pages (50% off) |

**The privacy moat:** None of these providers can say *"your documents never touch a cloud server."* For legal firms, healthcare-adjacent services, financial advisors, and EU businesses under GDPR — this is compliance, not preference.
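As a sanity check on the table above, a few lines of Python reproduce the monthly cloud cost at a given volume. The rates are taken directly from the Forms/Tables column; the helper function and provider keys are ours, for illustration only:

```python
# Forms/tables rates from the pricing table above, USD per 1,000 pages.
FORMS_RATE_PER_1K = {
    "aws_textract": 65.00,
    "google_document_ai": 10.00,
    "azure_document_intelligence": 10.00,
    "mistral_ocr_3": 2.00,
}

def monthly_cloud_cost(provider: str, pages_per_month: int) -> float:
    """Monthly forms/tables extraction cost at the table's per-1K rates."""
    return FORMS_RATE_PER_1K[provider] * pages_per_month / 1000

# A 2,000-page/month law firm pays $130/mo for Textract forms extraction,
# versus $4/mo on Mistral — before any privacy considerations.
```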
### Target Market Segments

| Segment | Monthly Volume | Willingness to Pay | Key Pain Point |
|---|---|---|---|
| Small law firms (1-5 attorneys) | 500-2,000 pages | $49-99/mo | Cloud OCR = malpractice risk |
| Financial advisors (RIA) | 1,000-5,000 pages | $99-299/mo | Client data in AWS = compliance nightmare |
| Healthcare-adjacent (billing, admin) | 2,000-10,000 pages | $199-499/mo | HIPAA-adjacent, can't risk breaches |
| EU consultancies | 500-3,000 pages | $79-149/mo | GDPR, data sovereignty |
| Self-hosting enthusiasts | 100-1,000 pages | $29-49/mo | Ideological, not price-sensitive |

**Market validation:**

- Paperless-ngx has 50K+ GitHub stars — proven demand for self-hosted document management
- CleanRoll.ai raised funding for CRE rent roll extraction — validates the OCR-as-a-service niche
- Mistral OCR 3 launched December 2025 at $2/1K pages ($1/1K batch) — proves ongoing pressure on cloud pricing

---

## Technical Architecture

### Current Implementation (v0.1 — Personal Use)

```
Telegram DM (image) → /tmp/dropbox/ → base64 → HTTP POST to Gaming PC Ollama
        ↓
qwen3-vl:8b inference (120s timeout)
        ↓
JSON extraction → filename builder
        ↓
Google Drive upload (BLOCKED)
        ↓
👍 Telegram reaction
```

**Hardware dependency:** The Gaming PC (3080 Ti, 12GB VRAM) must be on and reachable via Tailscale. Windows + Ollama + `keep_alive: 0`.

### Required Architecture (v1.0 — Revenue Service)

```
Client POST /api/v1/extract (image/PDF + api_key)
        ↓
FastAPI auth layer (rate limit, tenant isolation)
        ↓
PDF → image conversion (if needed) via pdf2image
        ↓
Local Ollama OR always-on GPU endpoint
        ↓
qwen3-vl:8b inference (15-30s per page)
        ↓
JSON extraction → structured output
        ↓
MinIO S3-compatible storage (self-hosted, not AWS)
        ↓
Webhook callback OR polling endpoint for results
        ↓
Stripe metered billing (per-page + monthly base)
```

**Critical change:** Replace Google Drive with MinIO (self-hosted, S3-compatible object storage) or a local filesystem fronted by a CDN. Never touch AWS/GCP/Azure for file storage.
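The "JSON extraction → filename builder" step is concrete enough to sketch. The field names and the `YYYY-MM-DD_Vendor_Category_$Amount.ext` format come from the current implementation; the sanitization rules here are our assumption, not the code in `document_sorter.py`:

```python
import re

def build_filename(fields: dict, ext: str = "jpg") -> str:
    """Build YYYY-MM-DD_Vendor_Category_$Amount.ext from extracted fields.

    Assumes `fields` is the model's JSON output with keys vendor,
    date (already normalized to YYYY-MM-DD), category, and amount.
    """
    def clean(value: str) -> str:
        # Drop characters unsafe in filenames (our convention, not the source's).
        return re.sub(r"[^A-Za-z0-9-]", "", value)

    vendor = clean(fields["vendor"])
    category = clean(fields["category"])
    amount = f"{float(fields['amount']):.2f}"
    return f"{fields['date']}_{vendor}_{category}_${amount}.{ext}"

# e.g. {"vendor": "Home Depot", "date": "2026-03-14",
#       "category": "Hardware", "amount": 42.5}
# → "2026-03-14_HomeDepot_Hardware_$42.50.jpg"
```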
---

## Gap Analysis: Current → Revenue

| Component | Status | Effort | Blocker / Approach |
|---|---|---|---|
| **Vision model inference** | ✅ Works (Gaming PC) | — | Must be always-on |
| **Multi-page PDF support** | ❌ Not built | 3-4 days | pdf2image + page iteration |
| **API authentication** | ❌ Not built | 2-3 days | JWT or API key auth, tenant isolation |
| **Rate limiting / quotas** | ❌ Not built | 2-3 days | Redis or in-memory tracking |
| **Async job queue** | ❌ Not built | 4-5 days | Celery + Redis or FastAPI background tasks |
| **Result storage (MinIO)** | ❌ Not built | 2-3 days | Self-hosted S3-compatible storage |
| **Webhook callbacks** | ❌ Not built | 1-2 days | POST to client endpoint with results |
| **Billing (Stripe)** | ❌ Not built | 3-4 days | Metered billing, usage tracking |
| **Dashboard / status page** | ❌ Not built | 5-7 days | Web UI for job status, usage, API keys |
| **PDF preprocessing** | ❌ Not built | 3-4 days | Deskew, denoise, OCR optimization |
| **Error handling / retries** | ⚠️ Partial | 2-3 days | Dead letter queue, client alerts |

**Total engineering effort:** 4-6 weeks for an MVP (one person, nights/weekends).
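The auth and quota rows above don't need exotic infrastructure to prototype. A minimal in-memory sketch — constant-time API key comparison plus a fixed-window per-tenant counter — is shown below. All names are hypothetical, and in production the dict would be replaced by Redis so counters survive restarts and scale across workers:

```python
import hmac
import time
from collections import defaultdict

# tenant_id -> API key. In production this lives in a database.
API_KEYS = {"tenant_a": "sk_live_example_key"}
QUOTA_PER_MINUTE = 60

# (tenant_id, minute-window) -> request count. Redis in production.
_windows: dict = defaultdict(int)

def authenticate(tenant_id: str, presented_key: str) -> bool:
    """Constant-time API key check (avoids timing side channels)."""
    expected = API_KEYS.get(tenant_id, "")
    return hmac.compare_digest(expected, presented_key)

def allow_request(tenant_id: str, now: float = None) -> bool:
    """Fixed-window rate limit: QUOTA_PER_MINUTE requests per minute."""
    window = int((now if now is not None else time.time()) // 60)
    key = (tenant_id, window)
    if _windows[key] >= QUOTA_PER_MINUTE:
        return False
    _windows[key] += 1
    return True
```

A fixed window is the simplest scheme and can briefly admit 2× the quota at a window boundary; a sliding-window or token-bucket limiter is the usual refinement once Redis is in place.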
---

## Hardware Investment Required

### Current Setup Bottlenecks

| Issue | Current State | Impact |
|---|---|---|
| Gaming PC not always-on | 3080 Ti sleeps, wakes on demand | OCR unavailable 60%+ of the day |
| Windows power management | Sleep mode, updates | Unpredictable downtime |
| Tailscale dependency | Windows → Beelink → Internet | Two points of failure |
| No UPS | Power outage = data loss | Unacceptable for a paid service |

### Recommended Hardware Upgrades

#### Option A: Dedicated GPU Server (Recommended)

| Component | Cost | Purpose |
|---|---|---|
| Used RTX 3060 12GB (eBay) | $180-220 | Dedicated inference GPU, 24/7 operation |
| Low-power x86 SFF PC (used Dell/HP) | $150-250 | Host for GPU, headless Ubuntu |
| 650W PSU (if not included) | $50-80 | Power for GPU |
| PCIe riser cable (if SFF) | $25-40 | Physical fit |
| 1TB NVMe SSD | $80-100 | Model storage + job queue |
| **Total** | **$485-690** | Always-on inference server |

**Power consumption:** ~150W under load, ~30W idle ≈ $15-25/mo electricity.

**Alternative:** Used RTX 2060 Super 8GB ($120-150) — enough for qwen3-vl:8b, cheaper entry.

#### Option B: NVIDIA Jetson Orin Nano

| Component | Cost | Notes |
|---|---|---|
| Jetson Orin Nano 8GB Dev Kit | $499 | ARM, lower power (~25W max) |
| 256GB NVMe | $40 | Storage |
| **Total** | **$539** | Lower power, ARM ecosystem |

**Tradeoffs:**

- Pros: lower power (~$5/mo), smaller footprint, purpose-built for edge AI
- Cons: ARM architecture (some Python wheels don't exist), slower inference than a desktop GPU, 8GB RAM limits concurrent jobs

**Recommendation:** Option A (used RTX 3060 + SFF PC). More flexible, faster inference, easier troubleshooting.

#### Option C: Upgrade Beelink (Not Recommended)

The Intel N150 has no PCIe slot for a GPU. An external GPU via Thunderbolt/USB4 means a $300 enclosure + $200 GPU = $500, with more complexity and lower bandwidth. Skip.
---

## Cost Model: Self-Hosted vs Cloud

### Monthly Operating Costs (Self-Hosted)

| Cost | Amount | Notes |
|---|---|---|
| Electricity (150W × 24h × 30d ≈ 108 kWh) | $20-30 | ≈$16 at $0.15/kWh; budgeted with headroom |
| Internet (already paid) | $0 | Home connection sufficient |
| Domain + Cloudflare (already paid) | $0 | Existing setup |
| Hardware depreciation ($600 / 36 mo) | $17 | 3-year lifespan |
| **Total monthly COGS** | **$37-47** | Per-tenant marginal cost ≈ $0 |

### Pricing Strategy

**Target:** Undercut cloud providers by 50% while charging a privacy premium.

| Tier | Price | Includes | Cloud Equivalent (Textract forms) |
|---|---|---|---|
| Starter | $29/mo | 1,000 pages, 1 user, email support | $65 |
| Professional | $79/mo | 5,000 pages, 3 users, webhooks, SLA | $325 |
| Business | $199/mo | 20,000 pages, 10 users, API access, priority | $1,300 |
| Enterprise | $499+/mo | Unlimited pages, custom models, dedicated infra | Custom quote |

**Break-even:** At $79/mo × 10 customers = $790/mo revenue against $47/mo COGS — a 94% gross margin.
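The break-even claim above is reproducible in a few lines (figures from the cost table; the helper function is illustrative, not part of the codebase):

```python
def gross_margin(customers: int, price: float, cogs: float):
    """Return (monthly gross profit, gross margin) given flat monthly COGS.

    Self-hosted COGS is essentially fixed, so margin climbs with every
    added customer — unlike cloud OCR, where cost scales with page volume.
    """
    revenue = customers * price
    profit = revenue - cogs
    return profit, profit / revenue

profit, margin = gross_margin(customers=10, price=79.0, cogs=47.0)
# 10 customers at $79/mo against $47/mo COGS:
# $743/mo gross profit, ~94% gross margin.
```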
---

## Dev/Test Cycles

### Phase 1: Hardware (Week 1)

- Acquire used RTX 3060 + SFF PC
- Install Ubuntu 22.04 LTS, Ollama, qwen3-vl:8b
- Verify inference speed: target <30s per page
- Configure Tailscale static IP or Cloudflare Tunnel

### Phase 2: Core API (Weeks 2-3)

- FastAPI scaffolding with auth (API keys)
- Image upload endpoint (sync) → returns job_id
- Async job processing with Celery + Redis
- MinIO setup for result storage
- Webhook callback system

### Phase 3: PDF Support (Week 4)

- pdf2image integration for multi-page PDFs
- Page-by-page processing with progress tracking
- Zip output for multi-page docs
- Bulk upload endpoint

### Phase 4: Billing (Week 5)

- Stripe metered billing integration
- Usage tracking per API key
- Automatic overage handling
- Invoice generation

### Phase 5: Dashboard (Weeks 6-7)

- Minimal web UI: job status, usage charts, API key management
- Status page with uptime metrics
- Error logs view (sanitized)

### Phase 6: Security Hardening (Week 8)

- Rate limiting (prevent abuse)
- Input validation (prevent injection)
- Audit logging (who processed what, when)
- TLS termination via Cloudflare Tunnel
- Fail2ban for SSH/API brute force

---

## Testbench Demo: Go/No-Go Criteria

### Test 1: Inference Performance

**Setup:** Single-page receipt, qwen3-vl:8b, RTX 3060 12GB

**Target:**

- Cold start (model not loaded): <60 seconds
- Warm inference: <20 seconds per page
- Concurrent requests (3): <90 seconds each

**Go if:** Average <30s per page under load

### Test 2: Accuracy Benchmark

**Dataset:** 100 documents (mix of receipts, invoices, contracts)

**Target:**

- Vendor name extraction: >90% accuracy
- Date extraction: >95% accuracy (correct format)
- Amount extraction: >95% accuracy (within $0.01)
- Category classification: >85% accuracy

**Go if:** Overall field extraction >90% without human correction

### Test 3: Uptime & Reliability

**Duration:** 7 days continuous operation

**Target:**

- Uptime: >99% (excluding planned maintenance)
- Zero memory leaks (Ollama stays responsive)
- Graceful degradation under load (queue management)
- Automatic recovery from GPU OOM

**Go if:** Zero unplanned outages, <5 min recovery time

### Test 4: End-to-End Latency

**Scenario:** Client POST → processing → webhook callback

**Target:**

- P50 latency: <45 seconds
- P95 latency: <120 seconds
- P99 latency: <300 seconds (large PDFs)

**Go if:** P95 <120s for single-page documents

### Test 5: Cost Validation

**Measurement:** 30-day electricity + bandwidth

**Target:**

- Electricity: <$40/mo
- Bandwidth: <100GB/mo (no overage)
- Hardware: no thermal throttling, GPU <80°C

**Go if:** Monthly COGS <$50

---

## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Hardware failure (GPU/SSD) | Medium | High | Hot spare, automated backups, 2-day replacement |
| Ollama/qwen3-vl model update breaks API | Low | High | Pin model version, staged rollout, rollback plan |
| Customer data breach (local) | Low | Critical | Encryption at rest, no remote access, audit logs |
| Power outage | Medium | Medium | UPS (CyberPower 1500VA, $150), graceful shutdown |
| Internet outage | Low | High | 4G failover (optional), queue-and-retry |
| Legal liability (OCR error) | Medium | Medium | Terms-of-service disclaimer, dollar liability cap |
| Stripe account freeze | Low | High | Multi-processor backup (LemonSqueezy) |
| Beelink SSD death (cascading failure) | Medium | High | Daily backups to Gaming PC + cloud (encrypted) |

---

## Recommendation

### Go/No-Go Decision: **CONDITIONAL GO**

**Proceed if:**

1. ✅ Can invest $500-700 in dedicated GPU hardware within 30 days
2. ✅ Willing to spend 6-8 weeks part-time on the MVP
3. ✅ First 3 paying customers identified (even if just "would you pay for this?" conversations)
4.
✅ Accept a 6-month payback period on hardware

**Defer if:**

- ❌ Cannot guarantee an always-on GPU (Gaming PC unreliable)
- ❌ Not willing to build a web UI (Telegram-only won't scale to B2B)
- ❌ No LLC/liability protection (healthcare/legal-adjacent customers)

### Phased Approach

**Phase 0 (Now):**

- Validate demand: 5 conversations with law firms/financial advisors
- Price test: "Would you pay $79/mo for unlimited private OCR?"
- Build waitlist

**Phase 1 (Month 1):**

- Buy hardware, set up dedicated inference server
- Build async API + MinIO storage
- Dogfood with personal documents

**Phase 2 (Month 2):**

- Stripe integration, billing
- Onboard 3 beta customers at $29/mo (discounted)
- Iterate on extraction accuracy

**Phase 3 (Month 3):**

- Dashboard web UI
- Public launch at $79/mo
- Target: 10 customers ($790/mo) → break even

**Conservative projection:** 10 customers by Month 6 = $790/mo revenue, $47/mo COGS, $743/mo gross profit. Annualized: ~$8,900 gross profit on ~$700 hardware investment.

---

## Appendix: Competitive Moat Analysis

**Why customers choose self-hosted over cloud:**

| Customer Type | Cloud Fear | Our Pitch |
|---|---|---|
| Law firm | Malpractice if client data leaked | "Documents never leave your server. Zero cloud touch." |
| Financial advisor | SEC audit, client trust | "Audit trail shows local processing only." |
| Healthcare admin | HIPAA violation ($1.5M fine) | "No BAA needed — no third-party processing." |
| EU consultancy | GDPR Article 44 (data transfers) | "Data sovereignty guaranteed. EU server option available." |
| Privacy enthusiast | Surveillance capitalism | "Open source, self-hosted, auditable code." |

**Differentiation from Paperless-ngx:**

- Paperless: document *management* (storage, tagging, search)
- Us: document *processing API* (OCR, extraction, structured output)
- Complementary: customers use both

**Differentiation from cloud OCR:**

- Cloud: 99.9% uptime, infinite scale, higher cost, privacy risk
- Us: 99% uptime, limited scale, lower cost, zero privacy risk

The moat isn't features — it's a **zero-trust architecture** that cloud providers cannot replicate by definition.

---

*Document version: 2026-04-19*
*Status: Draft for review*
*Next step: Matt's go/no-go decision, then Phase 0 validation*