---
type: concept
created: 2026-04-07T02:00:00+02:00
updated: 2026-04-07T02:00:00+02:00
sources:
  - raw/articles/EPIC-060
tags: [product-catalog, pdf-extraction, ai, gemini, onboarding, matching]
---
# Epic 060: Product Catalog Intelligence

## The Problem
Mills already have product datasheets as PDFs; they will not type 30 fields into a web form for each product. Meanwhile, the Extractor pipeline has already processed 8,919 documents and identified 4,246 catalog products from 440 mills. The gap is connecting the Extractor's output to the marketplace's `Product` model.
## How It Works

### Step 1: Sync Bridge (B2B-060)
A one-time (re-runnable) management command imports existing `catalog_products` rows from the Extractor's SQLite database into the marketplace's PostgreSQL `Product` model. An `extractor_fingerprint` field provides idempotency: re-running the command skips rows that were already imported.
Field mapping covers 14 attributes, including a category-to-`paper_type` mapping table (e.g., "Kraftliner" -> `kraftliner`, "Uncoated Woodfree" -> `writing`, "Folding Boxboard (FBB)" -> `board`).
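A minimal sketch of the sync command's idempotency logic. The Extractor column names (`mill_id`, `name`, `gsm`, `category`) are assumptions, and an in-memory set stands in for a unique `extractor_fingerprint` column on the Django model:

```python
import hashlib
import sqlite3

# Mapping examples from the epic; the full table covers more categories.
CATEGORY_TO_PAPER_TYPE = {
    "Kraftliner": "kraftliner",
    "Uncoated Woodfree": "writing",
    "Folding Boxboard (FBB)": "board",
}

def fingerprint(row):
    """Stable hash of an Extractor row, used to skip duplicates on re-runs."""
    key = f"{row['mill_id']}|{row['name']}|{row['gsm']}"
    return hashlib.sha256(key.encode()).hexdigest()

def sync_catalog(conn, create_product, known_fingerprints=()):
    """Import catalog_products rows; skip any fingerprint already imported."""
    conn.row_factory = sqlite3.Row
    seen = set(known_fingerprints)  # in Django: a unique extractor_fingerprint column
    created = skipped = 0
    for row in conn.execute("SELECT * FROM catalog_products"):
        fp = fingerprint(row)
        if fp in seen:              # re-run: already imported, skip
            skipped += 1
            continue
        seen.add(fp)
        create_product(
            extractor_fingerprint=fp,
            name=row["name"],
            gsm=row["gsm"],
            paper_type=CATEGORY_TO_PAPER_TYPE.get(row["category"], "other"),
        )
        created += 1
    return created, skipped
```

`create_product` is injected so the sketch stays free of Django; in the real command it would be `Product.objects.create`.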
### Step 2: Datasheet Upload (B2B-061)
Mills or admins upload a PDF datasheet. The system creates a `DatasheetUpload` record with a lifecycle: pending -> processing -> extracted -> review -> accepted/rejected. The file is stored at `MEDIA_ROOT/datasheets/{mill_id}/(unknown)`.
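The lifecycle can be sketched as a small transition table. The states come from the epic; the class mirrors the model's name, but everything else here is illustrative:

```python
# Legal status transitions for a DatasheetUpload (states from the epic).
ALLOWED = {
    "pending": {"processing"},
    "processing": {"extracted"},
    "extracted": {"review"},
    "review": {"accepted", "rejected"},
    "accepted": set(),   # terminal
    "rejected": set(),   # terminal
}

class DatasheetUpload:
    def __init__(self):
        self.status = "pending"

    def advance(self, new_status):
        """Move to new_status, refusing any transition not in the table."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```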
### Step 3: AI Extraction (B2B-062)
A Celery task sends the uploaded PDF to the Extractor pipeline (Flask + SQLite + Gemini 2.0 Flash at `localhost:8925`). The task polls for completion every 3 seconds (max 120 s). Extracted product specs are stored as JSON on the `DatasheetUpload` record; transient errors are retried up to three times.
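A sketch of the polling loop, with the HTTP call injected as a callable so the example stays self-contained; the real task would fetch `GET /status/{job_id}` from the service, and the response shape shown here is an assumption:

```python
import time

def poll_extraction(job_id, fetch_status, interval=3.0, timeout=120.0, sleep=time.sleep):
    """Poll the Extractor's status endpoint every `interval` seconds until the
    job finishes, fails, or `timeout` elapses (3 s / 120 s per the epic).

    In the real Celery task this would run under something like
    @shared_task(bind=True, max_retries=3), giving three retries on
    transient errors.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)  # e.g. requests.get(f"{BASE}/status/{job_id}").json()
        if status.get("state") == "done":
            return status["result"]    # extracted specs, stored as JSON on DatasheetUpload
        if status.get("state") == "error":
            raise RuntimeError(status.get("message", "extraction failed"))
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```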
### Step 4: Product Matching (B2B-063)
Each extracted product is matched against existing Products using a scoring algorithm:
| Match Type | Confidence | Criteria |
|---|---|---|
| Exact | >= 0.95 | Same mill + paper_type + GSM within +/-2 + width within +/-10mm |
| Close | 0.60-0.94 | Same paper_type + GSM within +/-5, different mill or width |
| New | < 0.60 | No significant match -- candidate for new product creation |
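The tiers above can be sketched as a classifier. The thresholds come from the table, but the exact scoring formula is assumed; this simplified version just returns each tier's floor as the confidence:

```python
def classify_match(extracted, product):
    """Classify an extracted product against an existing one.

    Tier criteria from the epic:
      exact: same mill + paper_type, GSM within +/-2, width within +/-10 mm
      close: same paper_type, GSM within +/-5 (mill or width may differ)
      new:   anything else
    """
    same_mill = extracted["mill_id"] == product["mill_id"]
    same_type = extracted["paper_type"] == product["paper_type"]
    gsm_delta = abs(extracted["gsm"] - product["gsm"])
    width_delta = abs(extracted["width_mm"] - product["width_mm"])

    if same_mill and same_type and gsm_delta <= 2 and width_delta <= 10:
        return "exact", 0.95
    if same_type and gsm_delta <= 5:
        return "close", 0.60
    return "new", 0.0
```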
### Step 5: Admin Review (B2B-064)
An inbox-style dashboard shows all uploads with status badges. The admin can drill into any upload to see the original PDF alongside extracted products with match results and confidence scores. Per-product actions: Accept (creates or links product), Edit (modify before accepting), Skip. Bulk "Accept All" for matches with confidence >= 0.90.
### Step 6: Mill Self-Service (B2B-065)
Mill users get a drag-and-drop upload zone accepting PDF files (max 10MB). They see real-time processing status and a summary when extraction completes ("We found X products in your datasheet. An admin will review and add them to your catalog.").
## Complementary Features

### Clone Product (B2B-066)
A "Clone" button on any product duplicates it with all specs and opens the edit form pre-filled. Critical for mills producing the same paper in 6 different GSM weights -- clone, change GSM, save.
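Clone reduces to a deep copy plus field overrides; a minimal sketch (field names hypothetical):

```python
import copy

def clone_product(product, **overrides):
    """Duplicate a product with all its specs, then apply edits.

    Typical flow for a mill selling the same paper in several weights:
    clone, change GSM, save.
    """
    dup = copy.deepcopy(product)
    dup.pop("id", None)        # the clone gets a fresh identity on save
    dup.update(overrides)      # e.g. gsm=140 before the pre-filled edit form
    return dup
```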
### Excel Template Import (B2B-067)
Downloadable .xlsx template with sample data, dropdown validations for enum fields, and a reference sheet. Upload flow: parse -> preview (green=valid, red=errors) -> confirm import.
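The parse -> preview step can be sketched as row validation. The enum values and column names are assumptions, and the .xlsx parsing itself (e.g. via openpyxl) is omitted; rows arrive here as dicts:

```python
PAPER_TYPES = {"kraftliner", "writing", "board"}  # illustrative enum values

def validate_rows(rows):
    """Tag each parsed row green (valid) or red (with errors) for the preview."""
    preview = []
    for i, row in enumerate(rows, start=2):  # row 1 of the sheet is the header
        errors = []
        if row.get("paper_type") not in PAPER_TYPES:
            errors.append(f"unknown paper_type {row.get('paper_type')!r}")
        try:
            if float(row.get("gsm", "")) <= 0:
                errors.append("gsm must be positive")
        except ValueError:
            errors.append("gsm must be a number")
        preview.append({"row": i, "status": "red" if errors else "green", "errors": errors})
    return preview
```

On confirm, only all-green previews (or the green subset) would proceed to import.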
### Empty State Onboarding (B2B-068)
When a mill has zero products, a guided screen offers three paths: Upload Datasheet, Import from Excel, or Add Manually. The admin dashboard shows "X mills with 0 products" as an action item.
## Technical Architecture
The Extractor pipeline is a separate service:
- Location: `/home/claude/projects/paper-pdf-extractor/`
- Stack: Flask + SQLite + Gemini 2.0 Flash via OpenRouter
- Port: 8925 (gunicorn)
- Stats: 8,919 docs, 16,791 extracted products, 4,246 catalog products, 440 mills
- Upload endpoint: `POST /upload` (multipart, returns `job_id`)
- Status endpoint: `GET /status/{job_id}`
## Sources
- raw/articles/EPIC-060 -- full epic specification with acceptance criteria
## Related
- wiki/entities/paper-pdf-extractor -- the Flask extraction service entity
- wiki/summaries/epic-060-summary -- summary of all 9 tickets
- wiki/concepts/spec-based-matching -- the broader matching system this feeds into
- wiki/entities/morichal-ai -- earlier data import from legacy system
- wiki/concepts/phases-0-to-13-recap -- where B2B-068 onboarding was completed