---
type: concept
created: 2026-04-07T02:00:00+02:00
updated: 2026-04-07T02:00:00+02:00
sources:
  - raw/articles/EPIC-060
tags: [product-catalog, pdf-extraction, ai, gemini, onboarding, matching]
---
# Epic 060: Product Catalog Intelligence

## The Problem
Mills already have product datasheets as PDFs; they will not type 30 fields into a web form for each product. Meanwhile, the Extractor pipeline has already processed 8,919 documents and identified 4,246 catalog products from 440 mills. The gap is connecting the Extractor's output to the marketplace's `Product` model.
## How It Works

### Step 1: Sync Bridge (B2B-060)
A one-time (re-runnable) management command imports existing `catalog_products` rows from the Extractor's SQLite database into the marketplace's PostgreSQL `Product` model. An `extractor_fingerprint` field provides idempotency: re-running the command skips rows that were already imported.
Field mapping covers 14 attributes, including a category-to-`paper_type` mapping table (e.g., "Kraftliner" -> `kraftliner`, "Uncoated Woodfree" -> `writing`, "Folding Boxboard (FBB)" -> `board`).
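A minimal sketch of the sync command's idempotency logic. The Extractor column names (`mill_id`, `name`, `gsm`, `category`) are assumptions, and an in-memory set stands in for a unique `extractor_fingerprint` column on the Django model:

```python
import hashlib
import sqlite3

# Mapping examples from the epic; the full table covers more categories.
CATEGORY_TO_PAPER_TYPE = {
    "Kraftliner": "kraftliner",
    "Uncoated Woodfree": "writing",
    "Folding Boxboard (FBB)": "board",
}

def fingerprint(row):
    """Stable hash of an Extractor row, used to skip duplicates on re-runs."""
    key = f"{row['mill_id']}|{row['name']}|{row['gsm']}"
    return hashlib.sha256(key.encode()).hexdigest()

def sync_catalog(conn, create_product, known_fingerprints=()):
    """Import catalog_products rows; skip any fingerprint already imported."""
    conn.row_factory = sqlite3.Row
    seen = set(known_fingerprints)  # in Django: a unique extractor_fingerprint column
    created = skipped = 0
    for row in conn.execute("SELECT * FROM catalog_products"):
        fp = fingerprint(row)
        if fp in seen:              # re-run: already imported, skip
            skipped += 1
            continue
        seen.add(fp)
        create_product(
            extractor_fingerprint=fp,
            name=row["name"],
            gsm=row["gsm"],
            paper_type=CATEGORY_TO_PAPER_TYPE.get(row["category"], "other"),
        )
        created += 1
    return created, skipped
```

`create_product` is injected so the sketch stays free of Django; in the real command it would be `Product.objects.create`.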
### Step 2: Datasheet Upload (B2B-061)
Mills or admins upload a PDF datasheet. The system creates a `DatasheetUpload` record with a lifecycle: pending -> processing -> extracted -> review -> accepted/rejected. The file is stored at `MEDIA_ROOT/datasheets/{mill_id}/(unknown)`.
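The lifecycle can be sketched as a small transition table. The states come from the epic; the class mirrors the model's name, but everything else here is illustrative:

```python
# Legal status transitions for a DatasheetUpload (states from the epic).
ALLOWED = {
    "pending": {"processing"},
    "processing": {"extracted"},
    "extracted": {"review"},
    "review": {"accepted", "rejected"},
    "accepted": set(),   # terminal
    "rejected": set(),   # terminal
}

class DatasheetUpload:
    def __init__(self):
        self.status = "pending"

    def advance(self, new_status):
        """Move to new_status, refusing any transition not in the table."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```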
### Step 3: AI Extraction (B2B-062)
A Celery task sends the uploaded PDF to the Extractor pipeline (Flask + SQLite + Gemini 2.0 Flash at `localhost:8925`). The task polls for completion every 3 seconds (max 120 s). Extracted product specs are stored as JSON on the `DatasheetUpload` record; transient errors are retried up to three times.
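A sketch of the polling loop, with the HTTP call injected as a callable so the example stays self-contained; the real task would fetch `GET /status/{job_id}` from the service, and the response shape shown here is an assumption:

```python
import time

def poll_extraction(job_id, fetch_status, interval=3.0, timeout=120.0, sleep=time.sleep):
    """Poll the Extractor's status endpoint every `interval` seconds until the
    job finishes, fails, or `timeout` elapses (3 s / 120 s per the epic).

    In the real Celery task this would run under something like
    @shared_task(bind=True, max_retries=3), giving three retries on
    transient errors.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)  # e.g. requests.get(f"{BASE}/status/{job_id}").json()
        if status.get("state") == "done":
            return status["result"]    # extracted specs, stored as JSON on DatasheetUpload
        if status.get("state") == "error":
            raise RuntimeError(status.get("message", "extraction failed"))
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```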
### Step 4: Product Matching (B2B-063)
Each extracted product is matched against existing Products using a scoring algorithm:
| Match Type | Confidence | Criteria |
|---|---|---|
| Exact | >= 0.95 | Same mill + paper_type + GSM within +/-2 + width within +/-10mm |
| Close | 0.60-0.94 | Same paper_type + GSM within +/-5, different mill or width |
| New | < 0.60 | No significant match -- candidate for new product creation |
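The tiers above can be sketched as a classifier. The thresholds come from the table, but the exact scoring formula is assumed; this simplified version just returns each tier's floor as the confidence:

```python
def classify_match(extracted, product):
    """Classify an extracted product against an existing one.

    Tier criteria from the epic:
      exact: same mill + paper_type, GSM within +/-2, width within +/-10 mm
      close: same paper_type, GSM within +/-5 (mill or width may differ)
      new:   anything else
    """
    same_mill = extracted["mill_id"] == product["mill_id"]
    same_type = extracted["paper_type"] == product["paper_type"]
    gsm_delta = abs(extracted["gsm"] - product["gsm"])
    width_delta = abs(extracted["width_mm"] - product["width_mm"])

    if same_mill and same_type and gsm_delta <= 2 and width_delta <= 10:
        return "exact", 0.95
    if same_type and gsm_delta <= 5:
        return "close", 0.60
    return "new", 0.0
```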
### Step 5: Admin Review (B2B-064)
An inbox-style dashboard shows all uploads with status badges. The admin can drill into any upload to see the original PDF alongside extracted products with match results and confidence scores. Per-product actions: Accept (creates or links product), Edit (modify before accepting), Skip. Bulk "Accept All" for matches with confidence >= 0.90.
### Step 6: Mill Self-Service (B2B-065)
Mill users get a drag-and-drop upload zone accepting PDF files (max 10MB). They see real-time processing status and a summary when extraction completes ("We found X products in your datasheet. An admin will review and add them to your catalog.").
## Complementary Features

### Clone Product (B2B-066)
A "Clone" button on any product duplicates it with all specs and opens the edit form pre-filled. Critical for mills producing the same paper in 6 different GSM weights -- clone, change GSM, save.
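Clone reduces to a deep copy plus field overrides; a minimal sketch (field names hypothetical):

```python
import copy

def clone_product(product, **overrides):
    """Duplicate a product with all its specs, then apply edits.

    Typical flow for a mill selling the same paper in several weights:
    clone, change GSM, save.
    """
    dup = copy.deepcopy(product)
    dup.pop("id", None)        # the clone gets a fresh identity on save
    dup.update(overrides)      # e.g. gsm=140 before the pre-filled edit form
    return dup
```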
### Excel Template Import (B2B-067)
Downloadable .xlsx template with sample data, dropdown validations for enum fields, and a reference sheet. Upload flow: parse -> preview (green=valid, red=errors) -> confirm import.
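The parse -> preview step can be sketched as row validation. The enum values and column names are assumptions, and the .xlsx parsing itself (e.g. via openpyxl) is omitted; rows arrive here as dicts:

```python
PAPER_TYPES = {"kraftliner", "writing", "board"}  # illustrative enum values

def validate_rows(rows):
    """Tag each parsed row green (valid) or red (with errors) for the preview."""
    preview = []
    for i, row in enumerate(rows, start=2):  # row 1 of the sheet is the header
        errors = []
        if row.get("paper_type") not in PAPER_TYPES:
            errors.append(f"unknown paper_type {row.get('paper_type')!r}")
        try:
            if float(row.get("gsm", "")) <= 0:
                errors.append("gsm must be positive")
        except ValueError:
            errors.append("gsm must be a number")
        preview.append({"row": i, "status": "red" if errors else "green", "errors": errors})
    return preview
```

On confirm, only all-green previews (or the green subset) would proceed to import.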
### Empty State Onboarding (B2B-068)
When a mill has zero products, a guided screen offers three paths: Upload Datasheet, Import from Excel, or Add Manually. The admin dashboard shows "X mills with 0 products" as an action item.
## Technical Architecture
The Extractor pipeline is a separate service:
- Location: `/home/claude/projects/paper-pdf-extractor/`
- Stack: Flask + SQLite + Gemini 2.0 Flash via OpenRouter
- Port: 8925 (gunicorn)
- Stats: 8,919 docs, 16,791 extracted products, 4,246 catalog products, 440 mills
- Upload endpoint: `POST /upload` (multipart, returns `job_id`)
- Status endpoint: `GET /status/{job_id}`
## Sources
- raw/articles/EPIC-060 -- full epic specification with acceptance criteria
## Related
- wiki/entities/paper-pdf-extractor -- the Flask extraction service entity
- wiki/summaries/epic-060-summary -- summary of all 9 tickets
- wiki/concepts/spec-based-matching -- the broader matching system this feeds into
- wiki/entities/morichal-ai -- earlier data import from legacy system
- wiki/concepts/phases-0-to-13-recap -- where B2B-068 onboarding was completed