- type: entity
- created: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
- updated: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
- sources: raw/articles/EPIC-060
- tags: entity extractor flask gemini pdf ai extraction
Paper PDF Extractor
Overview
The Extractor is a standalone service that runs separately from the main marketplace. It takes unstructured PDF datasheets from paper mills and extracts structured product data (paper type, GSM, width, coating, certifications, etc.) using AI-powered document understanding.
Technical Details
| Attribute | Value |
|---|---|
| Location | /home/claude/projects/paper-pdf-extractor/ |
| Stack | Flask (Python) |
| Database | SQLite at paper_data.db |
| AI Model | Gemini 2.0 Flash via OpenRouter (google/gemini-2.0-flash-001) |
| Port | 8925 (gunicorn) |
| PM2 Name | None (runs independently of the PM2 stack) |
API Endpoints
Upload a Datasheet
POST http://localhost:8925/upload
Content-Type: multipart/form-data
Body: file=<PDF>
Response: {"job_id": "abc123", "status": "queued"}
Check Processing Status
GET http://localhost:8925/status/{job_id}
Response: {"job_id": "abc123", "status": "done", "products": [...]}
Status values: queued, processing, done, error
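
A client needs to tell a finished job from one that is still running. The sketch below parses a status response and checks it against the four documented status values. Treating `done` and `error` as the terminal states is an assumption drawn from the names, and `status_url` and `parse_status` are hypothetical helper names, not part of the Extractor.

```python
import json

BASE_URL = "http://localhost:8925"
STATUS_VALUES = {"queued", "processing", "done", "error"}
TERMINAL_STATUSES = {"done", "error"}  # assumption: no further transitions after these


def status_url(job_id: str) -> str:
    """Build the status endpoint URL for a job."""
    return f"{BASE_URL}/status/{job_id}"


def parse_status(body: str) -> dict:
    """Parse a status response body and flag whether the job has finished."""
    payload = json.loads(body)
    status = payload.get("status")
    if status not in STATUS_VALUES:
        raise ValueError(f"unexpected status: {status!r}")
    return {
        "job_id": payload.get("job_id"),
        "status": status,
        "finished": status in TERMINAL_STATUSES,
        # "products" only appears in the documented "done" response.
        "products": payload.get("products", []),
    }
```

The upload itself is a standard multipart POST with a `file` field, so any HTTP client that supports multipart form data should work.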
Processing Pipeline
- Upload: PDF file received, job created with status queued
- Text extraction: PDF text and layout extracted
- AI analysis: Gemini 2.0 Flash analyzes the document, identifying product specifications
- Structuring: Extracted data normalized into structured product records
- Cataloging: Products deduplicated and added to the catalog
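
The cataloging step can be illustrated with a fingerprint-based deduplication sketch. The source does not say how the `fingerprint` field is computed, so the recipe here (a SHA-256 hash over a few normalized fields) is purely an assumption for illustration, as are the function names.

```python
import hashlib


def fingerprint(product: dict) -> str:
    """Hypothetical fingerprint: hash of a few normalized identifying fields."""
    keys = ("mill_id", "name", "gsm", "coating")
    parts = [str(product.get(key, "")).strip().lower() for key in keys]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(raw_products: list) -> list:
    """Keep the first product seen for each fingerprint, preserving order."""
    catalog = {}
    for product in raw_products:
        catalog.setdefault(fingerprint(product), product)
    return list(catalog.values())
```

Two raw records that differ only in letter case or surrounding whitespace collapse into one catalog entry under this scheme.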
Data Statistics
| Metric | Count |
|---|---|
| Documents processed | 8,919 |
| Extracted products (raw) | 16,791 |
| Catalog products (deduplicated) | 4,246 |
| Mills identified | 440 |
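
A few ratios derived from these counts help put them in context. The snippet below is plain arithmetic on the table values; the variable names are chosen here for illustration.

```python
documents = 8_919
raw_products = 16_791
catalog_products = 4_246
mills = 440

# Average number of raw products extracted from one datasheet.
products_per_document = raw_products / documents

# Average number of raw records merged into one catalog product.
raw_per_catalog = raw_products / catalog_products

# Share of raw records removed by deduplication.
dedup_reduction = 1 - catalog_products / raw_products

# Average number of catalog products per identified mill.
catalog_per_mill = catalog_products / mills

print(f"{products_per_document:.2f} raw products per document")
print(f"{raw_per_catalog:.2f} raw records per catalog product")
print(f"{dedup_reduction:.1%} of raw records removed by deduplication")
print(f"{catalog_per_mill:.2f} catalog products per mill")
```

On average, that is about 1.88 raw products per document and about 9.65 catalog products per mill. Deduplication removes roughly 74.7% of raw records, or close to four raw records per catalog product.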
Database Schema (SQLite)
The Extractor's SQLite database contains:
- mills table -- mill names and metadata
- catalog_products table -- deduplicated product records with fields: name, category, gsm, coating, color, width_mm, height_mm, presentation, certifications, fiber_source, quality_grade, product_code, brand, fingerprint, mill_id
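
The two tables can be sketched as an in-memory SQLite schema for local experiments. The column names in `catalog_products` follow the list above. The column types, the `id` primary keys, the `UNIQUE` constraint on `fingerprint`, and the columns of `mills` are assumptions, since the source does not give the actual DDL.

```python
import sqlite3

# Assumed DDL: only the catalog_products column names come from the source.
SCHEMA = """
CREATE TABLE mills (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE catalog_products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    category TEXT,
    gsm REAL,
    coating TEXT,
    color TEXT,
    width_mm REAL,
    height_mm REAL,
    presentation TEXT,
    certifications TEXT,
    fiber_source TEXT,
    quality_grade TEXT,
    product_code TEXT,
    brand TEXT,
    fingerprint TEXT UNIQUE,
    mill_id INTEGER REFERENCES mills(id)
);
"""


def open_sketch_db() -> sqlite3.Connection:
    """Create an in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def products_with_mill(conn: sqlite3.Connection) -> list:
    """Return catalog products joined with their mill name."""
    query = (
        "SELECT p.name, p.gsm, m.name AS mill_name "
        "FROM catalog_products p JOIN mills m ON m.id = p.mill_id "
        "ORDER BY p.name"
    )
    return conn.execute(query).fetchall()
```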
Integration with Marketplace
Sync Bridge (B2B-060)
The sync_extractor management command imports catalog_products from the Extractor's SQLite database into the marketplace's PostgreSQL Product model. Field mapping:
| Extractor Field | Marketplace Field |
|---|---|
| name | name |
| category | paper_type (via mapping table) |
| gsm | gsm |
| coating | coating |
| color | color |
| width_mm | width_mm |
| height_mm | length_mm |
| presentation | form |
| certifications (comma-sep) | certifications (JSON array) |
| fiber_source | fiber_type |
| quality_grade | quality |
| product_code | product_code |
| brand | brand |
| fingerprint | extractor_fingerprint |
| mill_id + mill_name | mill (FK, fuzzy matched) |
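
The direct renames in this table can be written as a lookup plus one conversion for certifications. This is a minimal sketch, not the actual `sync_extractor` code. `to_marketplace_fields` is a hypothetical name, and the category mapping and fuzzy mill matching are deliberately left out because they need their own lookups.

```python
# Extractor field -> marketplace field, for fields copied without conversion.
DIRECT_FIELDS = {
    "name": "name",
    "gsm": "gsm",
    "coating": "coating",
    "color": "color",
    "width_mm": "width_mm",
    "height_mm": "length_mm",
    "presentation": "form",
    "fiber_source": "fiber_type",
    "quality_grade": "quality",
    "product_code": "product_code",
    "brand": "brand",
    "fingerprint": "extractor_fingerprint",
}


def to_marketplace_fields(row: dict) -> dict:
    """Rename Extractor fields and split certifications into a list."""
    fields = {dest: row.get(src) for src, dest in DIRECT_FIELDS.items()}
    # Comma-separated string becomes a list, ready to store as a JSON array.
    raw_certs = row.get("certifications") or ""
    fields["certifications"] = [
        cert.strip() for cert in raw_certs.split(",") if cert.strip()
    ]
    return fields
```

For example, a row with `height_mm` of 1000 and certifications `"FSC, PEFC"` yields `length_mm` of 1000 and the list `["FSC", "PEFC"]`.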
Live Integration (B2B-062)
When a mill uploads a new PDF datasheet through the marketplace UI, a Celery task sends it to the Extractor for processing. The task polls the status endpoint every 3 seconds (max 120 seconds) until extraction completes, then stores the results as JSON on the DatasheetUpload record.
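
The polling behaviour described above (every 3 seconds, up to 120 seconds) can be sketched as a small loop. This is not the actual Celery task: `wait_for_extraction` and its injectable `fetch_status` and `sleep` parameters are hypothetical, and returning the payload on `error` rather than raising is an assumption.

```python
import time
from typing import Callable

POLL_INTERVAL_S = 3
MAX_WAIT_S = 120


def wait_for_extraction(
    job_id: str,
    fetch_status: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    interval: int = POLL_INTERVAL_S,
    max_wait: int = MAX_WAIT_S,
) -> dict:
    """Poll the status of a job until it finishes or max_wait is exceeded."""
    waited = 0
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in ("done", "error"):
            return payload
        # Stop before sleeping past the time budget.
        if waited + interval > max_wait:
            raise TimeoutError(f"job {job_id} not finished after {waited}s")
        sleep(interval)
        waited += interval
```

Passing `fetch_status` and `sleep` as arguments keeps the loop testable without a running Extractor or real delays. In a real task, `fetch_status` would issue the GET request to the status endpoint.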
Category Mapping
The Extractor uses paper industry categories that map to the marketplace's PaperType enum:
| Extractor Category | Marketplace paper_type |
|---|---|
| Kraftliner | kraftliner |
| Testliner | testliner |
| Fluting Medium | fluting |
| Coated Paper C2S | coated |
| Uncoated Woodfree | writing |
| Newsprint | newsprint |
| Folding Boxboard (FBB) | board |
| Solid Bleached Board (SBS) | board |
| Kraft Paper | kraft |
| Thermal Paper | thermal |
| NCR / Carbonless Paper | ncr |
| Greaseproof Paper | greaseproof |
| (others) | other |
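
The table translates directly into a dictionary lookup with a fallback. The entries below are taken from the table. The exact-string matching and the `map_category` helper name are assumptions, as the source does not show how the mapping is implemented.

```python
CATEGORY_TO_PAPER_TYPE = {
    "Kraftliner": "kraftliner",
    "Testliner": "testliner",
    "Fluting Medium": "fluting",
    "Coated Paper C2S": "coated",
    "Uncoated Woodfree": "writing",
    "Newsprint": "newsprint",
    "Folding Boxboard (FBB)": "board",
    "Solid Bleached Board (SBS)": "board",
    "Kraft Paper": "kraft",
    "Thermal Paper": "thermal",
    "NCR / Carbonless Paper": "ncr",
    "Greaseproof Paper": "greaseproof",
}


def map_category(category):
    """Map an Extractor category to a paper_type value, defaulting to 'other'."""
    key = (category or "").strip()
    return CATEGORY_TO_PAPER_TYPE.get(key, "other")
```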
Sources
- raw/articles/EPIC-060 -- full specification of the Extractor integration
Related
- wiki/concepts/epic-060-product-catalog-intelligence -- the epic that connects the Extractor to the marketplace
- wiki/summaries/epic-060-summary -- summary of all Epic 060 tickets
- wiki/entities/morichal-ai -- another data source (legacy, not AI-extracted)
- wiki/concepts/phases-0-to-13-recap -- build context
- wiki/entities/pm2-marketplace-stack -- the PM2 ecosystem (Extractor runs separately)