type: entity
created: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
updated: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
sources: raw/articles/EPIC-060
tags: entity extractor flask gemini pdf ai extraction

Paper PDF Extractor

abstract

The Paper PDF Extractor is a Flask + SQLite service running at localhost:8925 that uses Gemini 2.0 Flash (via OpenRouter) to extract structured product specifications from paper mill PDF datasheets. It has processed 8,919 documents, yielding 16,791 extracted products and 4,246 deduplicated catalog products from 440 mills.

Overview

The Extractor is a standalone service, separate from the main marketplace. It takes unstructured PDF datasheets from paper mills and extracts structured product data (paper type, GSM, width, coating, certifications, etc.) using Gemini 2.0 Flash for document understanding.

Technical Details

Attribute	Value
Location	`/home/claude/projects/paper-pdf-extractor/`
Stack	Flask (Python)
Database	SQLite at `paper_data.db`
AI Model	Gemini 2.0 Flash via OpenRouter (`google/gemini-2.0-flash-001`)
Port	8925 (gunicorn)
PM2 Name	None (runs independently of the PM2 marketplace stack)

API Endpoints

Upload a Datasheet

POST http://localhost:8925/upload
Content-Type: multipart/form-data
Body: file=<PDF>
Response: {"job_id": "abc123", "status": "queued"}
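
A minimal client sketch for the upload call, using only the Python standard library. The helper names (build_multipart, upload_datasheet) and the 30-second timeout are illustrative assumptions; only the URL, the `file` field name and the `job_id` response key come from the endpoint description above.

```python
import json
import uuid
import urllib.request

EXTRACTOR_URL = "http://localhost:8925"  # host and port from the Technical Details table


def build_multipart(field, filename, payload):
    """Encode a single file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def upload_datasheet(path, base_url=EXTRACTOR_URL):
    """POST a PDF to /upload and return the job_id from the JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", path.rsplit("/", 1)[-1], fh.read())
    request = urllib.request.Request(
        f"{base_url}/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumed timeout; the service's real limits are not documented here.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["job_id"]
```

Calling upload_datasheet("datasheet.pdf") against a running Extractor would return the job ID to use with the status endpoint.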

Check Processing Status

GET http://localhost:8925/status/{job_id}
Response: {"job_id": "abc123", "status": "done", "products": [...]}

Status values: queued, processing, done, error
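
The four status values can be modelled as an enum so that callers fail fast on an unexpected value. This is a sketch under assumptions: JobStatus, parse_status and fetch_status are hypothetical names, and the `products` key is treated as optional because the response shape for non-`done` states is not documented.

```python
import json
import urllib.request
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"
    ERROR = "error"

    @property
    def is_terminal(self):
        """True once the job can no longer change state."""
        return self in (JobStatus.DONE, JobStatus.ERROR)


def parse_status(raw):
    """Decode a /status response body into (JobStatus, products)."""
    payload = json.loads(raw)
    # JobStatus(...) raises ValueError for any value outside the four listed.
    return JobStatus(payload["status"]), payload.get("products", [])


def fetch_status(job_id, base_url="http://localhost:8925"):
    """GET /status/{job_id} and parse the JSON response."""
    with urllib.request.urlopen(f"{base_url}/status/{job_id}", timeout=10) as response:
        return parse_status(response.read())
```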

Processing Pipeline

Upload: PDF file received, job created with status queued
Text extraction: PDF text and layout extracted
AI analysis: Gemini 2.0 Flash analyzes the document, identifying product specifications
Structuring: Extracted data normalized into structured product records
Cataloging: Products deduplicated and added to the catalog
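
The cataloging step can be sketched as follows, assuming that the `fingerprint` field stored on catalog products is the deduplication key. How the Extractor actually computes fingerprints and which duplicate it keeps are not described here, so this keeps the first record seen per fingerprint purely for illustration.

```python
def deduplicate(records):
    """Keep the first record for each fingerprint, preserving input order.

    Assumption: every record is a dict with a 'fingerprint' key.
    """
    first_seen = {}
    for record in records:
        first_seen.setdefault(record["fingerprint"], record)
    return list(first_seen.values())
```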

Data Statistics

Metric	Count
Documents processed	8,919
Extracted products (raw)	16,791
Catalog products (deduplicated)	4,246
Mills identified	440

Database Schema (SQLite)

The Extractor's SQLite database contains:

mills table -- mill names and metadata
catalog_products table -- deduplicated product records with fields: name, category, gsm, coating, color, width_mm, height_mm, presentation, certifications, fiber_source, quality_grade, product_code, brand, fingerprint, mill_id
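
A read-only query sketch against these two tables using Python's sqlite3 module. The `mills.id` and `mills.name` column names are assumptions (the page only says the table holds mill names and metadata); the catalog_products columns are taken from the list above.

```python
import sqlite3


def open_catalog(db_path="paper_data.db"):
    """Open the Extractor database with dict-like row access."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    return conn


def products_for_mill(conn, mill_name):
    """Return catalog products for one mill, lightest grammage first."""
    query = """
        SELECT p.name, p.category, p.gsm, p.fingerprint
        FROM catalog_products AS p
        JOIN mills AS m ON m.id = p.mill_id  -- assumed key columns
        WHERE m.name = ?
        ORDER BY p.gsm
    """
    return conn.execute(query, (mill_name,)).fetchall()
```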

Integration with Marketplace

Sync Bridge (B2B-060)

The sync_extractor management command imports catalog_products from the Extractor's SQLite database into the marketplace's PostgreSQL Product model. Field mapping:

Extractor Field	Marketplace Field
name	name
category	paper_type (via mapping table)
gsm	gsm
coating	coating
color	color
width_mm	width_mm
height_mm	length_mm
presentation	form
certifications (comma-sep)	certifications (JSON array)
fiber_source	fiber_type
quality_grade	quality
product_code	product_code
brand	brand
fingerprint	extractor_fingerprint
mill_id + mill_name	mill (FK, fuzzy matched)
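
The direct renames in this table can be expressed as a dictionary plus one conversion for certifications. This is a sketch, not the sync_extractor implementation: the function names are hypothetical, and the category mapping and fuzzy mill matching are deliberately left out because their logic lives elsewhere.

```python
# Extractor column -> marketplace Product field (direct copies and renames only).
FIELD_MAP = {
    "name": "name",
    "gsm": "gsm",
    "coating": "coating",
    "color": "color",
    "width_mm": "width_mm",
    "height_mm": "length_mm",
    "presentation": "form",
    "fiber_source": "fiber_type",
    "quality_grade": "quality",
    "product_code": "product_code",
    "brand": "brand",
    "fingerprint": "extractor_fingerprint",
}


def split_certifications(raw):
    """Turn 'FSC, PEFC' into ['FSC', 'PEFC']; empty or missing input gives []."""
    return [part.strip() for part in (raw or "").split(",") if part.strip()]


def to_marketplace_fields(row):
    """Map one catalog_products row (as a dict) to marketplace field names."""
    mapped = {target: row.get(source) for source, target in FIELD_MAP.items()}
    mapped["certifications"] = split_certifications(row.get("certifications"))
    return mapped
```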

Live Integration (B2B-062)

When a mill uploads a new PDF datasheet through the marketplace UI, a Celery task sends it to the Extractor for processing. The task polls the status endpoint every 3 seconds, for up to 120 seconds, until extraction completes, then stores the results as JSON on the DatasheetUpload record.
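
The polling behaviour can be sketched as a plain function, independent of Celery. The function name, the injected `fetch_status` and `sleep` callables, and the choice to raise on `error` or on timeout are assumptions for illustration; only the 3-second interval and 120-second limit come from the description. Saving the result on the DatasheetUpload record is left to the caller.

```python
import time


def wait_for_extraction(job_id, fetch_status, interval=3.0, max_wait=120.0, sleep=time.sleep):
    """Poll until the job is done, fails, or max_wait seconds of sleeping have elapsed.

    fetch_status(job_id) must return the decoded JSON dict from the status endpoint.
    """
    waited = 0.0
    while True:
        payload = fetch_status(job_id)
        status = payload.get("status")
        if status == "done":
            return payload
        if status == "error":
            raise RuntimeError(f"extraction failed for job {job_id}")
        if waited >= max_wait:
            raise TimeoutError(f"job {job_id} still '{status}' after {max_wait:.0f}s")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop testable without real delays.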

Category Mapping

The Extractor uses paper industry categories that map to the marketplace's PaperType enum:

Extractor Category	Marketplace paper_type
Kraftliner	kraftliner
Testliner	testliner
Fluting Medium	fluting
Coated Paper C2S	coated
Uncoated Woodfree	writing
Newsprint	newsprint
Folding Boxboard (FBB)	board
Solid Bleached Board (SBS)	board
Kraft Paper	kraft
Thermal Paper	thermal
NCR / Carbonless Paper	ncr
Greaseproof Paper	greaseproof
(others)	other
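
The table above amounts to a lookup with a fallback to `other`. In this sketch the dictionary and function names are hypothetical, and matching is assumed to be exact after trimming whitespace; whether the real mapping is case-insensitive is not stated.

```python
CATEGORY_TO_PAPER_TYPE = {
    "Kraftliner": "kraftliner",
    "Testliner": "testliner",
    "Fluting Medium": "fluting",
    "Coated Paper C2S": "coated",
    "Uncoated Woodfree": "writing",
    "Newsprint": "newsprint",
    "Folding Boxboard (FBB)": "board",
    "Solid Bleached Board (SBS)": "board",
    "Kraft Paper": "kraft",
    "Thermal Paper": "thermal",
    "NCR / Carbonless Paper": "ncr",
    "Greaseproof Paper": "greaseproof",
}


def map_category(category):
    """Return the marketplace paper_type for an Extractor category, or 'other'."""
    return CATEGORY_TO_PAPER_TYPE.get((category or "").strip(), "other")
```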

Sources

raw/articles/EPIC-060 -- full specification of the Extractor integration

wiki/concepts/epic-060-product-catalog-intelligence -- the epic that connects the Extractor to the marketplace
wiki/summaries/epic-060-summary -- summary of all Epic 060 tickets
wiki/entities/morichal-ai -- another data source (legacy, not AI-extracted)
wiki/concepts/phases-0-to-13-recap -- build context
wiki/entities/pm2-marketplace-stack -- the PM2 ecosystem (Extractor runs separately)