type: entity
created: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
updated: Tue Apr 07 2026 02:00:00 GMT+0200 (Central European Summer Time)
sources: raw/articles/EPIC-060
tags: entity extractor flask gemini pdf ai extraction

Paper PDF Extractor

abstract
The Paper PDF Extractor is a Flask + SQLite service running at localhost:8925 that uses Gemini 2.0 Flash (via OpenRouter) to extract structured product specifications from paper mill PDF datasheets. It has processed 8,919 documents, yielding 16,791 extracted products and 4,246 catalog products from 440 mills.

Overview

The Extractor is a standalone service, separate from the main marketplace. It takes unstructured PDF datasheets from paper mills and extracts structured product data (paper type, GSM, width, coating, certifications, etc.) using AI-powered document understanding.

Technical Details

Attribute   Value
Location    /home/claude/projects/paper-pdf-extractor/
Stack       Flask (Python)
Database    SQLite at paper_data.db
AI Model    Gemini 2.0 Flash via OpenRouter (google/gemini-2.0-flash-001)
Port        8925 (gunicorn)
PM2 Name    (runs independently)

API Endpoints

Upload a Datasheet

POST http://localhost:8925/upload
Content-Type: multipart/form-data
Body: file=<PDF>
Response: {"job_id": "abc123", "status": "queued"}
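
A minimal client sketch for the upload call, using only the Python standard library. The helper names build_multipart and upload_pdf are hypothetical, as are the timeout and the application/pdf part type; only the URL, the file field name, and the response shape come from the endpoint description above.

```python
import json
import urllib.request
import uuid

UPLOAD_URL = "http://localhost:8925/upload"  # from the endpoint description


def build_multipart(filename: str, data: bytes, boundary: str) -> bytes:
    """Encode one PDF as a multipart/form-data body under the field name 'file'."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail


def upload_pdf(path: str, url: str = UPLOAD_URL) -> dict:
    """POST the PDF and return the parsed JSON, e.g. {"job_id": ..., "status": "queued"}."""
    boundary = uuid.uuid4().hex
    with open(path, "rb") as fh:
        body = build_multipart(path.rsplit("/", 1)[-1], fh.read(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

The returned job_id is what the status endpoint expects in its URL path.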

Check Processing Status

GET http://localhost:8925/status/{job_id}
Response: {"job_id": "abc123", "status": "done", "products": [...]}

Status values: queued, processing, done, error

Processing Pipeline

  1. Upload: PDF file received, job created with status queued
  2. Text extraction: PDF text and layout extracted
  3. AI analysis: Gemini 2.0 Flash analyzes the document, identifying product specifications
  4. Structuring: Extracted data normalized into structured product records
  5. Cataloging: Products deduplicated and added to the catalog
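
The cataloging step can be illustrated with a fingerprint-based deduplication sketch. This is an assumption about how such a step might work, not the Extractor's actual logic: the fields that feed the hash, the normalization, and the function names are all hypothetical. The page only states that products are deduplicated and that catalog records carry a fingerprint.

```python
import hashlib

# Hypothetical choice of identifying fields; the real fingerprint inputs are not documented.
FINGERPRINT_KEYS = ("mill_name", "name", "gsm", "coating")


def fingerprint(product: dict) -> str:
    """Hash a few case- and whitespace-normalized fields into a stable hex digest."""
    parts = [str(product.get(key, "")).strip().lower() for key in FINGERPRINT_KEYS]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(products: list) -> list:
    """Keep the first record seen for each fingerprint, preserving input order."""
    seen = set()
    catalog = []
    for product in products:
        fp = fingerprint(product)
        if fp not in seen:
            seen.add(fp)
            catalog.append({**product, "fingerprint": fp})
    return catalog
```

A scheme of this kind would be consistent with the gap between raw extracted products and catalog products reported under Data Statistics, since the same product often appears in several datasheets.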

Data Statistics

Metric                            Count
Documents processed               8,919
Extracted products (raw)          16,791
Catalog products (deduplicated)   4,246
Mills identified                  440

Database Schema (SQLite)

The Extractor's SQLite database (paper_data.db) holds the catalog_products data that the marketplace sync reads from.

Integration with Marketplace

Sync Bridge (B2B-060)

The sync_extractor management command imports catalog_products from the Extractor's SQLite database into the marketplace's PostgreSQL Product model. Field mapping:

Extractor Field              Marketplace Field
name                         name
category                     paper_type (via mapping table)
gsm                          gsm
coating                      coating
color                        color
width_mm                     width_mm
height_mm                    length_mm
presentation                 form
certifications (comma-sep)   certifications (JSON array)
fiber_source                 fiber_type
quality_grade                quality
product_code                 product_code
brand                        brand
fingerprint                  extractor_fingerprint
mill_id + mill_name          mill (FK, fuzzy matched)
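
The direct renames in the table above can be sketched as a plain dictionary lookup. This is an illustration, not the sync_extractor command itself: map_row and FIELD_MAP are hypothetical names, and the category and mill columns are left out because they need the mapping table and the fuzzy mill match respectively.

```python
# Extractor column -> marketplace Product field, as listed in the table above.
FIELD_MAP = {
    "name": "name",
    "gsm": "gsm",
    "coating": "coating",
    "color": "color",
    "width_mm": "width_mm",
    "height_mm": "length_mm",
    "presentation": "form",
    "fiber_source": "fiber_type",
    "quality_grade": "quality",
    "product_code": "product_code",
    "brand": "brand",
    "fingerprint": "extractor_fingerprint",
}


def map_row(row: dict) -> dict:
    """Translate one Extractor catalog row into marketplace field names."""
    mapped = {dst: row.get(src) for src, dst in FIELD_MAP.items()}
    # The comma-separated string becomes a list, ready to store as a JSON array.
    raw_certs = row.get("certifications") or ""
    mapped["certifications"] = [c.strip() for c in raw_certs.split(",") if c.strip()]
    return mapped
```

Missing source columns come through as None, which leaves the decision about defaults to the caller.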

Live Integration (B2B-062)

When a mill uploads a new PDF datasheet through the marketplace UI, a Celery task sends it to the Extractor for processing. The task then polls the status endpoint every 3 seconds, for up to 120 seconds, until extraction completes, and stores the results as JSON on the DatasheetUpload record.
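
The polling behavior described above can be sketched as follows. The function names are hypothetical, and raising TimeoutError when the limit is reached is an assumption, since the page does not say what the task does on timeout. In the marketplace this loop would run inside the Celery task; here it is shown as a plain function with an injectable fetcher so it can be exercised without a running Extractor.

```python
import json
import time
import urllib.request

STATUS_URL = "http://localhost:8925/status/{job_id}"  # from the API section


def fetch_status(job_id: str) -> dict:
    """GET the status endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(STATUS_URL.format(job_id=job_id), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(job_id, fetch=fetch_status, interval=3.0, max_wait=120.0, sleep=time.sleep):
    """Poll until the job is 'done' or 'error', or until max_wait seconds were spent sleeping."""
    waited = 0.0
    while True:
        payload = fetch(job_id)
        if payload.get("status") in ("done", "error"):
            return payload
        if waited >= max_wait:
            # Assumed behavior: the source does not specify the timeout handling.
            raise TimeoutError(f"job {job_id} not finished after {max_wait:.0f}s")
        sleep(interval)
        waited += interval
```

The caller would then persist the returned payload, for example its products list, on the upload record.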

Category Mapping

The Extractor uses paper industry categories that map to the marketplace's PaperType enum:

Extractor Category           Marketplace paper_type
Kraftliner                   kraftliner
Testliner                    testliner
Fluting Medium               fluting
Coated Paper C2S             coated
Uncoated Woodfree            writing
Newsprint                    newsprint
Folding Boxboard (FBB)       board
Solid Bleached Board (SBS)   board
Kraft Paper                  kraft
Thermal Paper                thermal
NCR / Carbonless Paper       ncr
Greaseproof Paper            greaseproof
(others)                     other
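
The mapping above translates directly into a lookup with a fallback. The names CATEGORY_TO_PAPER_TYPE and to_paper_type are hypothetical, and matching on the exact category string after trimming whitespace is an assumption; the actual sync may normalize categories differently.

```python
# Extractor category -> marketplace paper_type value, as listed in the table above.
CATEGORY_TO_PAPER_TYPE = {
    "Kraftliner": "kraftliner",
    "Testliner": "testliner",
    "Fluting Medium": "fluting",
    "Coated Paper C2S": "coated",
    "Uncoated Woodfree": "writing",
    "Newsprint": "newsprint",
    "Folding Boxboard (FBB)": "board",
    "Solid Bleached Board (SBS)": "board",
    "Kraft Paper": "kraft",
    "Thermal Paper": "thermal",
    "NCR / Carbonless Paper": "ncr",
    "Greaseproof Paper": "greaseproof",
}


def to_paper_type(category):
    """Return the marketplace paper_type, falling back to 'other' for unknown or missing input."""
    return CATEGORY_TO_PAPER_TYPE.get((category or "").strip(), "other")
```

Both board grades collapse into the single board value, so the original category cannot be recovered from paper_type alone.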

Sources

Related