- type: concept
- created: 2026-04-07
- updated: 2026-04-07
- sources: raw/articles/PRD
- tags: ingestion, excel, pipeline, mills, celery, parsing
Excel Ingestion Pipeline
Overview
The Excel ingestion pipeline is the primary mechanism for getting surplus data into the marketplace. It embodies the "zero behavior change" principle: mills continue emailing Excel spreadsheets exactly as they already do with brokers. No new software, no login required, no format changes. The platform handles all parsing and structuring.
From the Madrid meeting: "We can ask each of them, send the lots, send the list monthly, or send it to us. Send the list the way you want. Just always send the same list."
Pipeline Architecture
The pipeline has seven stages:
1. Email Poller (Celery Beat, every 5 minutes)
A scheduled Celery task checks all active mill ingestion inboxes for new emails. Each mill has a unique ingestion email address ({mill-slug}@surplus.marketplace.com). The poller:
- Fetches new emails from each mill's inbox
- Verifies the sender is a registered mill contact (emails from unknown senders are flagged for admin review, not processed)
- Extracts attachments
Accepted file types: .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet), .xls (application/vnd.ms-excel), .csv (text/csv).
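The poller's triage step can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name, tuple shapes, and the idea of passing a set of registered contact addresses are assumptions; only the sender check, the flag-for-review behavior, and the accepted MIME types come from the description above.

```python
# MIME types the pipeline accepts, per the spec above.
ACCEPTED_TYPES = {
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
    "application/vnd.ms-excel",                                           # .xls
    "text/csv",                                                           # .csv
}

def triage_email(sender, attachments, registered_contacts):
    """Hypothetical triage for one incoming email.

    sender: the From address; attachments: (filename, content_type, payload)
    tuples; registered_contacts: lowercase addresses known for this mill.
    Returns ("flagged", []) for unknown senders (admin review, not processed),
    otherwise ("accepted", [(filename, payload), ...]) for accepted file types.
    """
    if sender.lower() not in registered_contacts:
        return "flagged", []
    files = [(name, payload) for name, ctype, payload in attachments
             if ctype in ACCEPTED_TYPES]
    return "accepted", files
```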
2. File Extractor
- Validates file format and size (max 10MB)
- Computes SHA-256 hash for deduplication
- If the hash matches a previously processed batch, the file is rejected with a notification (prevents re-processing of the same list)
- Stores the file on the server
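The size check and hash-based deduplication described above can be sketched like this; the function name and the in-memory `seen_hashes` set are illustrative (in practice the hash would be checked against previously processed batches in the database).

```python
import hashlib

def register_file(payload: bytes, seen_hashes: set,
                  max_bytes: int = 10 * 1024 * 1024):
    """Hypothetical sketch: validate size, hash, and dedupe one attachment.

    Returns a (status, sha256_hex) tuple; a duplicate hash is rejected,
    which upstream turns into a notification to the mill.
    """
    if len(payload) > max_bytes:                     # max 10MB, per the spec
        return "rejected_too_large", None
    digest = hashlib.sha256(payload).hexdigest()     # dedup key
    if digest in seen_hashes:
        return "rejected_duplicate", digest          # same list re-sent
    seen_hashes.add(digest)
    return "stored", digest                          # file persisted on server
```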
3. Parser Engine (Per-Mill Config)
Each mill has a ParserConfig that maps their specific Excel layout to the system's fields. This is necessary because every mill has a different format. The config specifies:
- header_row and data_start_row -- which row holds the column headers and which row the data begins on
- column_mapping -- which column maps to which field, with transforms
- lookup_tables -- for translating mill-specific terminology (e.g., "UWTTL" to "white_top_testliner", "KL" to "kraftliner")
- unit_conversions -- for handling different units (cm to mm, kg to MT, inches to mm)
- skip_rows_if -- conditions for ignoring rows (blank paper type, zero quantity)
- deduplication_key -- fields used to detect existing items (typically lot_reference + paper_type + gsm + width_mm)
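To make the shape concrete, here is what one mill's ParserConfig might look like. The keys mirror the fields listed above; the concrete columns, transform names, and values are invented for illustration and are not from the PRD.

```python
# Hypothetical ParserConfig for a single mill; structure follows the
# fields above, values are illustrative only.
EXAMPLE_PARSER_CONFIG = {
    "header_row": 2,
    "data_start_row": 3,
    "column_mapping": {
        "A": {"field": "lot_reference"},
        "B": {"field": "paper_type", "transform": "lookup:paper_type"},
        "C": {"field": "gsm", "transform": "int"},
        "D": {"field": "width_mm", "transform": "cm_to_mm"},
        "E": {"field": "quantity_mt", "transform": "float"},
        "F": {"field": "price_per_mt", "transform": "float"},
    },
    "lookup_tables": {
        "paper_type": {"UWTTL": "white_top_testliner", "KL": "kraftliner"},
    },
    "unit_conversions": {"cm_to_mm": 10, "kg_to_mt": 0.001},
    "skip_rows_if": ["blank:paper_type", "zero:quantity_mt"],
    "deduplication_key": ["lot_reference", "paper_type", "gsm", "width_mm"],
}
```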
The parser handles many column naming conventions across languages:
- GSM: "Grammage", "GSM", "g/m2", "Substance", "Basis Weight", "Gramaje" (Spanish), "Gewicht" (German)
- Width: "Width", "Roll Width", "Reel Width", "Ancho" (Spanish), "Breite" (German)
- Quantity: "Quantity", "Qty", "MT", "Tons", "Cantidad" (Spanish), "Menge" (German)
- Price: "Price", "Price/MT", "USD/MT", "EUR/MT", "Precio" (Spanish), "Preis" (German)
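The alias lists above can be encoded as a simple case-insensitive lookup. This is a sketch under the assumption that header matching is exact after lowercasing and trimming; the helper name and canonical field names are illustrative.

```python
# Raw-header aliases from the list above, lowercased for matching.
HEADER_ALIASES = {
    "gsm": {"grammage", "gsm", "g/m2", "substance",
            "basis weight", "gramaje", "gewicht"},
    "width_mm": {"width", "roll width", "reel width", "ancho", "breite"},
    "quantity_mt": {"quantity", "qty", "mt", "tons", "cantidad", "menge"},
    "price_per_mt": {"price", "price/mt", "usd/mt", "eur/mt",
                     "precio", "preis"},
}

def canonical_field(header: str):
    """Map a raw spreadsheet header to a canonical field name, or None."""
    h = header.strip().lower()
    for field, aliases in HEADER_ALIASES.items():
        if h in aliases:
            return field
    return None
```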
4. Data Validator
Each parsed row is validated:
| Check | Rule | Action on Failure |
|---|---|---|
| GSM range | 13 <= gsm <= 500 | Flag error, skip row |
| Width range | 100 <= width_mm <= 5000 | Flag error, skip row |
| Quantity | > 0 | Flag error, skip row |
| Price | > 0 | Flag error, skip row |
| Price sanity | Within 3x typical range for paper_type | Flag warning, include but mark |
| Paper type | Must resolve to known type | Flag error, skip row |
| Quality grade | Must be A, B, or C | Default to B, flag warning |
| Duplicate | lot_reference + paper_type + gsm + width | Update existing instead of creating new |
Failed items do not block successful items in the same batch. Each batch tracks items_found, items_created, items_updated, items_skipped, and errors (as JSONB).
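The table above can be sketched as a per-row validator. This is an illustrative reading of the rules, not the real code: the row-dict shape, the `typical_price` argument (standing in for the per-paper-type typical range), and the return convention are assumptions.

```python
def validate_row(row: dict, typical_price: float):
    """Apply the validation table to one parsed row (hypothetical sketch).

    Returns ("skip", [errors]) for hard failures, ("ok", [warnings]) otherwise;
    skipping a row never blocks the rest of the batch.
    """
    if not (13 <= row.get("gsm", 0) <= 500):
        return "skip", ["gsm_out_of_range"]
    if not (100 <= row.get("width_mm", 0) <= 5000):
        return "skip", ["width_out_of_range"]
    if row.get("quantity_mt", 0) <= 0:
        return "skip", ["non_positive_quantity"]
    price = row.get("price_per_mt", 0)
    if price <= 0:
        return "skip", ["non_positive_price"]
    if row.get("paper_type") is None:                 # unresolved paper type
        return "skip", ["unknown_paper_type"]
    warnings = []
    if price > 3 * typical_price:                     # sanity: include but mark
        warnings.append("price_above_typical_range")
    if row.get("quality_grade") not in ("A", "B", "C"):
        row["quality_grade"] = "B"                    # default to B, flag it
        warnings.append("quality_grade_defaulted")
    return "ok", warnings
```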
5. Review Queue (MVP)
In MVP, parsed data goes through admin review before going live. The admin sees:
- Batch summary: mill name, filename, items found, errors, warnings
- Each parsed item with its extracted specs
- Price warnings for items above typical range
- Error details for skipped rows
- Actions: Commit All, Commit Selected, Reject Batch
In V2+, trusted mills can use auto-commit, bypassing admin review.
6. Commit Engine
On commit:
- Creates new SurplusItem records (status: available) or updates existing ones (matched by deduplication key)
- Applies the mill's default visibility rules to new items
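The create-or-update behavior can be sketched as an upsert keyed by the deduplication fields. The dict-based index and parameter names are illustrative stand-ins for the real SurplusItem storage.

```python
def commit_item(parsed: dict, existing_index: dict, default_visibility):
    """Hypothetical upsert of one parsed item by its deduplication key.

    existing_index maps (lot_reference, paper_type, gsm, width_mm) tuples
    to stored records; default_visibility stands in for the mill's rules.
    """
    key = (parsed["lot_reference"], parsed["paper_type"],
           parsed["gsm"], parsed["width_mm"])
    if key in existing_index:
        existing_index[key].update(parsed)            # update, don't duplicate
        return "updated"
    record = dict(parsed, status="available",         # new item goes live as
                  visibility=default_visibility)      # available + defaults
    existing_index[key] = record
    return "created"
```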
7. Post-Commit
After commit:
- Runs the matching algorithm for all new/updated surplus items
- Sends a confirmation email to the mill: "We received your update -- X lots listed"
- Notifies admin of any parsing errors or warnings
- Queues newsletter generation for matched buyers
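The four post-commit steps can be sketched as a dispatcher that runs them in order. The callables are injected here purely for illustration; in the real system these would be Celery tasks, and their names are assumptions.

```python
def post_commit(batch: dict, notify, run_matching, queue_newsletters):
    """Hypothetical post-commit dispatch for one committed batch.

    batch carries the counters tracked per batch (items_created,
    items_updated, errors, ...); notify/run_matching/queue_newsletters
    stand in for the real task invocations.
    """
    run_matching(batch["item_ids"])                       # match new/updated items
    listed = batch["items_created"] + batch["items_updated"]
    notify("mill", f"We received your update -- {listed} lots listed")
    if batch["errors"]:                                   # surface parse issues
        notify("admin", f"{len(batch['errors'])} parsing issue(s) in batch")
    queue_newsletters(batch["item_ids"])                  # matched-buyer letters
```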
Batch Status State Machine
```
received -> parsing -> parsed -> validated -> committed
     |         |         |          |
     +---------+---------+----------+--> failed
```
The IngestionBatch entity tracks the full lifecycle including processing_time_ms for performance monitoring.
Future: AI-Assisted Parsing (V1.5+)
For new mills where no ParserConfig exists, an LLM-based approach will auto-detect column mappings by analyzing column headers and the first 5 data rows. The admin reviews and confirms the suggested mapping, which is saved as a new ParserConfig.
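One piece of this that can be sketched without committing to a model or prompt wording is assembling the analysis payload from the headers and sample rows. The function name, target field list, and text layout below are all assumptions; the PRD only specifies that headers plus the first 5 data rows are analyzed.

```python
def build_mapping_prompt(headers, sample_rows):
    """Hypothetical payload builder for LLM column auto-detection.

    Combines the column headers with up to 5 sample data rows, per the
    V1.5+ sketch above; the prompt wording is illustrative only.
    """
    lines = [
        "Map each column header to one of: lot_reference, paper_type, "
        "gsm, width_mm, quantity_mt, price_per_mt, quality_grade.",
        "Headers: " + ", ".join(headers),
    ]
    for i, row in enumerate(sample_rows[:5], 1):   # first 5 data rows only
        lines.append(f"Row {i}: " + ", ".join(str(v) for v in row))
    return "\n".join(lines)
```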
Sources
- raw/articles/PRD -- sections 6.1, 12.1-12.5, 5.10
Related
- wiki/concepts/matching-algorithm -- triggered after surplus items are committed
- wiki/concepts/geographic-visibility-system -- default rules applied on commit
- wiki/concepts/newsletter-generation -- matched buyers queued for newsletters
- wiki/concepts/spec-based-matching -- ingested items are matched against buyer specs