- type: concept
- created: 2026-04-07
- updated: 2026-04-07
- sources: raw/articles/PRD
- tags: ingestion, excel, pipeline, mills, celery, parsing
Excel Ingestion Pipeline
Overview
The Excel ingestion pipeline is the primary mechanism for getting surplus data into the marketplace. It embodies the "zero behavior change" principle: mills continue emailing Excel spreadsheets exactly as they already do with brokers. No new software, no login required, no format changes. The platform handles all parsing and structuring.
From the Madrid meeting: "We can ask each of them, send the lots, send the list monthly, or send it to us. Send the list the way you want. Just always send the same list."
Pipeline Architecture
The pipeline has seven stages:
1. Email Poller (Celery Beat, every 5 minutes)
A scheduled Celery task checks all active mill ingestion inboxes for new emails. Each mill has a unique ingestion email address ({mill-slug}@surplus.marketplace.com). The poller:
- Fetches new emails from each mill's inbox
- Verifies the sender is a registered mill contact (emails from unknown senders are flagged for admin review, not processed)
- Extracts attachments
Accepted file types: .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet), .xls (application/vnd.ms-excel), .csv (text/csv).
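The poller's triage step can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name, tuple shapes, and the idea of passing a set of registered contact addresses are assumptions; only the sender check, the flag-for-review behavior, and the accepted MIME types come from the description above.

```python
# MIME types the pipeline accepts, per the spec above.
ACCEPTED_TYPES = {
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
    "application/vnd.ms-excel",                                           # .xls
    "text/csv",                                                           # .csv
}

def triage_email(sender, attachments, registered_contacts):
    """Hypothetical triage for one incoming email.

    sender: the From address; attachments: (filename, content_type, payload)
    tuples; registered_contacts: lowercase addresses known for this mill.
    Returns ("flagged", []) for unknown senders (admin review, not processed),
    otherwise ("accepted", [(filename, payload), ...]) for accepted file types.
    """
    if sender.lower() not in registered_contacts:
        return "flagged", []
    files = [(name, payload) for name, ctype, payload in attachments
             if ctype in ACCEPTED_TYPES]
    return "accepted", files
```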
2. File Extractor
- Validates file format and size (max 10MB)
- Computes SHA-256 hash for deduplication
- If the hash matches a previously processed batch, the file is rejected with a notification (prevents re-processing of the same list)
- Stores the file on the server
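The size check and hash-based deduplication described above can be sketched like this; the function name and the in-memory `seen_hashes` set are illustrative (in practice the hash would be checked against previously processed batches in the database).

```python
import hashlib

def register_file(payload: bytes, seen_hashes: set,
                  max_bytes: int = 10 * 1024 * 1024):
    """Hypothetical sketch: validate size, hash, and dedupe one attachment.

    Returns a (status, sha256_hex) tuple; a duplicate hash is rejected,
    which upstream turns into a notification to the mill.
    """
    if len(payload) > max_bytes:                     # max 10MB, per the spec
        return "rejected_too_large", None
    digest = hashlib.sha256(payload).hexdigest()     # dedup key
    if digest in seen_hashes:
        return "rejected_duplicate", digest          # same list re-sent
    seen_hashes.add(digest)
    return "stored", digest                          # file persisted on server
```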
3. Parser Engine (Per-Mill Config)
Each mill has a ParserConfig that maps their specific Excel layout to the system's fields. This is necessary because every mill has a different format. The config specifies:
- header_row and data_start_row -- which row holds the column headers and which row the data begins on
- column_mapping -- which column maps to which field, with transforms
- lookup_tables -- for translating mill-specific terminology (e.g., "UWTTL" to "white_top_testliner", "KL" to "kraftliner")
- unit_conversions -- for handling different units (cm to mm, kg to MT, inches to mm)
- skip_rows_if -- conditions for ignoring rows (blank paper type, zero quantity)
- deduplication_key -- fields used to detect existing items (typically lot_reference + paper_type + gsm + width_mm)
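To make the shape concrete, here is what one mill's ParserConfig might look like. The keys mirror the fields listed above; the concrete columns, transform names, and values are invented for illustration and are not from the PRD.

```python
# Hypothetical ParserConfig for a single mill; structure follows the
# fields above, values are illustrative only.
EXAMPLE_PARSER_CONFIG = {
    "header_row": 2,
    "data_start_row": 3,
    "column_mapping": {
        "A": {"field": "lot_reference"},
        "B": {"field": "paper_type", "transform": "lookup:paper_type"},
        "C": {"field": "gsm", "transform": "int"},
        "D": {"field": "width_mm", "transform": "cm_to_mm"},
        "E": {"field": "quantity_mt", "transform": "float"},
        "F": {"field": "price_per_mt", "transform": "float"},
    },
    "lookup_tables": {
        "paper_type": {"UWTTL": "white_top_testliner", "KL": "kraftliner"},
    },
    "unit_conversions": {"cm_to_mm": 10, "kg_to_mt": 0.001},
    "skip_rows_if": ["blank:paper_type", "zero:quantity_mt"],
    "deduplication_key": ["lot_reference", "paper_type", "gsm", "width_mm"],
}
```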
The parser handles many column naming conventions across languages:
- GSM: "Grammage", "GSM", "g/m2", "Substance", "Basis Weight", "Gramaje" (Spanish), "Gewicht" (German)
- Width: "Width", "Roll Width", "Reel Width", "Ancho" (Spanish), "Breite" (German)
- Quantity: "Quantity", "Qty", "MT", "Tons", "Cantidad" (Spanish), "Menge" (German)
- Price: "Price", "Price/MT", "USD/MT", "EUR/MT", "Precio" (Spanish), "Preis" (German)
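The alias lists above can be encoded as a simple case-insensitive lookup. This is a sketch under the assumption that header matching is exact after lowercasing and trimming; the helper name and canonical field names are illustrative.

```python
# Raw-header aliases from the list above, lowercased for matching.
HEADER_ALIASES = {
    "gsm": {"grammage", "gsm", "g/m2", "substance",
            "basis weight", "gramaje", "gewicht"},
    "width_mm": {"width", "roll width", "reel width", "ancho", "breite"},
    "quantity_mt": {"quantity", "qty", "mt", "tons", "cantidad", "menge"},
    "price_per_mt": {"price", "price/mt", "usd/mt", "eur/mt",
                     "precio", "preis"},
}

def canonical_field(header: str):
    """Map a raw spreadsheet header to a canonical field name, or None."""
    h = header.strip().lower()
    for field, aliases in HEADER_ALIASES.items():
        if h in aliases:
            return field
    return None
```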
4. Data Validator
Each parsed row is validated:
| Check | Rule | Action on Failure |
|---|---|---|
| GSM range | 13 <= gsm <= 500 | Flag error, skip row |
| Width range | 100 <= width_mm <= 5000 | Flag error, skip row |
| Quantity | > 0 | Flag error, skip row |
| Price | > 0 | Flag error, skip row |
| Price sanity | Within 3x typical range for paper_type | Flag warning, include but mark |
| Paper type | Must resolve to known type | Flag error, skip row |
| Quality grade | Must be A, B, or C | Default to B, flag warning |
| Duplicate | lot_reference + paper_type + gsm + width | Update existing instead of creating new |
Failed items do not block successful items in the same batch. Each batch tracks items_found, items_created, items_updated, items_skipped, and errors (as JSONB).
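The table above can be sketched as a per-row validator. This is an illustrative reading of the rules, not the real code: the row-dict shape, the `typical_price` argument (standing in for the per-paper-type typical range), and the return convention are assumptions.

```python
def validate_row(row: dict, typical_price: float):
    """Apply the validation table to one parsed row (hypothetical sketch).

    Returns ("skip", [errors]) for hard failures, ("ok", [warnings]) otherwise;
    skipping a row never blocks the rest of the batch.
    """
    if not (13 <= row.get("gsm", 0) <= 500):
        return "skip", ["gsm_out_of_range"]
    if not (100 <= row.get("width_mm", 0) <= 5000):
        return "skip", ["width_out_of_range"]
    if row.get("quantity_mt", 0) <= 0:
        return "skip", ["non_positive_quantity"]
    price = row.get("price_per_mt", 0)
    if price <= 0:
        return "skip", ["non_positive_price"]
    if row.get("paper_type") is None:                 # unresolved paper type
        return "skip", ["unknown_paper_type"]
    warnings = []
    if price > 3 * typical_price:                     # sanity: include but mark
        warnings.append("price_above_typical_range")
    if row.get("quality_grade") not in ("A", "B", "C"):
        row["quality_grade"] = "B"                    # default to B, flag it
        warnings.append("quality_grade_defaulted")
    return "ok", warnings
```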
5. Review Queue (MVP)
In MVP, parsed data goes through admin review before going live. The admin sees:
- Batch summary: mill name, filename, items found, errors, warnings
- Each parsed item with its extracted specs
- Price warnings for items above typical range
- Error details for skipped rows
- Actions: Commit All, Commit Selected, Reject Batch
In V2+, trusted mills can use auto-commit, bypassing admin review.
6. Commit Engine
On commit:
- Creates new SurplusItem records (status: available) or updates existing ones (matched by deduplication key)
- Applies the mill's default visibility rules to new items
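The create-or-update behavior can be sketched as an upsert keyed by the deduplication fields. The dict-based index and parameter names are illustrative stand-ins for the real SurplusItem storage.

```python
def commit_item(parsed: dict, existing_index: dict, default_visibility):
    """Hypothetical upsert of one parsed item by its deduplication key.

    existing_index maps (lot_reference, paper_type, gsm, width_mm) tuples
    to stored records; default_visibility stands in for the mill's rules.
    """
    key = (parsed["lot_reference"], parsed["paper_type"],
           parsed["gsm"], parsed["width_mm"])
    if key in existing_index:
        existing_index[key].update(parsed)            # update, don't duplicate
        return "updated"
    record = dict(parsed, status="available",         # new item goes live as
                  visibility=default_visibility)      # available + defaults
    existing_index[key] = record
    return "created"
```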
7. Post-Commit
After commit:
- Runs the matching algorithm for all new/updated surplus items
- Sends a confirmation email to the mill: "We received your update -- X lots listed"
- Notifies admin of any parsing errors or warnings
- Queues newsletter generation for matched buyers
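The four post-commit steps can be sketched as a dispatcher that runs them in order. The callables are injected here purely for illustration; in the real system these would be Celery tasks, and their names are assumptions.

```python
def post_commit(batch: dict, notify, run_matching, queue_newsletters):
    """Hypothetical post-commit dispatch for one committed batch.

    batch carries the counters tracked per batch (items_created,
    items_updated, errors, ...); notify/run_matching/queue_newsletters
    stand in for the real task invocations.
    """
    run_matching(batch["item_ids"])                       # match new/updated items
    listed = batch["items_created"] + batch["items_updated"]
    notify("mill", f"We received your update -- {listed} lots listed")
    if batch["errors"]:                                   # surface parse issues
        notify("admin", f"{len(batch['errors'])} parsing issue(s) in batch")
    queue_newsletters(batch["item_ids"])                  # matched-buyer letters
```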
Batch Status State Machine
```
received -> parsing -> parsed -> validated -> committed
     |         |         |          |
     +---------+---------+----------+--> failed
```
The IngestionBatch entity tracks the full lifecycle including processing_time_ms for performance monitoring.
Future: AI-Assisted Parsing (V1.5+)
For new mills where no ParserConfig exists, an LLM-based approach will auto-detect column mappings by analyzing column headers and the first 5 data rows. The admin reviews and confirms the suggested mapping, which is saved as a new ParserConfig.
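One piece of this that can be sketched without committing to a model or prompt wording is assembling the analysis payload from the headers and sample rows. The function name, target field list, and text layout below are all assumptions; the PRD only specifies that headers plus the first 5 data rows are analyzed.

```python
def build_mapping_prompt(headers, sample_rows):
    """Hypothetical payload builder for LLM column auto-detection.

    Combines the column headers with up to 5 sample data rows, per the
    V1.5+ sketch above; the prompt wording is illustrative only.
    """
    lines = [
        "Map each column header to one of: lot_reference, paper_type, "
        "gsm, width_mm, quantity_mt, price_per_mt, quality_grade.",
        "Headers: " + ", ".join(headers),
    ]
    for i, row in enumerate(sample_rows[:5], 1):   # first 5 data rows only
        lines.append(f"Row {i}: " + ", ".join(str(v) for v in row))
    return "\n".join(lines)
```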
Sources
- raw/articles/PRD -- sections 6.1, 12.1-12.5, 5.10
Related
- wiki/concepts/matching-algorithm -- triggered after surplus items are committed
- wiki/concepts/geographic-visibility-system -- default rules applied on commit
- wiki/concepts/newsletter-generation -- matched buyers queued for newsletters
- wiki/concepts/spec-based-matching -- ingested items are matched against buyer specs