type
concept
created
2026-04-07T02:00:00+02:00
updated
2026-04-07T02:00:00+02:00
sources
raw/articles/PRD
tags
ingestion excel pipeline mills celery parsing

Excel Ingestion Pipeline

abstract
Mills email surplus Excel spreadsheets to dedicated addresses; a Celery-based pipeline polls the inboxes every 5 minutes, applies per-mill parser configurations, validates and deduplicates the data, then commits items after admin review.

Overview

The Excel ingestion pipeline is the primary mechanism for getting surplus data into the marketplace. It embodies the "zero behavior change" principle: mills continue emailing Excel spreadsheets exactly as they already do with brokers. No new software, no login required, no format changes. The platform handles all parsing and structuring.

From the Madrid meeting: "We can ask each of them, send the lots, send the list monthly, or send it to us. Send the list the way you want. Just always send the same list."

Pipeline Architecture

The pipeline has seven stages:

1. Email Poller (Celery Beat, every 5 minutes)

A scheduled Celery task checks all active mill ingestion inboxes for new emails. Each mill has a unique ingestion email address ({mill-slug}@surplus.marketplace.com). The poller fetches unseen messages and hands any accepted attachments to the file extractor.

Accepted file types: .xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet), .xls (application/vnd.ms-excel), .csv (text/csv).
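The accepted-type check can be sketched as a small filter the poller runs on each attachment. The names below (ACCEPTED_TYPES, is_ingestible) are illustrative, not from the spec; only the extension/MIME pairs come from the list above.

```python
from pathlib import Path

# Extension -> MIME type pairs from the accepted file types above.
ACCEPTED_TYPES = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
    ".csv": "text/csv",
}

def is_ingestible(filename: str, content_type: str) -> bool:
    """Accept an attachment only if extension and declared MIME type agree."""
    ext = Path(filename).suffix.lower()
    return ACCEPTED_TYPES.get(ext) == content_type
```

Requiring both the extension and the declared MIME type to match is one way to reject mislabeled attachments early, before the file extractor runs.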

2. File Extractor

3. Parser Engine (Per-Mill Config)

Each mill has a ParserConfig that maps their specific Excel layout to the system's fields. This is necessary because every mill has a different format. The config specifies:

The parser handles many column naming conventions across languages:
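One plausible shape for this mapping is a per-field synonym table that resolves a mill's raw headers to canonical field names. The synonym sets and field names below are invented examples; real ParserConfigs are built per mill.

```python
# Hypothetical synonym table: canonical field -> header spellings seen
# across languages (Spanish, German, French examples are illustrative).
COLUMN_SYNONYMS = {
    "gsm": {"gsm", "grammage", "gramaje", "grammatur"},
    "width_mm": {"width", "ancho", "breite", "largeur"},
    "quantity_kg": {"qty", "kg", "cantidad", "menge"},
    "price": {"price", "precio", "preis", "prix"},
    "paper_type": {"type", "grade", "tipo", "sorte"},
}

def map_columns(headers: list[str]) -> dict[str, str]:
    """Resolve raw spreadsheet headers to canonical field names."""
    mapping = {}
    for raw in headers:
        key = raw.strip().lower()
        for field, synonyms in COLUMN_SYNONYMS.items():
            if key in synonyms:
                mapping[raw] = field
                break
    return mapping
```

Headers with no match are simply left out of the mapping, which an admin can then resolve when setting up the mill's config.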

4. Data Validator

Each parsed row is validated:

Check | Rule | Action on failure
GSM range | 13 <= gsm <= 500 | Flag error, skip row
Width range | 100 <= width_mm <= 5000 | Flag error, skip row
Quantity | > 0 | Flag error, skip row
Price | > 0 | Flag error, skip row
Price sanity | Within 3x typical range for paper_type | Flag warning, include but mark
Paper type | Must resolve to known type | Flag error, skip row
Quality grade | Must be A, B, or C | Default to B, flag warning
Duplicate | lot_reference + paper_type + gsm + width matches an existing item | Update existing instead of creating new

Failed items do not block successful items in the same batch. Each batch tracks items_found, items_created, items_updated, items_skipped, and errors (as JSONB).
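The per-row rules above can be sketched as a single validator that returns errors (row skipped) separately from warnings (row kept but marked). The KNOWN_TYPES and TYPICAL_PRICE tables are invented placeholders; only the thresholds come from the rules table.

```python
# Illustrative lookup tables -- real values live in the platform's data.
KNOWN_TYPES = {"kraftliner", "testliner", "fluting"}
TYPICAL_PRICE = {"kraftliner": 600, "testliner": 450, "fluting": 400}  # assumed EUR/tonne

def validate_row(row: dict) -> tuple[bool, list[str], list[str]]:
    """Return (ok, errors, warnings); errors skip the row, warnings keep it."""
    errors, warnings = [], []
    if not (13 <= row.get("gsm", 0) <= 500):
        errors.append("gsm out of range")
    if not (100 <= row.get("width_mm", 0) <= 5000):
        errors.append("width_mm out of range")
    if row.get("quantity", 0) <= 0:
        errors.append("quantity must be > 0")
    if row.get("price", 0) <= 0:
        errors.append("price must be > 0")
    ptype = row.get("paper_type")
    if ptype not in KNOWN_TYPES:
        errors.append("unknown paper_type")
    elif row.get("price", 0) > 3 * TYPICAL_PRICE[ptype]:
        # Upper-bound check only, as a sketch of the 3x sanity rule.
        warnings.append("price outside 3x typical range")
    if row.get("quality_grade") not in {"A", "B", "C"}:
        row["quality_grade"] = "B"  # default per the rules table
        warnings.append("quality_grade defaulted to B")
    return (not errors, errors, warnings)
```

A batch loop would then increment items_skipped for rows whose validator returns errors, while warning-only rows still count toward items_created or items_updated.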

5. Review Queue (MVP)

In MVP, parsed data goes through admin review before going live. The admin sees:

In V2+, trusted mills can use auto-commit, bypassing admin review.

6. Commit Engine

On commit:

7. Post-Commit

After commit:

Batch Status State Machine

received -> parsing -> parsed -> validated -> committed
                                    |
                                  failed

The IngestionBatch entity tracks the full lifecycle including processing_time_ms for performance monitoring.
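The state machine can be expressed as a transition table with a guarded setter. This sketch assumes any in-flight (pre-commit) state may move to failed, not only the state shown above the branch.

```python
# Allowed batch status transitions; terminal states have no successors.
TRANSITIONS = {
    "received": {"parsing", "failed"},
    "parsing": {"parsed", "failed"},
    "parsed": {"validated", "failed"},
    "validated": {"committed", "failed"},
    "committed": set(),
    "failed": set(),
}

def advance(current: str, new: str) -> str:
    """Move a batch to a new status, rejecting illegal transitions."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Guarding transitions this way keeps a committed or failed batch from being reprocessed by a retried Celery task.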

Future: AI-Assisted Parsing (V1.5+)

For new mills where no ParserConfig exists, an LLM-based approach will auto-detect column mappings by analyzing column headers and the first 5 data rows. The admin reviews and confirms the suggested mapping, which is saved as a new ParserConfig.
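A minimal sketch of how the mapping request might be assembled, assuming the LLM receives the headers and the first 5 data rows and replies with a JSON header-to-field mapping. The function name and prompt wording are hypothetical.

```python
import json

def build_mapping_prompt(headers: list[str], rows: list[list], fields: list[str]) -> str:
    """Assemble an LLM prompt from column headers and sample data rows."""
    sample = rows[:5]  # only the first 5 data rows, per the planned design
    return (
        "Map each spreadsheet column to one of these canonical fields "
        f"(or null if none fits): {fields}.\n"
        f"Headers: {json.dumps(headers)}\n"
        f"Sample rows: {json.dumps(sample)}\n"
        'Reply as JSON: {"header": "field"}.'
    )
```

The model's suggested mapping would then be shown to the admin, and the confirmed version persisted as the mill's ParserConfig.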

Sources

Related