Introduction

Demonstrating e-commerce search requires a catalog with realistic titles, working images, usable categories and attributes, and sufficient product variety to observe the effect of query rewriting or re-ranking. In other words, the dataset needs to behave like a real catalog.

In practice, it can be surprisingly hard to find demo data that is (a) sufficiently large, (b) easy to ingest, and (c) safe to use in commercial settings. A common workaround is to use a smaller dataset, a synthetic catalog, or a source format that requires significant one-off parsing. That tends to make demos less convincing and reduces the likelihood that the same experience can be reproduced on real-world production catalogs.

For some query rewriting work I’m involved in, I needed an image-rich product catalog and a clean representation of that catalog that could be indexed and iterated on quickly. To support this, I built two tools that take data from open sources and output demo-ready NDJSON. One pipeline is based on Open Food Facts and produces over 100K usable grocery products; the other is based on Open Icecat and produces over 1 million usable computer/electronics products. These tools take data that often comes in awkward formats (nested JSON/XML, inconsistent field names, and image metadata that isn’t directly usable) and convert it into clean NDJSON documents with a consistent schema.

The harvesters

To produce demo datasets, I built two open-source transformation pipelines. These tools convert messy product records into clean, Elasticsearch-ready NDJSON.

NDJSON

Elasticsearch ingests JSON documents. NDJSON (newline-delimited JSON) is a file format with one JSON object per line, which makes it trivial to bulk-ingest. The output of these harvesters is a clean, stable schema that can be bulk-ingested directly into a search engine.
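
As a minimal sketch, this is what bulk ingestion of such a file can look like with the official Python Elasticsearch client; the file name, index name, and cluster URL are placeholders:

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

def actions(path, index):
    # One JSON object per line; each line becomes one bulk action.
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            yield {"_index": index, "_id": doc["id"], "_source": doc}

for ok, item in streaming_bulk(es, actions("products.ndjson", "products")):
    if not ok:
        print("failed:", item)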

Dataset #1: Open Food Facts as a demo catalog

Open Food Facts is not an “e-commerce” dataset in the Amazon sense. It is a product database built for transparency: ingredients, allergens, nutrition, and label-derived metadata. The reason it works well for demos is that it still behaves like a real product catalog: it has product names, categories, and images. Importantly, its data reuse posture is explicit and documented (ODbL for the database; CC BY-SA for images). That clarity matters when you intend to show a dataset to customers.

The raw Open Food Facts export is a large JSONL file with a lot of structure. The extractor repository turns it into clean NDJSON that is immediately indexable. It also computes image URLs based on the official image URL scheme, and it can apply quality gates such as “English titles/descriptions” and “must have a front image.”
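
For the front-image case, the computation looks roughly like the sketch below. This is a simplified illustration: the real extractor also handles other image types, languages, and edge cases, and the revision number (rev) has to be read from the product's image metadata.

def off_image_url(barcode: str, rev: int, size: int = 400) -> str:
    # Official scheme: barcodes longer than 8 digits are split into
    # 3/3/3/remainder path segments, e.g. 0008127000019 -> 000/812/700/0019.
    if len(barcode) > 8:
        folder = "/".join([barcode[0:3], barcode[3:6], barcode[6:9], barcode[9:]])
    else:
        folder = barcode
    return (
        "https://images.openfoodfacts.org/images/products/"
        f"{folder}/front_en.{rev}.{size}.jpg"
    )

# Reproduces the sample record below: off_image_url("0008127000019", rev=5)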

Open Food Facts inclusion criteria

After parsing, filtering, and cleaning over 4.2 million source records from Open Food Facts, the resulting dataset contains over 100K clean JSON objects. This reduction is expected: the extractor is intentionally strict because the goal is a catalog suitable for demos.

A record is included only if it meets all of the following:

  • English title and description
  • A usable front image
  • At least one meaningful category (placeholder/empty categories are excluded)

This is intentional: for demos, incomplete products (missing images or categories) are usually not useful.
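
A minimal sketch of that gate, assuming the field names of the output schema (the extractor's actual internals differ):

def is_demo_ready(product: dict) -> bool:
    # Gate 1: English title and description must both be present.
    has_text = bool(product.get("title")) and bool(product.get("description"))
    # Gate 2: a usable front image URL.
    has_image = bool(product.get("image_url"))
    # Gate 3: at least one non-empty, non-placeholder category.
    categories = [c for c in product.get("categories", []) if c and c.strip()]
    return has_text and has_image and len(categories) > 0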

The resulting data is ready to be indexed into a search engine like Elasticsearch or OpenSearch, and looks as follows:

{
  "id": "0008127000019",
  "title": "Extra virgin olive oil",
  "brand": "Athena Imports",
  "description": "Extra virgin olive oil. Extra virgin olive oil Key specifications: Category: Plant based foods and beverages; Serving size: 15 ml; Nutri-Score: B; NOVA group: 2; Eco-Score: E; Dietary restrictions: vegan, vegetarian; Ingredients analysis: palm-oil-free, vegan, vegetarian; Energy (kcal/100g): 800 kcal; Fat (g/100g): 93.3 g; Saturated fat (g/100g): 13.3 g; Sugars (g/100g): 0 g; Salt (g/100g): 0 g; Protein (g/100g): 0 g; Countries: United States",
  "image_url": "https://images.openfoodfacts.org/images/products/000/812/700/0019/front_en.5.400.jpg",
  "price": 2.49,
  "currency": "EUR",
  "categories": [
    "Plant based foods and beverages",
    "Plant based foods",
    "Fats"
  ],
  "attrs": {
    "Serving size": "15 ml",
    "Nutri-Score": "B",
    "NOVA group": "2",
    "Eco-Score": "E",
    "Ingredients analysis": "palm-oil-free, vegan, vegetarian",
    "Countries": "United States",
    "Category": "Plant based foods and beverages",
    "Energy (kcal/100g)": "800 kcal",
    "Fat (g/100g)": "93.3 g",
    "Saturated fat (g/100g)": "13.3 g",
    "Sugars (g/100g)": "0 g",
    "Salt (g/100g)": "0 g",
    "Protein (g/100g)": "0 g",
    "Dietary restrictions": "vegan, vegetarian",
    "Price source": "estimated_unit_model",
    "Pricing bucket": "oils_fats",
    "Estimated unit price": "11.59 EUR/l (15ml, bucket=oils_fats, scale=1.21, ratio=0.15)"
  },
  "attr_keys": [
    "Category",
    "Countries",
    "Dietary restrictions",
    "Eco-Score",
    "Energy (kcal/100g)",
    "Estimated unit price",
    "Fat (g/100g)",
    "Ingredients analysis",
    "NOVA group",
    "Nutri-Score",
    "Price source",
    "Pricing bucket",
    "Protein (g/100g)",
    "Salt (g/100g)",
    "Saturated fat (g/100g)",
    "Serving size",
    "Sugars (g/100g)"
  ],
  "dietary_restrictions": [
    "vegan",
    "vegetarian"
  ]
}

Note: attrs["Dietary restrictions"] reflects the raw, display-friendly value, while dietary_restrictions is a normalized array for efficient filtering/faceting.
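
The normalization itself is simple; a sketch (the extractor's exact cleaning rules may differ):

def normalize_terms(raw: str) -> list[str]:
    # "vegan, vegetarian" -> ["vegan", "vegetarian"]
    return [t.strip().lower() for t in raw.split(",") if t.strip()]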

Benefits

The key is that once the data is in this shape, you can iterate on search logic quickly. Query rewriting rules, category routing, facet behavior, synonyms, typo handling, attribute extraction — all of it becomes easier when the data is already clean.

How the data looks

Below is an example of how the cleaned Open Food Facts data looks in a simple e-commerce frontend.

[Demo screenshot]

Dataset #2: Icecat for electronics-style product catalogs

Open Food Facts is excellent for food/CPG. But some demos benefit from an electronics-style catalog with spec-rich attributes and product-type variety. That’s where Icecat is useful.

Icecat is typically consumed via XML interfaces and nested structures that cannot be indexed directly into Elasticsearch. The Icecat harvester repo is designed as a data transformation tool in which downloading and parsing are separate steps. That separation matters: you can download once, iterate on the schema transformation many times, and regenerate clean NDJSON without re-downloading everything.
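
In outline, the two phases look like the sketch below. The element and attribute names are illustrative only; the real Open Icecat feed is more deeply nested and localized.

import json
import xml.etree.ElementTree as ET
from pathlib import Path

RAW_DIR = Path("raw_xml")    # phase 1: populated once by the downloader
OUT = Path("icecat.ndjson")  # phase 2: regenerated as often as needed

def transform(xml_path: Path) -> dict | None:
    root = ET.parse(xml_path).getroot()
    prod = root if root.tag == "Product" else root.find(".//Product")
    if prod is None:
        return None
    doc = {
        "id": prod.get("ID"),
        "title": prod.get("Title"),
        "image_url": prod.get("HighPic"),
    }
    # Quality gate: no image or no title means no entry.
    if not doc["image_url"] or not doc["title"]:
        return None
    return doc

with OUT.open("w", encoding="utf-8") as out:
    for xml_path in sorted(RAW_DIR.glob("*.xml")):
        doc = transform(xml_path)
        if doc:
            out.write(json.dumps(doc, ensure_ascii=False) + "\n")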

Icecat inclusion criteria (why 25M → 3.5M → 1M)

The raw Icecat index spans more than 25 million data sheets in the global catalog. However, I have targeted a subset of the open index covering only “interesting” categories of products (the desired categories are easily configurable). This subset contains approximately 3.5M data sheets. After further filtering and processing, I end up with about 1M demo-quality products.

The final resulting demo dataset is significantly smaller than the original 25M for the following reasons:

  1. Open vs. full tier: We specifically target the “Open Icecat” portion of the catalog. While the “Full Icecat” database includes over 28,000 brands, only a subset (the “sponsoring brands” like HP, Lenovo, and Samsung) make their content available via the Open Icecat tier.

  2. Regional/category filtering: We use a targets.txt file to focus only on high-utility categories (like Laptops and Smartphones). This skips the millions of records in low-signal categories (e.g., spare parts, cables) that typically clutter a demo search experience.

  3. The demo-ready quality gate: Our script applies a strict filter: No Image = No Entry. A product without a visual asset is a dead-end in a demo UI. By requiring at least one high-resolution image URL and a valid title, we prune the “metadata-only” records that make up a large portion of the raw feed.

  4. Deduplication: Icecat often provides separate XML files for the same product to handle different languages or minor regional packaging variants. Our pipeline deduplicates these by Product ID, ensuring that the search index contains one canonical record per item rather than many near-identical variants (see the sketch below).
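
A sketch of that pass, assuming the records arrive as an iterable of parsed documents:

def dedupe_by_id(docs):
    # Keep the first record seen for each product ID; later language or
    # regional variants of the same product are dropped.
    seen = set()
    for doc in docs:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            yield doc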

After parsing, filtering, and cleaning over 3.5 million source records, the resulting dataset contains over a million clean JSON objects that are ready to be indexed into a search engine like Elasticsearch or OpenSearch. A representative record looks as follows:

{
  "id": "91778569",
  "title": "Lenovo Legion 5 15ARH05H AMD Ryzen™ 7 4800H Laptop...",
  "brand": "Lenovo",
  "description": "Minimal meets mighty... Thermally tuned via Legion Coldfront 2.0.",
  "price": 865.33,
  "currency": "EUR",
  "image_url": "https://images.icecat.biz/img/gallery_mediums/79117985_5269963235.jpg",
  "categories": ["Laptops"],
  "attrs": {
    "Processor family": "AMD Ryzen™ 7",
    "Internal memory": "16 GB",
    "Weight": "2.46 kg"
  },
  "attr_keys": ["Processor family", "Internal memory", "Weight"]
}

How the data looks

Below is an example of how the cleaned Icecat data looks in a simple e-commerce frontend.

[Demo screenshot]

One schema, two sources

The goal is that a loader or indexing pipeline can ingest both Icecat and Open Food Facts with the same code path. That’s why both repositories converge on a similar NDJSON structure:

Field       | Type   | Icecat logic (electronics)                               | Open Food Facts logic (grocery)
id          | string | Unique Icecat product ID                                 | Padded GTIN-13 barcode
title       | string | Full marketing title                                     | English product name
brand       | string | Manufacturer (e.g., Apple, Lenovo)                       | Brand/producer name
description | string | Synthesis: marketing text + key technical specifications | Synthesis: ingredients + key nutritional metadata
price       | float  | Heuristic: category baseline modified by brand premium   | Estimated: unit pricing model based on category & weight
currency    | string | Fixed (EUR)                                              | Fixed (EUR)
image_url   | string | High quality: best available primary product photo      | Computed: URL derived from product code and image metadata
categories  | list   | Single-item list (primary Icecat category)               | Hierarchical list (from broad to specific)
attrs       | object | Flattened: technical specs (e.g., "RAM": "16GB")         | Flattened: nutrition/label data (e.g., "Nutri-Score": "A")
attr_keys   | list   | List of keys in attrs for dynamic faceting               | List of keys in attrs for dynamic faceting

By converging on a single schema contract, the ingestion pipeline and demo UI can remain stable across both datasets. Whether you are indexing 100K olive oils or 1M laptops, the same configuration applies:

  • Consistent faceting: The attrs object is a flat dictionary. In Elasticsearch, this is typically mapped as a flattened field to support dynamic faceting without a mapping explosion (see the mapping sketch after this list).

  • Searchable specs: High-value technical data is injected into the description field. This ensures that a user searching for “Ryzen 7” or “Palm-oil free” finds the product via full-text search even if those specific attributes aren’t explicitly boosted.

  • Visual reliability: Both pipelines discard any record missing a valid image_url. This reduces the chances of your demo showing a “broken image” icon.
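
For concreteness, here is a minimal mapping sketch using the Python Elasticsearch client (8.x-style API); the index name and per-field choices are placeholders rather than a recommendation:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products",
    mappings={
        "properties": {
            "title":       {"type": "text"},
            "description": {"type": "text"},
            "brand":       {"type": "keyword"},
            "categories":  {"type": "keyword"},
            "price":       {"type": "float"},
            "currency":    {"type": "keyword"},
            "image_url":   {"type": "keyword", "index": False},
            # One 'flattened' field covers all spec keys without adding
            # a mapping entry per attribute.
            "attrs":       {"type": "flattened"},
            "attr_keys":   {"type": "keyword"},
        }
    },
)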

Where WANDS fits: evaluation, not demos

It’s worth calling out one dataset that I do consider extremely valuable: WANDS (Wayfair ANnotation Dataset).

WANDS includes query-product relevance judgments. That makes it excellent for benchmarking relevance changes and checking whether something you did actually improved ranking quality. The dataset is also explicitly MIT licensed.

However, WANDS is comparatively small, and it is not ideal as a primary “demo catalog.” The biggest practical issue for demos is that a demo UI benefits enormously from images, and WANDS is not structured as an image-forward catalog. For my purposes, WANDS is something I want in the toolbox for evaluation, while the demo catalog comes from OFF and Icecat.

In other words: WANDS allows you to quantitatively evaluate the quality of search results, while Open Food Facts/Icecat data allows you to create realistic demos and to evaluate results qualitatively.
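
To make "quantitatively" concrete, here is a small nDCG@k sketch over judged results. The gain mapping for WANDS's judgment labels is my assumption, not part of the dataset specification.

import math

# Assumed gain values for WANDS's three judgment levels.
GAIN = {"Exact": 2, "Partial": 1, "Irrelevant": 0}

def ndcg_at_k(ranked_labels: list[str], k: int = 10) -> float:
    # ranked_labels: judgment labels of the returned products, in rank order.
    def dcg(labels):
        return sum(GAIN[l] / math.log2(i + 2) for i, l in enumerate(labels[:k]))
    idcg = dcg(sorted(ranked_labels, key=GAIN.get, reverse=True))
    return dcg(ranked_labels) / idcg if idcg else 0.0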

Other datasets we considered (and why we did not choose them)

There are many attractive datasets in the research ecosystem, but many of them come with constraints that make them awkward for customer-facing demos or reusable internal assets.

Amazon-derived datasets (UCSD / McAuley Lab)

The UCSD / McAuley Lab Amazon datasets are impressive: reviews, metadata, sometimes images, large scale. For demos, they look great on paper.

The problem is not technical quality. The problem is posture: the underlying content originates from Amazon, and many releases are framed as academic research resources. In addition, the Amazon Reviews 2023 ecosystem is frequently referenced with non-commercial research restrictions in some derivative distributions.

If you are building a reusable demo asset, there may be legal obstacles to using this data.

SIGIR eCom / Coveo data challenge datasets

These datasets are excellent for research, especially session-based behavior (queries, clicks, add-to-cart, etc.). They are also often explicitly described as being made available for research purposes, with access gated by terms.

If you’re doing academic work or internal R&D, they can be great. If you’re building a demo catalog that you want to use broadly in customer conversations, the terms can complicate things.

Kaggle competition datasets (H&M example)

Kaggle competitions are a common source of “easy to download” datasets that look demo-friendly. The issue is that many competitions explicitly restrict use to non-commercial purposes.

Again: excellent for learning, but not the cleanest foundation for customer-facing demos.

TREC Product Search and “Amazon thumbnail” corpora

Some product search datasets include images and evaluation infrastructure, but the provenance matters. For example, the TREC 2023 Product Search track overview explicitly discusses product images extracted from Amazon thumbnails and joined using ASINs.

This may be perfectly fine for research benchmarking, but it reintroduces the same “marketplace-derived content” concern when you want a demo catalog with a clean commercial posture.

Conclusion

If you want to build and demo e-commerce search, the blocker is often the dataset. Open Food Facts and Icecat are two sources that (a) contain the kinds of fields demos need, including images and metadata, and (b) have licensing frameworks that are clear enough to build on without feeling like you’re stepping into a gray area. The real work — and the real value — is in turning raw, awkward source formats into clean, stable NDJSON that is easy to index, easy to query, and easy to use in demos. Not more, not less.

If you’re doing relevance evaluation, WANDS is still in the picture. It’s a different tool for a different job. But for demo catalogs that look and feel real, Icecat and Open Food Facts are the two foundations I’m using today.