A Data Lake for
E-commerce Product Intelligence

Production-grade product datasets delivered as partitioned Apache Parquet in a cloud-native lake architecture. Built for analytics teams, consultants, and builders who need repeatable, query-ready data.

🎯 Lead magnet: guide + updates

Get our competitive analysis guide and product updates (separate from the 10K Parquet sample).

No spam. Unsubscribe anytime. 2,847 downloads this month.

Live pulse

Rollups from the published catalog index in D1 (refreshes as you ingest new partitions).

Total rows indexed

-

Partition categories

-

Retailer footprint (estimate)

-

Data lake architecture

Partitioned Parquet in R2, served by an API layer

Your deliverable is not "a CSV download." It's a lake-style layout you can ingest into DuckDB, Polars, Spark, BigQuery, Snowflake, or any Parquet-native workflow.

Partitioning

data/year=YYYY/quarter=Q#/month=M/category=Category_Slug/
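As a rough illustration of the layout above, the sketch below builds and parses partition prefixes. It assumes the quarter is written as Q1 to Q4 and derived from the month, and that the month is not zero-padded; neither detail is confirmed here, and the function names are hypothetical.

```python
import re

# Layout from the page: data/year=YYYY/quarter=Q#/month=M/category=Category_Slug/
# Assumption: quarter is Q1..Q4 and month is an unpadded 1..12.
PARTITION_RE = re.compile(
    r"data/year=(?P<year>\d{4})/quarter=Q(?P<quarter>[1-4])"
    r"/month=(?P<month>\d{1,2})/category=(?P<category>[^/]+)/"
)


def partition_prefix(year: int, month: int, category_slug: str) -> str:
    """Build the key prefix for one partition; the quarter is derived from the month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    quarter = (month - 1) // 3 + 1
    return (
        f"data/year={year}/quarter=Q{quarter}"
        f"/month={month}/category={category_slug}/"
    )


def parse_partition(key: str) -> dict:
    """Extract the partition values from an object key that starts with a prefix."""
    match = PARTITION_RE.match(key)
    if match is None:
        raise ValueError(f"not a partitioned key: {key}")
    return {
        "year": int(match["year"]),
        "quarter": int(match["quarter"]),
        "month": int(match["month"]),
        "category": match["category"],
    }
```

For example, `partition_prefix(2026, 4, "Electronics")` yields `data/year=2026/quarter=Q2/month=4/category=Electronics/`, and `parse_partition` recovers the same values from any key under that prefix.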

Manifests + catalog

Each dataset includes metadata (row counts, schema version, checksums) for repeatable ingestion.
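A minimal sketch of how checksum metadata can make ingestion repeatable. The manifest layout used here (a JSON file with a `files` list of `path` and `sha256` entries) and the choice of SHA-256 are assumptions for illustration, not the documented manifest format.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large Parquet files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> list:
    """Return a list of problems; an empty list means every listed file matched.

    Hypothetical manifest shape:
    {"schema_version": "1", "files": [{"path": "...", "sha256": "..."}]}
    """
    manifest = json.loads(manifest_path.read_text())
    problems = []
    for entry in manifest["files"]:
        file_path = manifest_path.parent / entry["path"]
        if not file_path.exists():
            problems.append(f"missing: {entry['path']}")
        elif sha256_of(file_path) != entry["sha256"]:
            problems.append(f"checksum mismatch: {entry['path']}")
    return problems
```

Running a check like this before each load lets a pipeline refuse partial or corrupted downloads instead of ingesting them silently.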

Secure access

Customers authenticate, entitlement is checked, then Parquet is streamed from R2.

What you get

R2 Data Lake
Partitioned Parquet objects
data/year=2026/…/category=Electronics/*.parquet
↓
API Layer (Worker)
Catalog + auth + downloads
/api/catalog  /api/me  /api/download
↓
Your Analytics
DuckDB / Polars / Spark
read_parquet("*.parquet")
Start with the free Parquet sample, then purchase entitlements for full category drops.

Designed for repeatability: stable schemas, predictable keys, and an API-backed catalog.
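To make the API-backed flow concrete, here is a small sketch that prepares (but does not send) a request to the `/api/download` route named above. The bearer-token header and the `key` query parameter are assumptions made for illustration; the real request shape may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_download_request(api_base: str, token: str, object_key: str) -> Request:
    """Prepare a GET request for one Parquet object.

    Assumptions (not confirmed by the page): the Worker accepts a bearer
    token and takes the object key in a `key` query parameter.
    """
    query = urlencode({"key": object_key})
    return Request(
        f"{api_base.rstrip('/')}/api/download?{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

The returned `Request` can be passed to `urllib.request.urlopen` once a real base URL and session token are available.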

Production format

10,000-row Apache Parquet sample

The same column layout and file format as commercial drops: partitioned product-link datasets delivered as Parquet for analytics pipelines (Spark, DuckDB, Polars, BigQuery load, and more).

  • Exactly 10,000 rows in one .parquet file
  • Schema aligned with monthly year=/quarter=/month=/category= releases
  • Marketing use only; see license note on download

Typical columns

Exact fields vary slightly by retailer feed; exports include a manifest.

  • product_url
  • site_domain
  • title
  • brand
  • price
  • currency
  • availability
  • sku
  • breadcrumbs
  • primary_image
  • images
  • description
  • extraction_timestamp
  • confidence
  • http_status
  • quality_score
  • kde_sector
  • kde_subcategory
  • kde_score
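Because exact fields vary by retailer feed, it can help to compare a file's column names against the typical list before loading. The sketch below does only that comparison; how you read the column names (from the manifest or from the Parquet footer) is left open, and the function name is hypothetical.

```python
# Column names copied from the "Typical columns" list on this page.
TYPICAL_COLUMNS = (
    "product_url", "site_domain", "title", "brand", "price", "currency",
    "availability", "sku", "breadcrumbs", "primary_image", "images",
    "description", "extraction_timestamp", "confidence", "http_status",
    "quality_score", "kde_sector", "kde_subcategory", "kde_score",
)


def compare_columns(actual):
    """Return (missing, extra) relative to the typical column list, order preserved."""
    actual_set = set(actual)
    missing = [name for name in TYPICAL_COLUMNS if name not in actual_set]
    extra = [name for name in actual if name not in TYPICAL_COLUMNS]
    return missing, extra
```

A pipeline might log `extra` columns and fail only when a column it actually depends on shows up in `missing`.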

Host the file on your Worker (/public/samples/kodira_sample_10k.parquet) or set window.KODIRA_CONFIG.apiBase in config.js so this button resolves.

Trusted by researchers and businesses

500K+
Products Tracked
50+
Major Retailers
98.7%
Data Accuracy

See the Data Quality

Below is a tiny HTML preview (five rows). Your real deliverable is Parquet at scale, so start with the 10,000-row sample file. All data is collected from publicly visible product pages.

Representative rows (5 of many millions in production)

Format sold: Parquet
Free structured sample: 10K rows

product_url | site_domain | title | brand | price | currency | availability | category | extraction_timestamp
…/guess-laptop-bag/ | 99percents.com | Guess Laptop Bag | - | 50000.00 | NGN | out_of_stock | Bags_Travel | 2026-04-09T…Z
…/once-upon-a-wrinkle | 100percentpure.com | Once Upon a Wrinkle | - | 34.00 | - | unknown | Footwear | 2026-04-09T…Z
…/turntable-cartridges… | 2001audiovideo.com | Turntable Cartridges and Stylus | - | 249.00 | - | unknown | Gaming_Peripherals | 2026-04-07T…Z

Download 10K Parquet sample
The Parquet sample is the best representation of production files.

What's In Our Product Datasets

📱

Product Information

Names, descriptions, SKUs, categories, specifications, and detailed product attributes

💰

Pricing Data

Current prices, sale prices, discounts, price history, and competitive pricing intelligence

⭐

Reviews & Ratings

Star ratings, review counts, review sentiment, and customer feedback analysis

📦

Availability Status

Stock levels, availability indicators, shipping information, and inventory tracking

🏪

Retailer Information

Store names, URLs, seller details, and marketplace identifiers for complete attribution

📊

Market Analytics

Trend data, category insights, seasonal patterns, and competitive positioning metrics

How Businesses Use Our Product Data

🔍

Competitive Price Analysis

Track competitor pricing strategies, identify pricing gaps, and optimize your own pricing for maximum profitability.

Saves $10,000+ in consultant fees

📈

Market Research

Analyze product trends, seasonal patterns, and consumer preferences to inform product development and marketing strategies.

Replaces $25,000+ research reports
🎯

Product Sourcing

Discover new products, identify popular items, and find profitable niches in various retail categories.

Identifies 6-figure opportunities

📊

Brand Monitoring

Track how your brand and products are positioned across different retailers and marketplaces.

Prevents costly positioning mistakes

Plans & catalogue pricing

Subscriptions unlock the live catalog and ongoing partitions in R2. One-time purchases buy a specific Parquet snapshot. Configure Stripe Price IDs on the Worker (STRIPE_PRICE_SUB_*) after you run the D1 migration for tiers.

Subscriptions (recurring)

Niche access

$199/mo
1 vertical (one partition category)
  • Monthly access to latest partitions for your category
  • Apache Parquet lake layout
  • Row pool for capped exports (see Worker webhook defaults)
Brand managers & niche sellers

Direct lake access

$1,250/mo
Full bucket catalog (millions of rows)
  • All active partition categories
  • Highest row pool for heavy exports
  • Built for hedge funds, AI labs, and data platforms
Hedge funds & AI developers

Catalogue (one-time)

Product | Rate | Example
April baseline snapshot | $0.85 / 1,000 rows | 100k rows → $85.00
May signal data (with % changes) | $3.50 / 1,000 rows | 100k rows → $350.00

Row counts and data_tier (baseline vs signal) live on each dataset_catalog row; checkout uses quoted cents automatically.
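The per-row rates above reduce to simple arithmetic. The sketch below shows one way to compute a quote in whole cents from a row count and a rate per 1,000 rows; the function name and the half-up rounding are assumptions, not the checkout's actual logic.

```python
from decimal import Decimal, ROUND_HALF_UP


def quote_cents(row_count: int, usd_per_thousand_rows: str) -> int:
    """Price a one-time purchase in cents.

    The rate is passed as a string so Decimal keeps it exact.
    Assumption: fractional cents round half up.
    """
    rate = Decimal(usd_per_thousand_rows)
    cents = Decimal(row_count) * rate * 100 / 1000
    return int(cents.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

With the rates listed, `quote_cents(100_000, "0.85")` gives 8500 cents ($85.00) and `quote_cents(100_000, "3.50")` gives 35000 cents ($350.00), matching the examples in the table.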

Full catalogue buyout

$1,500
One-time: April baseline across all verticals in the catalog snapshot
  • Single payment for a full-catalog entitlement flag in D1
  • Downloads use a single-use 15-minute Worker link per file (see account page)
Customer flow: Splash → blurred public preview (domains only) → Supabase sign-in → Stripe checkout → Worker validates session + entitlements before streaming Parquet.
Security: R2 stays private; only the Worker streams objects. Throttle abusive preview IPs in Cloudflare WAF; catalog previews never return raw SKUs or exact prices.
Enterprise: Need custom SLAs, VPC egress, or private connectors? Contact sales.
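The single-use 15-minute download links mentioned above could be modeled roughly as follows. This is a sketch under assumptions: the HMAC-signed token format, the nonce, and the in-memory set of redeemed tokens are illustrative stand-ins, and a real Worker would need a durable store to enforce single use.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

LINK_TTL_SECONDS = 15 * 60  # 15-minute lifetime, as stated on the page


class DownloadLinks:
    """Issue and redeem expiring, single-use tokens (hypothetical design)."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._used = set()  # stand-in for durable storage

    def _sign(self, payload: str) -> str:
        return hmac.new(self._secret, payload.encode(), hashlib.sha256).hexdigest()

    def issue(self, object_key: str, now: Optional[float] = None) -> str:
        issued_at = time.time() if now is None else now
        expires = int(issued_at + LINK_TTL_SECONDS)
        nonce = secrets.token_hex(8)  # makes each token unique
        payload = f"{object_key}|{expires}|{nonce}"
        return f"{payload}|{self._sign(payload)}"

    def redeem(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Return the object key once; None if forged, expired, or already used."""
        payload, _, signature = token.rpartition("|")
        if not hmac.compare_digest(signature, self._sign(payload)):
            return None
        object_key, expires, _nonce = payload.rsplit("|", 2)
        current = time.time() if now is None else now
        if current > int(expires) or token in self._used:
            return None
        self._used.add(token)
        return object_key
```

Signing the expiry into the token means a link cannot be extended by editing it, while the redeemed set is what makes each link single-use.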

How We Collect Product Data

01

Ethical Web Scraping

We only collect publicly visible product information using respectful scraping practices. All robots.txt files are honored and rate limits are strictly followed.
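As an illustration of honoring robots.txt, the sketch below uses Python's standard `urllib.robotparser` to decide whether a URL may be fetched and how long to pause first. The user agent string, the default delay, and the helper names are hypothetical; this is not a description of the production crawler.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_delay(
    rules: RobotFileParser,
    user_agent: str,
    url: str,
    default_delay: float = 1.0,  # assumption: fallback pause when none is declared
) -> Optional[float]:
    """Return seconds to pause before fetching, or None if the URL is disallowed."""
    if not rules.can_fetch(user_agent, url):
        return None
    declared = rules.crawl_delay(user_agent)
    return float(declared) if declared is not None else default_delay
```

A crawler loop would skip any URL for which this returns `None` and sleep for the returned number of seconds otherwise.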

02

Data Validation & Cleaning

Every product record goes through automated validation: price format checking, duplicate removal, and data completeness scoring to ensure high quality.
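The three checks named above (price format, duplicate removal, completeness scoring) can be sketched in a few lines. The price pattern, the choice of `product_url` as the duplicate key, and the fields counted toward completeness are assumptions for illustration only.

```python
import re

# Assumption: a valid price is digits with an optional 1-2 digit decimal part.
PRICE_RE = re.compile(r"^\d+(\.\d{1,2})?$")
# Assumption: these fields count toward the completeness score.
SCORED_FIELDS = ("product_url", "title", "brand", "price", "currency", "availability")


def is_valid_price(value) -> bool:
    return value is not None and PRICE_RE.match(str(value).strip()) is not None


def completeness(record: dict) -> float:
    """Fraction of scored fields that are present and non-empty."""
    filled = sum(1 for name in SCORED_FIELDS if record.get(name) not in (None, ""))
    return filled / len(SCORED_FIELDS)


def clean(records):
    """Drop repeated product_url values (first wins) and annotate each record."""
    seen = set()
    cleaned = []
    for record in records:
        url = record.get("product_url")
        if url in seen:
            continue
        seen.add(url)
        cleaned.append({
            **record,
            "price_ok": is_valid_price(record.get("price")),
            "completeness": completeness(record),
        })
    return cleaned
```

Annotating records instead of dropping them keeps the decision about what counts as usable with the consumer of the data.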

03

Privacy Compliant

We only collect product information visible to any website visitor. No personal data, no customer information, no backend systems - just public product catalogs.

04

Regular Updates

Product data is refreshed daily to capture price changes, new products, and availability updates. You always get the most current market information.

Ready to Get Started?

Questions about our product data? Need a custom dataset for your industry? Let's talk.

Email hello@kodira.dev

Typical response time: Under 2 hours during business hours