Project: E-Commerce CSV-to-Parquet Optimization

Summary

At an e-commerce company (an online retailer of locally sourced gourmet foods and artisanal home goods serving Spanish consumers and EU tourists), daily supplier inventory CSVs were causing slow analytics and schema inconsistencies. I built a pipeline that automatically converts the feeds to Parquet and reconciles their inconsistent schemas, cutting Athena query costs by 60% and speeding up queries. This let the team adjust pricing strategies faster.

Business Problem

An e-commerce startup receives daily CSV inventory feeds from 20+ small suppliers. While CSV was simple to start with, performance and cost issues have compounded as data volume has grown.

Current Data Sources & Architecture

Table 1                  Table 2
sku                      product_id
product_name             product_name
current_stock            current_stock
price_eur                price
discount_price           discount_price
reorder_threshold        lead_time_days
last_delivery_date       reorder_threshold
                         last_delivery_date
                         supplier_notes

Sample supplier CSV schemas, showing common inconsistencies: different key names (sku vs. product_id), inconsistent price naming (price_eur vs. price), and fields such as lead_time_days and supplier_notes that appear in only some feeds.
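To reconcile schemas like the two above, the pipeline maps each supplier's headers onto one canonical schema before conversion. A minimal sketch in plain Python; the mapping table and helper names are illustrative, not the production code:

```python
import csv
import io

# Illustrative aliases from supplier-specific headers to the canonical schema.
CANONICAL_COLUMNS = {
    "sku": "product_id",   # Table 1 uses "sku", Table 2 uses "product_id"
    "price_eur": "price",  # Table 1 suffixes the currency, Table 2 does not
}

def normalize_header(columns):
    """Rename known aliases; leave already-canonical columns untouched."""
    return [CANONICAL_COLUMNS.get(col.strip().lower(), col.strip().lower())
            for col in columns]

def normalize_csv(raw_text):
    """Parse a supplier CSV into dicts keyed by canonical column names.
    Columns missing from a feed (e.g. lead_time_days) simply stay absent."""
    reader = csv.reader(io.StringIO(raw_text))
    header = normalize_header(next(reader))
    return [dict(zip(header, row)) for row in reader]
```

For example, `normalize_csv("sku,price_eur\nA-100,9.95\n")` yields rows keyed by `product_id` and `price`, so downstream code never sees the supplier-specific names.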

Pain Points

Proposed Solution

Implement a serverless ETL pipeline to convert raw CSVs into partitioned Parquet files, leveraging AWS Glue, Lambda, and Spark for scalability and cost efficiency.
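One way to wire the trigger: an S3 upload event invokes a Lambda that derives partition values from the object key and starts the Glue job. A sketch, assuming a job named csv-to-parquet and keys laid out like raw/supplier_a/2024-05-01.csv (both are assumptions, not the actual names):

```python
import os

def partition_args(s3_key):
    """Derive Glue job arguments from a key such as
    'raw/supplier_a/2024-05-01.csv' (this layout is an assumption)."""
    _, supplier, filename = s3_key.split("/")
    feed_date = os.path.splitext(filename)[0]
    return {"--supplier": supplier, "--feed_date": feed_date}

def lambda_handler(event, context):
    # boto3 ships with the Lambda runtime; imported here so the module
    # stays importable in environments without it.
    import boto3
    glue = boto3.client("glue")
    key = event["Records"][0]["s3"]["object"]["key"]
    return glue.start_job_run(JobName="csv-to-parquet",
                              Arguments=partition_args(key))
```

Keeping the Lambda to key parsing plus a start_job_run call means the heavy lifting stays in Glue, where Spark can scale with the batch size.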

[Figure 2: ETL processing architecture. Lambda-triggered AWS Glue jobs ingest CSV and write Parquet, with Spark handling large batches.]
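The conversion step itself can be a small PySpark script. The sketch below uses a bare SparkSession rather than the GlueContext wrappers an actual Glue job would use, and the bucket names and partition columns are illustrative assumptions:

```python
def output_path(bucket, supplier, feed_date):
    """Hive-style partition path so Athena can prune by supplier and date."""
    return (f"s3://{bucket}/inventory/"
            f"supplier={supplier}/feed_date={feed_date}/")

def main():
    # pyspark imported lazily so the module loads without Spark installed.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://example-raw-bucket/raw/supplier_a/2024-05-01.csv"))
    # Parquet with default Snappy compression is the Athena-friendly format;
    # mode("overwrite") makes reruns of a day's feed idempotent.
    df.write.mode("overwrite").parquet(
        output_path("example-curated-bucket", "supplier_a", "2024-05-01"))

if __name__ == "__main__":
    main()
```

Partitioning by supplier and feed date is what lets Athena scan only the relevant slices, which is where most of the query cost savings come from.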

Key Components

Business Impact

Thanks for reading! πŸ§‘β€πŸ’»πŸ’•