Small Business Data Lake for Customer Insights

Summary

A small grocery chain in Lima, Perú, struggled to understand the impact of their sales promotions. Their sales, loyalty, and invoice data were siloed across multiple systems, making it nearly impossible to track customer behavior or campaign effectiveness. This project unified their data sources into an S3-based data lake, enabling real-time insights and smarter promotion strategies.

Business Challenge

With three stores in working-class districts, the chain aimed to compete with traditional corner shops by offering affordable staples and loyalty incentives. However, their sales, loyalty, and invoice data were fragmented across separate systems.

The owners couldn’t determine whether a promotion actually increased revenue or engagement. For example, a “Free Oil with 5 Visits” campaign caused unintended losses: customers redeemed the free oil but skipped other, higher-margin products. No one noticed the impact for weeks.
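To make the oil example concrete, the check the owners lacked can be sketched as a comparison of margin per visit between redemption visits and ordinary visits. This is a minimal illustration in Python, not the project's actual code: the `Visit` record, its field names, and every figure below are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Visit:
    """One store visit (hypothetical schema, for illustration only)."""
    customer_id: str
    revenue: float            # what the customer paid
    cost_of_goods: float      # cost of the items sold
    promo_cost: float = 0.0   # cost of any free item handed out

    @property
    def margin(self) -> float:
        # Margin after subtracting the cost of the promotional giveaway.
        return self.revenue - self.cost_of_goods - self.promo_cost


def average_margin(visits):
    """Mean margin per visit; 0.0 for an empty list."""
    return mean(v.margin for v in visits) if visits else 0.0


def promo_margin_gap(visits):
    """Average margin of redemption visits minus that of ordinary visits."""
    redeemed = [v for v in visits if v.promo_cost > 0]
    ordinary = [v for v in visits if v.promo_cost == 0]
    return average_margin(redeemed) - average_margin(ordinary)


# Made-up numbers: two ordinary visits and one oil redemption.
visits = [
    Visit("c1", revenue=40.0, cost_of_goods=30.0),
    Visit("c2", revenue=50.0, cost_of_goods=36.0),
    Visit("c1", revenue=12.0, cost_of_goods=9.0, promo_cost=8.0),
]
gap = promo_margin_gap(visits)
print(f"Margin gap per visit: {gap:.2f}")
```

A negative gap means redemption visits earn less per visit than ordinary ones, which is the pattern the oil campaign produced. On its own the gap does not show whether the promotion drove extra visits, so it is a first signal rather than a full evaluation.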

Solution and Architecture

I designed and implemented a unified data platform using AWS services. All incoming data landed in an Amazon S3 data lake with clearly defined raw zones. AWS Glue and Spark jobs handled cleaning, joining, and structured-data extraction, including NLP-based text extraction from invoice PDFs.
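One way to picture the zone layout and the joining step is the sketch below. It is an illustration under stated assumptions rather than the deployed pipeline: the key pattern, the source and file names, the record fields, and the sample data are all hypothetical, and plain Python stands in for the Glue and Spark jobs.

```python
from datetime import date


def lake_key(zone: str, source: str, day: date, filename: str) -> str:
    """Build a date-partitioned object key (hypothetical layout)."""
    return (
        f"{zone}/{source}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def join_sales_with_loyalty(sales, members):
    """Left-join sales rows to loyalty members on customer_id."""
    by_id = {m["customer_id"]: m for m in members}
    joined = []
    for sale in sales:
        member = by_id.get(sale["customer_id"])
        # Sales without a matching loyalty member are kept with a None tier.
        tier = member["tier"] if member else None
        joined.append({**sale, "loyalty_tier": tier})
    return joined


# Made-up sample records.
sales = [
    {"customer_id": "c1", "total": 25.5},
    {"customer_id": "c9", "total": 8.0},
]
members = [{"customer_id": "c1", "tier": "gold"}]

key = lake_key("raw", "sales", date(2024, 3, 5), "pos_export.csv")
rows = join_sales_with_loyalty(sales, members)
print(key)
print(rows)
```

Keeping unmatched sales in the output (a left join) is a deliberate choice in this sketch: purchases by customers outside the loyalty program still count toward a promotion's revenue.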

Architecture Diagram of Data Lake Pipeline for La Barata

Steps Taken

Impact and Results

The chain moved from reactive guesswork to data-driven promotion planning.

This project helped a small Latin American business use cloud analytics and natural language processing to compete more effectively and optimize every sol spent on promotions.