Owner: Muhammad Iman Qayyum Sufi Bin Mohamed Sufi
Email: [email protected]
Created On: January 2025
Project Overview:
This project focuses on building a Lakehouse data pipeline to process fitness data efficiently.
I will implement a structured data flow (Bronze, Silver, and Gold layers) using both batch processing and Spark Structured Streaming to handle historical and real-time data. The goal is to ingest and process user workout and gym activity data end to end.
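As a rough sketch of the Bronze-to-Silver step described above, the cleaning logic can be illustrated in plain Python. This is only an assumption-laden illustration: the field names (`user_id`, `ts`, `bpm`) are invented for the example, and the actual pipeline would express the same logic with PySpark DataFrames over Delta tables rather than Python lists.

```python
# Illustrative sketch of the Bronze -> Silver cleaning step.
# Field names (user_id, ts, bpm) are assumptions for illustration only;
# the real pipeline would use PySpark DataFrames over Delta tables.

def bronze_to_silver(bronze_records):
    """Deduplicate raw records and drop rows missing required keys."""
    seen = set()
    silver = []
    for rec in bronze_records:
        # Require a user id and a timestamp; skip malformed rows.
        if rec.get("user_id") is None or rec.get("ts") is None:
            continue
        key = (rec["user_id"], rec["ts"])
        if key in seen:  # drop exact duplicates (e.g. re-delivered events)
            continue
        seen.add(key)
        # Normalize types: user_id as str, BPM as int where present.
        silver.append({
            "user_id": str(rec["user_id"]),
            "ts": rec["ts"],
            "bpm": int(rec["bpm"]) if rec.get("bpm") is not None else None,
        })
    return silver

bronze = [
    {"user_id": 1, "ts": "2025-01-01T10:00:00", "bpm": "98"},
    {"user_id": 1, "ts": "2025-01-01T10:00:00", "bpm": "98"},   # duplicate
    {"user_id": None, "ts": "2025-01-01T10:01:00", "bpm": "99"},  # bad row
]
print(bronze_to_silver(bronze))
```

In the Spark version, the deduplication and null filtering would typically be a `dropDuplicates` plus `filter` over the streaming Bronze table before writing to Silver.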
Technology Used:
- Languages & Frameworks - Python, PySpark, Spark Structured Streaming
- Query Language - SQL
- Microsoft Azure
- Azure Data Lake Storage
- Azure Databricks with Unity Catalog
- Azure DevOps
Requirements:
- Lakehouse Architecture – Implement a Lakehouse platform using the medallion architecture (Bronze, Silver, Gold) for structured data storage and processing.
- Data Ingestion – Collect and ingest fitness data (workouts, BPM, logins) from APIs, databases, and Kafka, supporting both batch and streaming workflows.
- Data Processing – Transform raw data into cleaned and aggregated datasets using Databricks, PySpark, and Delta Lake, ensuring data quality and efficiency.
- Analytics & Reporting – Prepare Workout BPM Summary and Gym Summary datasets for insights, dashboards, and reporting.
- Security & Automation – Implement role-based access control (RBAC) with Unity Catalog, CI/CD pipelines, and automated testing for deployment and data validation.
- Scalability & Performance – Design for high availability, scalability, and cost efficiency, optimizing queries and storage for real-time and batch processing.
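The Workout BPM Summary requirement above can be sketched as a Gold-layer aggregation. The plain-Python version below is a hedged illustration (column names `user_id`, `workout_id`, `bpm` are assumed, not the actual schema); in Databricks the same computation would be a PySpark `groupBy().agg()` over the Silver Delta table:

```python
# Illustrative Gold-layer aggregation for the Workout BPM Summary.
# Column names (user_id, workout_id, bpm) are assumptions; the real job
# would be a PySpark groupBy().agg() over the Silver Delta table.
from collections import defaultdict

def workout_bpm_summary(silver_records):
    """Aggregate min/avg/max heart rate per (user, workout) pair."""
    groups = defaultdict(list)
    for rec in silver_records:
        groups[(rec["user_id"], rec["workout_id"])].append(rec["bpm"])
    return [
        {
            "user_id": user,
            "workout_id": workout,
            "min_bpm": min(bpms),
            "avg_bpm": round(sum(bpms) / len(bpms), 1),
            "max_bpm": max(bpms),
        }
        for (user, workout), bpms in groups.items()
    ]

silver = [
    {"user_id": "u1", "workout_id": "w1", "bpm": 90},
    {"user_id": "u1", "workout_id": "w1", "bpm": 110},
    {"user_id": "u2", "workout_id": "w1", "bpm": 100},
]
print(workout_bpm_summary(silver))
```

The Gym Summary dataset would follow the same pattern with a different grouping key (for example, gym location and session date).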
About the datasets:
