# Architecture Overview

## System Architecture
The FFG Data Pipeline implements a modern lakehouse architecture on Databricks, combining the best of data lakes and data warehouses.
### High-Level Architecture

```mermaid
graph TB
    subgraph "Data Sources"
        A1[Azure Blob Storage]
        A2[SQL Data Warehouse]
        A3[MDM Systems]
    end
    subgraph "Processing Layer"
        B1[Databricks Spark]
        B2[SQL Warehouse]
        B3[Delta Lake]
    end
    subgraph "Data Layers"
        C1[Bronze Layer]
        C2[Silver Processing]
        C3[Gold Layer]
    end
    subgraph "Consumers"
        D1[Power BI]
        D2[Analytics]
        D3[APIs]
    end
    A1 --> B1
    A2 --> B2
    A3 --> B2
    B1 --> C1
    B2 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> D1
    C3 --> D2
    C3 --> D3
```
## Component Architecture

### 1. Data Sources

#### Azure Blob Storage

- Container: `domain-finance-ffg`
- Structure: Organized into `incoming`, `processed`, and `archive` folders
- File Types: Excel files for sell-out and RSP, CSV for mappings

#### SQL Data Warehouse

- Database: `cotydatacloud_pro`
- Schemas: `gold_finance`, `gold_mdm`
- Tables: `pl_global`, `material`, `customer`, brand hierarchies
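The snippet below is a minimal sketch of how these warehouse tables can be read from Databricks. The three-part names assume a Unity Catalog-style namespace; the actual pipeline may instead access the warehouse via JDBC or a two-level schema, so treat the paths as illustrative.

```python
# Hedged sketch: reading the source tables listed above from a Databricks session.
# The catalog/schema layout is an assumption based on this section, not the actual code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pl_global = spark.table("cotydatacloud_pro.gold_finance.pl_global")
material = spark.table("cotydatacloud_pro.gold_mdm.material")
customer = spark.table("cotydatacloud_pro.gold_mdm.customer")

pl_global.printSchema()
```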
### 2. Processing Components

#### Main Orchestrator (`main.py`)

The central control unit that coordinates all pipeline steps:

- Initializes processing environment
- Manages step execution order
- Handles error recovery
- Triggers archival processes
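As an illustration of how such an orchestrator can sequence the step modules, here is a minimal sketch. The `run` entry points, the step order, and the error handling are assumptions, not the project's actual implementation.

```python
# Illustrative orchestrator sketch; function names and structure are assumed.
import logging

from step_01_sellout_processing import run as run_sellout
from step_02_sellin_processing import run as run_sellin
from step_03_rsp_processing import run as run_rsp
from step_04_ventilation_allocation import run as run_ventilation

logger = logging.getLogger("ffg_pipeline")

STEPS = [
    ("sellout_processing", run_sellout),
    ("sellin_processing", run_sellin),
    ("rsp_processing", run_rsp),
    ("ventilation_allocation", run_ventilation),
]


def main() -> None:
    for name, step in STEPS:
        logger.info("=== Starting %s ===", name)
        try:
            step()
        except Exception:
            logger.exception("Step %s failed", name)
            raise
        logger.info("=== Finished %s ===", name)


if __name__ == "__main__":
    main()
```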
#### Pipeline Steps

1. Sell-Out Processing (`step_01_sellout_processing.py`)
    - Loads Excel source data
    - Applies retailer, company, and product mappings
    - Calculates discount metrics
    - Prepares gold layer (a minimal sketch of this step appears after the list)
2. Sell-In Processing (`step_02_sellin_processing.py`)
    - Queries SQL warehouse for `pl_global` data
    - Applies HFM aggregations
    - Enriches with MDM attributes
    - Applies business mappings
3. RSP Processing (`step_03_rsp_processing.py`)
    - Loads price data
    - Applies market mappings
    - Handles inheritance logic
    - Expands date ranges
4. Ventilation & Allocation (`step_04_ventilation_allocation.py`)
    - Creates house-level aggregations
    - Calculates product-level allocations
    - Applies special business rules
5. Master Consolidation (`step_05_master_consolidation.py`)
    - Placeholder for consolidation logic
6. Corridor Analysis (`step_06_corridor_analysis.py`)
    - Placeholder for corridor analysis logic
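The sketch below shows the overall shape of the first step (sell-out processing): load the Excel source, apply a mapping, derive a discount metric, and prepare a gold table. Paths, column names, table names, and the discount formula are illustrative assumptions, and the real step may read Excel through a Spark connector rather than pandas.

```python
# Hedged sketch of the sell-out step's structure; all names and formulas are illustrative.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load the Excel source (via pandas here; the actual step may use a Spark Excel reader)
raw_pdf = pd.read_excel("/dbfs/mnt/domain-finance-ffg/incoming/sellout.xlsx")
sellout = spark.createDataFrame(raw_pdf)

# Apply a retailer mapping (assumed to be a small CSV lookup)
retailer_map = spark.read.csv(
    "/mnt/domain-finance-ffg/incoming/retailer_mapping.csv", header=True
)
enriched = sellout.join(F.broadcast(retailer_map), on="retailer_code", how="left")

# Calculate a simple discount metric (illustrative formula)
gold_ready = enriched.withColumn(
    "discount_pct",
    (F.col("gross_sales") - F.col("net_sales")) / F.col("gross_sales"),
)

# Prepare the gold layer as a Delta table (table name assumed)
gold_ready.write.format("delta").mode("overwrite").saveAsTable("gold_finance.ffg_sellout")
```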
### 3. Data Model

#### Medallion Architecture

**Bronze Layer**

- Raw data with minimal transformations
- Preserves original data structure
- Includes processing timestamps

**Silver Layer (Processing)**

- Business logic applied
- Data standardization
- Enrichment and mappings

**Gold Layer**

- Business-ready datasets
- Optimized for analytics
- Pre-calculated KPIs
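A compact sketch of the bronze → silver → gold flow with Delta Lake is shown below. The table names, file paths, and transformations are placeholders chosen only to make the layering concrete.

```python
# Minimal medallion-flow sketch; names and transformations are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw data plus a processing timestamp, appended as-is
bronze = (
    spark.read.csv("/mnt/domain-finance-ffg/incoming/sellout.csv", header=True)
    .withColumn("_ingested_at", F.current_timestamp())
)
bronze.write.format("delta").mode("append").saveAsTable("bronze.ffg_sellout_raw")

# Silver: standardization, deduplication, enrichment
silver = (
    spark.table("bronze.ffg_sellout_raw")
    .withColumn("net_sales", F.col("net_sales").cast("decimal(18,2)"))
    .dropDuplicates(["retailer_code", "material_id", "period"])
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.ffg_sellout_clean")

# Gold: business-ready aggregate with a pre-calculated KPI
gold = (
    spark.table("silver.ffg_sellout_clean")
    .groupBy("retailer_code", "period")
    .agg(F.sum("net_sales").alias("net_sales"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold_finance.ffg_sellout_kpi")
```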
### 4. Technical Stack

#### Core Technologies
- Apache Spark: Distributed processing
- Delta Lake: ACID transactions, time travel
- Python 3.8+: Primary development language
- PySpark: Spark interface
#### Azure Services
- Databricks: Unified analytics platform
- Azure Data Lake Storage Gen2: Scalable storage
- Azure Key Vault: Secret management
- Power Automate: Workflow automation
## Data Flow

### 1. Ingestion Flow

### 2. Processing Flow

### 3. Ventilation Flow
## Key Design Patterns

### 1. Modular Architecture

Each processing step is self-contained with:

- Dedicated loader classes
- Enrichment processors
- Mapping handlers
- Gold layer preparers
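One way this layout can look in code is sketched below. The class names, method signatures, and join keys are assumptions made for illustration; they are not the project's actual classes.

```python
# Hedged sketch of the per-step module layout: loader, enricher/mapper, gold preparer.
from pyspark.sql import DataFrame, SparkSession


class SelloutLoader:
    """Dedicated loader: reads the raw source for this step."""

    def __init__(self, spark: SparkSession, path: str):
        self.spark = spark
        self.path = path

    def load(self) -> DataFrame:
        return self.spark.read.csv(self.path, header=True)


class SelloutEnricher:
    """Enrichment processor / mapping handler: applies lookups and business rules."""

    def enrich(self, df: DataFrame, retailer_map: DataFrame) -> DataFrame:
        return df.join(retailer_map, on="retailer_code", how="left")


class SelloutGoldPreparer:
    """Gold layer preparer: writes the business-ready Delta table."""

    def prepare(self, df: DataFrame, table: str) -> None:
        df.write.format("delta").mode("overwrite").saveAsTable(table)
```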
### 2. Configuration-Driven

All settings are centralized in `config.py`:

- Table names
- File paths
- Business rules
- Thresholds
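As a rough illustration, a centralized `config.py` might look like the sketch below. The constant names and values are placeholders; only the container and database names come from this document.

```python
# Illustrative config.py sketch; constant names and values are assumptions.
CONTAINER = "domain-finance-ffg"
INCOMING_PATH = f"/mnt/{CONTAINER}/incoming"
PROCESSED_PATH = f"/mnt/{CONTAINER}/processed"
ARCHIVE_PATH = f"/mnt/{CONTAINER}/archive"

SQL_DATABASE = "cotydatacloud_pro"
GOLD_FINANCE_SCHEMA = "gold_finance"
GOLD_MDM_SCHEMA = "gold_mdm"

# Business rules and thresholds (placeholder values)
DISCOUNT_ALERT_THRESHOLD = 0.40
DEFAULT_CURRENCY = "EUR"
```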
### 3. Error Handling

Comprehensive error management:

- Try/except blocks at each level
- Detailed logging
- Graceful degradation
- Recovery mechanisms
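A minimal sketch of this layered pattern is shown below: each step runs inside its own guard, failures are logged with full tracebacks, and optional steps degrade gracefully while required steps stop the run. The function name and the `required` flag are assumptions.

```python
# Hedged sketch of layered error handling with logging and graceful degradation.
import logging

logger = logging.getLogger("ffg_pipeline")


def run_step_safely(step_name: str, step_fn, *, required: bool = True) -> bool:
    """Run one pipeline step, log failures, and degrade gracefully if allowed."""
    try:
        step_fn()
        return True
    except Exception:
        logger.exception("Step '%s' failed", step_name)
        if required:
            # Required steps re-raise so recovery can be triggered upstream
            raise
        # Optional steps degrade gracefully: log the error and continue
        return False
```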
### 4. Performance Optimization
- DataFrame caching for reused data
- Broadcast joins for small lookups
- Composite keys for efficient joins
- Spark optimization techniques
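The snippet below illustrates the caching, broadcast-join, and composite-key techniques from this list. DataFrame names, table names, and join keys are illustrative only.

```python
# Sketch of caching, broadcast joins, and composite keys; names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.table("silver.ffg_sellout_clean")
retailer_map = spark.table("silver.retailer_mapping")  # small lookup table

# Cache a DataFrame that is reused by several downstream transformations
facts.cache()

# Broadcast the small lookup so the join avoids a shuffle
enriched = facts.join(F.broadcast(retailer_map), on="retailer_code", how="left")

# Build a composite key for efficient joins against other fact tables
enriched = enriched.withColumn(
    "join_key", F.concat_ws("|", "retailer_code", "material_id", "period")
)
```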
## Security Architecture

### Authentication
- Service principal authentication for Azure
- Databricks secrets for credentials
- Key Vault integration
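In practice this usually means reading credentials from a Key Vault-backed Databricks secret scope, as in the sketch below. The scope and key names are assumptions, and `dbutils` is only available inside Databricks notebooks and jobs.

```python
# Sketch of credential retrieval via Databricks secrets (Key Vault-backed scope).
# Scope, key, and storage account names are assumptions.
storage_key = dbutils.secrets.get(scope="ffg-keyvault", key="blob-storage-key")
sql_password = dbutils.secrets.get(scope="ffg-keyvault", key="sql-warehouse-password")

# Configure ABFS access with the retrieved key (replace <storage-account> as appropriate)
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    storage_key,
)
```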
### Data Security
- Encryption at rest (Azure Storage)
- Encryption in transit (TLS)
- Role-based access control
- Audit logging
## Scalability Considerations

### Horizontal Scaling
- Spark cluster auto-scaling
- Partitioned data storage
- Distributed processing
### Vertical Scaling
- Configurable cluster sizes
- Memory optimization
- Cache management
### Performance Tuning
- Adaptive query execution
- Dynamic partition pruning
- Cost-based optimization
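These features map to standard Spark SQL settings, enabled as in the sketch below. Note that adaptive query execution and dynamic partition pruning are already on by default in recent Databricks runtimes, so these lines mostly make the intent explicit.

```python
# Standard Spark SQL settings behind the optimizations listed above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")                       # adaptive query execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")    # AQE partition coalescing
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
spark.conf.set("spark.sql.cbo.enabled", "true")                            # cost-based optimization
```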
## Monitoring & Observability

### Logging
- Structured logging with levels
- Section markers for tracking
- Performance metrics
- Error tracking
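A small sketch of this logging style follows: section markers around each phase plus a duration metric. The logger name and marker format are assumptions.

```python
# Hedged sketch of structured logging with section markers and timing.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("ffg_pipeline")


@contextmanager
def log_section(name: str):
    """Emit START/END section markers and a duration metric around a block of work."""
    start = time.time()
    logger.info("========== START: %s ==========", name)
    try:
        yield
    finally:
        logger.info("========== END: %s (%.1fs) ==========", name, time.time() - start)


# Usage
with log_section("sellout_processing"):
    logger.info("Loaded %d rows", 12345)
```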
### Alerts
- Power Automate notifications
- Email alerts for failures
- Slack integration
### Metrics
- Processing times
- Row counts
- Data quality scores
- Resource utilization
## Deployment Architecture

### Environment Setup

### CI/CD Pipeline
- Git-based version control
- Automated testing
- Databricks job deployment
- Configuration management
### Disaster Recovery
- Data backup in archive folders
- Point-in-time recovery with Delta
- Redundant storage
- Failover procedures
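Point-in-time recovery relies on standard Delta Lake time travel, sketched below. `DESCRIBE HISTORY`, `versionAsOf`/`timestampAsOf` reads, and `RESTORE TABLE` are standard Delta features; the table name and version number are illustrative.

```python
# Sketch of point-in-time recovery with Delta Lake; table name and version are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect recent versions of a gold table
spark.sql("DESCRIBE HISTORY gold_finance.ffg_sellout_kpi").show(truncate=False)

# Read the table as it was at an earlier version or timestamp
old_version = spark.read.option("versionAsOf", 42).table("gold_finance.ffg_sellout_kpi")
old_snapshot = spark.read.option("timestampAsOf", "2024-01-01").table("gold_finance.ffg_sellout_kpi")

# Roll the table back after a bad load
spark.sql("RESTORE TABLE gold_finance.ffg_sellout_kpi TO VERSION AS OF 42")
```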