Architecture Overview

System Architecture

The FFG Data Pipeline implements a modern lakehouse architecture on Databricks, combining the low-cost, scalable storage of data lakes with the ACID transactions and query performance of data warehouses.

High-Level Architecture

```mermaid
graph TB
    subgraph "Data Sources"
        A1[Azure Blob Storage]
        A2[SQL Data Warehouse]
        A3[MDM Systems]
    end

    subgraph "Processing Layer"
        B1[Databricks Spark]
        B2[SQL Warehouse]
        B3[Delta Lake]
    end

    subgraph "Data Layers"
        C1[Bronze Layer]
        C2[Silver Processing]
        C3[Gold Layer]
    end

    subgraph "Consumers"
        D1[Power BI]
        D2[Analytics]
        D3[APIs]
    end

    A1 --> B1
    A2 --> B2
    A3 --> B2
    B1 --> C1
    B2 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> D1
    C3 --> D2
    C3 --> D3
```

Component Architecture

1. Data Sources

Azure Blob Storage

  • Container: domain-finance-ffg
  • Structure: Organized into incoming/processed/archive folders
  • File Types: Excel files for sell-out and RSP, CSV for mappings

SQL Data Warehouse

  • Database: cotydatacloud_pro
  • Schemas: gold_finance, gold_mdm
  • Tables: pl_global, material, customer, brand hierarchies

2. Processing Components

Main Orchestrator (main.py)

The central control unit that coordinates all pipeline steps:

  • Initializes processing environment
  • Manages step execution order
  • Handles error recovery
  • Triggers archival processes
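
A minimal sketch of how such an orchestrator can sequence the steps. The function names, the step registry, and the recovery hook are illustrative assumptions, not the actual contents of main.py:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ffg_pipeline")

def run_step(name, fn):
    """Run one pipeline step with section logging and error handling."""
    logger.info("=== START %s ===", name)
    try:
        fn()
        logger.info("=== END %s ===", name)
    except Exception:
        logger.exception("Step %s failed", name)
        raise  # a recovery / notification hook would go here

def main():
    # Placeholder callables; the real modules are step_01 ... step_06.
    steps = [
        ("sellout_processing", lambda: None),
        ("sellin_processing", lambda: None),
        ("rsp_processing", lambda: None),
        ("ventilation_allocation", lambda: None),
    ]
    for name, fn in steps:
        run_step(name, fn)
    # archival of processed files would be triggered here

if __name__ == "__main__":
    main()
```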

Pipeline Steps

  1. Sell-Out Processing (step_01_sellout_processing.py)
     • Loads Excel source data
     • Applies retailer, company, and product mappings
     • Calculates discount metrics
     • Prepares gold layer

  2. Sell-In Processing (step_02_sellin_processing.py)
     • Queries SQL warehouse for pl_global data
     • Applies HFM aggregations
     • Enriches with MDM attributes
     • Applies business mappings

  3. RSP Processing (step_03_rsp_processing.py)
     • Loads price data
     • Applies market mappings
     • Handles inheritance logic
     • Expands date ranges (see the sketch after this list)

  4. Ventilation & Allocation (step_04_ventilation_allocation.py)
     • Creates house-level aggregations
     • Calculates product-level allocations
     • Applies special business rules

  5. Master Consolidation (step_05_master_consolidation.py)
     • Placeholder for consolidation logic

  6. Corridor Analysis (step_06_corridor_analysis.py)
     • Placeholder for corridor analysis logic
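
Date-range expansion (step 3) typically turns one row per price validity window into one row per day. A minimal PySpark sketch; the column names and sample values are assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical RSP rows: one price per validity window.
rsp = spark.createDataFrame(
    [("SKU1", "2024-01-01", "2024-01-03", 9.99)],
    ["material", "valid_from", "valid_to", "rsp"],
)

# Expand each window into one row per day.
daily = rsp.withColumn(
    "date",
    F.explode(
        F.sequence(
            F.to_date("valid_from"),
            F.to_date("valid_to"),
            F.expr("interval 1 day"),
        )
    ),
).drop("valid_from", "valid_to")
```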

3. Data Model

Medallion Architecture

Bronze Layer

  • Raw data with minimal transformations
  • Preserves original data structure
  • Includes processing timestamps

Silver Layer (Processing)

  • Business logic applied
  • Data standardization
  • Enrichment and mappings

Gold Layer

  • Business-ready datasets
  • Optimized for analytics
  • Pre-calculated KPIs
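
The promotion between layers is a read-transform-write pattern on Delta tables. A hedged sketch; the table and column names are illustrative, not the pipeline's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: raw rows, original structure, plus a processing timestamp.
bronze = spark.read.table("bronze.sellout_raw")  # assumed table name

# Silver: business logic, standardization, enrichment.
silver = (bronze
          .dropDuplicates(["transaction_id"])
          .withColumn("net_sales", F.col("gross_sales") - F.col("discount")))

# Gold: business-ready aggregate with pre-calculated KPIs.
gold = silver.groupBy("retailer", "fiscal_month").agg(
    F.sum("net_sales").alias("net_sales"),
    F.avg("discount").alias("avg_discount"),
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.sellout_kpis")
```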

4. Technical Stack

Core Technologies

  • Apache Spark: Distributed processing
  • Delta Lake: ACID transactions, time travel
  • Python 3.8+: Primary development language
  • PySpark: Spark interface

Azure Services

  • Databricks: Unified analytics platform
  • Azure Data Lake Storage Gen2: Scalable storage
  • Azure Key Vault: Secret management
  • Power Automate: Workflow automation

Data Flow

1. Ingestion Flow

Source Files → Azure Blob → Spark Read → Bronze Tables
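
A hedged sketch of this hop for a CSV mapping file (the Excel sources would need an Excel reader such as the spark-excel library). The storage account name is a placeholder; the container matches the one named earlier:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# <storage-account> is a placeholder, not the real account name.
src = ("abfss://domain-finance-ffg@<storage-account>"
       ".dfs.core.windows.net/incoming/mappings/")

raw = spark.read.option("header", "true").csv(src)

# Bronze preserves the source structure and adds a processing timestamp.
(raw.withColumn("_ingested_at", F.current_timestamp())
    .write.format("delta").mode("append").saveAsTable("bronze.mappings"))
```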

2. Processing Flow

Bronze → Enrichment → Validation → Aggregation → Gold
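
The enrichment and validation stages can be sketched as a mapping join followed by a quarantine split; the table and column names here are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze.sellout_raw")         # assumed table
retailer_map = spark.read.table("bronze.retailer_map")  # assumed mapping

# Enrichment: attach mapped attributes.
enriched = bronze.join(retailer_map, on="retailer_code", how="left")

# Validation: keep unmapped rows aside instead of silently dropping them.
valid = enriched.filter(F.col("retailer_name").isNotNull())
rejected = enriched.filter(F.col("retailer_name").isNull())

# Aggregation into the gold layer.
gold = valid.groupBy("retailer_name", "fiscal_month").agg(
    F.sum("net_sales").alias("net_sales"))
```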

3. Ventilation Flow

Gold Sell-In → House Level → Product Level → Consolidated
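
A common way to implement this flow is a window over the house level followed by a proportional split down to products. A sketch under assumed column names:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

sellin = spark.read.table("gold.sellin")  # assumed table name

# House level: total sell-in per house.
w = Window.partitionBy("house_id")

allocated = (sellin
    .withColumn("house_total", F.sum("net_sales").over(w))
    # Product level: each product's share of its house total.
    .withColumn("alloc_ratio", F.col("net_sales") / F.col("house_total"))
    .withColumn("allocated_amount",
                F.col("alloc_ratio") * F.col("house_level_amount")))
```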

Key Design Patterns

1. Modular Architecture

Each processing step is self-contained with:

  • Dedicated loader classes
  • Enrichment processors
  • Mapping handlers
  • Gold layer preparers
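
A skeleton of that layout; the class and method names are illustrative, not the pipeline's actual classes:

```python
class SelloutLoader:
    """Loads raw sell-out files into a DataFrame."""
    def load(self, spark, path):
        return spark.read.option("header", "true").csv(path)

class SelloutEnricher:
    """Applies retailer / company / product mappings."""
    def enrich(self, df, mappings):
        return df.join(mappings, on="retailer_code", how="left")

class GoldPreparer:
    """Writes the business-ready output table."""
    def prepare(self, df, table):
        df.write.format("delta").mode("overwrite").saveAsTable(table)
```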

2. Configuration-Driven

All settings are centralized in config.py:

  • Table names
  • File paths
  • Business rules
  • Thresholds
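
A minimal sketch of what such a module can look like; every value below is a hypothetical example, not the real configuration:

```python
# config.py -- illustrative layout only

# Table names
BRONZE_SELLOUT_TABLE = "bronze.sellout_raw"
GOLD_SELLOUT_TABLE = "gold.sellout_kpis"

# File paths (<storage-account> is a placeholder)
CONTAINER = "domain-finance-ffg"
INCOMING_PATH = (
    f"abfss://{CONTAINER}@<storage-account>.dfs.core.windows.net/incoming/")

# Business rules / thresholds
DISCOUNT_ALERT_THRESHOLD = 0.35  # hypothetical value
```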

3. Error Handling

Comprehensive error management:

  • Try/except blocks at each level
  • Detailed logging
  • Graceful degradation
  • Recovery mechanisms
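
One way graceful degradation can look in practice, sketched with assumed function names:

```python
import logging

logger = logging.getLogger("ffg_pipeline")

def process_with_recovery(df, enrich_fn):
    """Apply an enrichment step; fall back to the input if it fails."""
    try:
        return enrich_fn(df)
    except Exception:
        logger.exception("Enrichment failed; continuing with unenriched data")
        return df  # graceful degradation: downstream steps still run
```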

4. Performance Optimization

  • DataFrame caching for reused data
  • Broadcast joins for small lookups
  • Composite keys for efficient joins
  • Spark optimization techniques
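
The first three techniques above can be sketched in a few lines of PySpark; the table and column names are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

facts = spark.read.table("bronze.sellout_raw")    # large fact table (assumed)
lookup = spark.read.table("bronze.retailer_map")  # small mapping (assumed)

# Broadcast join: ship the small lookup to every executor, avoiding a shuffle.
joined = facts.join(F.broadcast(lookup), on="retailer_code", how="left")

# Cache a DataFrame that several later steps reuse.
joined.cache()
joined.count()  # materialize the cache

# Composite key for efficient multi-column joins.
keyed = joined.withColumn(
    "join_key",
    F.concat_ws("|", "retailer_code", "material", "fiscal_month"))
```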

Security Architecture

Authentication

  • Service principal authentication for Azure
  • Databricks secrets for credentials
  • Key Vault integration
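
Retrieving credentials through a Key Vault-backed secret scope looks like this in a Databricks notebook or job (the scope and key names are illustrative; dbutils is provided by the Databricks runtime):

```python
# Scope and key names are assumptions; the scope would be backed by
# Azure Key Vault.
sql_user = dbutils.secrets.get(scope="ffg-keyvault", key="sql-user")
sql_password = dbutils.secrets.get(scope="ffg-keyvault", key="sql-password")
```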

Data Security

  • Encryption at rest (Azure Storage)
  • Encryption in transit (TLS)
  • Role-based access control
  • Audit logging

Scalability Considerations

Horizontal Scaling

  • Spark cluster auto-scaling
  • Partitioned data storage
  • Distributed processing

Vertical Scaling

  • Configurable cluster sizes
  • Memory optimization
  • Cache management

Performance Tuning

  • Adaptive query execution
  • Dynamic partition pruning
  • Cost-based optimization
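
These features map to standard Spark settings. A sketch (the first two are already on by default in recent Databricks runtimes; the table name in the statistics command is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Dynamic partition pruning
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Cost-based optimization relies on up-to-date table statistics.
spark.sql("ANALYZE TABLE gold.sellout_kpis COMPUTE STATISTICS")
```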

Monitoring & Observability

Logging

  • Structured logging with levels
  • Section markers for tracking
  • Performance metrics
  • Error tracking
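
A minimal sketch of the logging pattern described above; the logger name, marker format, and sample metric values are illustrative:

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s")
logger = logging.getLogger("ffg_pipeline")

def log_section(name):
    """Emit a section marker to make steps easy to find in job logs."""
    logger.info("==================== %s ====================", name)

log_section("SELL-OUT PROCESSING")
start = time.time()
# ... processing would run here ...
logger.info("rows=%d elapsed=%.1fs", 12345, time.time() - start)
```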

Alerts

  • Power Automate notifications
  • Email alerts for failures
  • Slack integration
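
Failure notifications can be pushed to a Power Automate flow through its HTTP trigger. A hedged sketch; the flow URL is a placeholder and would normally come from a secret:

```python
import json
import urllib.request

# Placeholder URL for a Power Automate "When an HTTP request is received"
# trigger; store the real URL as a secret, not in code.
FLOW_URL = "https://example.invalid/power-automate-trigger"

def notify_failure(step, error):
    """POST a failure payload to the notification flow."""
    payload = json.dumps({"step": step, "error": str(error)}).encode()
    req = urllib.request.Request(
        FLOW_URL, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```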

Metrics

  • Processing times
  • Row counts
  • Data quality scores
  • Resource utilization

Deployment Architecture

Environment Setup

Development → Staging → Production

CI/CD Pipeline

  • Git-based version control
  • Automated testing
  • Databricks job deployment
  • Configuration management

Disaster Recovery

  • Data backup in archive folders
  • Point-in-time recovery with Delta
  • Redundant storage
  • Failover procedures
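
Point-in-time recovery leans on Delta's table history. A sketch using standard Delta SQL; the table name and version number are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the table's change history to find a healthy version.
spark.sql("DESCRIBE HISTORY gold.sellout_kpis").show()

# Read an older snapshot without modifying the table.
old = spark.sql("SELECT * FROM gold.sellout_kpis VERSION AS OF 42")

# Roll the table back to that version.
spark.sql("RESTORE TABLE gold.sellout_kpis TO VERSION AS OF 42")
```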