# Data Flow

## End-to-End Data Flow

The FFG pipeline processes data through multiple stages, transforming raw inputs into business-ready analytics.

### Data Flow Diagram
```mermaid
flowchart LR
    subgraph "Input Sources"
        A1[Excel Files<br/>Sell-Out Data]
        A2[CSV Files<br/>Mappings]
        A3[SQL Tables<br/>PL Global]
        A4[Excel Files<br/>RSP Data]
        A5[MDM Tables<br/>Master Data]
    end
    subgraph "Bronze Layer"
        B1[bronze_sellout]
        B2[bronze_sellin]
        B3[bronze_rsp]
        B4[Mapping Tables]
    end
    subgraph "Processing"
        C1[Enrichment]
        C2[Validation]
        C3[Aggregation]
        C4[Ventilation]
    end
    subgraph "Gold Layer"
        D1[gold_sellout]
        D2[gold_sellin]
        D3[gold_rsp]
        D4[gold_ventilated]
        D5[gold_consolidated]
    end
    A1 --> B1
    A2 --> B4
    A3 --> B2
    A4 --> B3
    A5 --> C1
    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> C4
    C4 --> D1
    C4 --> D2
    C4 --> D3
    C4 --> D4
    D4 --> D5
```
## Stage 1: Data Ingestion

### Sell-Out Data Flow

- Source: Excel files uploaded to Azure Blob Storage
- Location: `/sellout_data/incoming/`
- Processing:
    - Header detection
    - Column standardization
    - Data type conversion
- Output: Bronze sellout table
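The three processing steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names and the normalization rules are assumptions.

```python
import re

def standardize_column(raw: str) -> str:
    """Normalize a raw Excel header to snake_case (illustrative rule set)."""
    name = raw.strip().lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)  # collapse spaces/punctuation
    return name.strip("_")

def find_header_row(rows: list[list], required: set[str]) -> int:
    """Header detection: index of the first row containing all required headers."""
    for i, row in enumerate(rows):
        cells = {standardize_column(str(c)) for c in row}
        if required <= cells:
            return i
    raise ValueError("header row not found")
```

Data type conversion would then be applied per column once the header row is known.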
### Sell-In Data Flow

- Source: SQL queries to the `pl_global` table
- Filters:
    - Date range: previous year to current
    - Division: Luxury only
    - Reporting units in scope
- Aggregation: by fiscal year, month, and dimensions
- Output: Bronze sellin table
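The filter set above, expressed as a row predicate. Column names (`fiscal_date`, `division`, `reporting_unit`) are assumptions for illustration; in the pipeline these filters live in the SQL query itself.

```python
from datetime import date

def in_scope(row: dict, scope_rus: set[str], today: date) -> bool:
    """Sell-in filter sketch: previous year to current, Luxury division,
    reporting unit in scope. Column names are hypothetical."""
    return (
        row["fiscal_date"] >= date(today.year - 1, 1, 1)
        and row["division"] == "Luxury"
        and row["reporting_unit"] in scope_rus
    )
```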
### RSP Data Flow

- Source: Excel files with price data
- Processing:
    - Market to reporting unit mapping
    - FX rate conversion to EUR
    - Inheritance application
- Date expansion: one row per day per EAN
- Output: Bronze and gold RSP tables
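The date-expansion step is the one structural transformation here: each validity interval becomes one row per day per EAN. A minimal sketch (field names are illustrative):

```python
from datetime import date, timedelta

def expand_daily(ean: str, price_eur: float, start: date, end: date):
    """Expand one RSP record into one row per day per EAN (end inclusive)."""
    day = start
    while day <= end:
        yield {"ean": ean, "date": day, "price_eur": price_eur}
        day += timedelta(days=1)
```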
## Stage 2: Enrichment & Mapping

### Mapping Application Flow

#### Sell-Out Mappings
- Retailer Mapping: Links retailers to customers and countries
- Company Mapping: Maps brands to companies
- Product Mapping: Adds product details via UPC
#### Sell-In Mappings
- Region Mapping: Assigns regions to reporting units
- Customer Info: Adds customer classifications
- Freight & Royalty: Applies percentages
- FX Rates: Converts LC to EUR
- HFM Groups: Creates financial aggregations
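As one concrete example from the list above, the FX step converts local-currency amounts to EUR via a rate table. The rate convention below (EUR = LC / rate) is an assumption; verify it against the actual FX source before relying on it.

```python
def to_eur(amount_lc: float, currency: str, fx_rates: dict[str, float]) -> float:
    """Convert a local-currency amount to EUR.
    Assumed convention: fx_rates[ccy] is units of LC per 1 EUR."""
    if currency == "EUR":
        return amount_lc
    return amount_lc / fx_rates[currency]
```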
### MDM Enrichment Flow

Enrichment sources:

- Material master
- Customer hierarchy
- Brand/House hierarchies
- Product hierarchies
## Stage 3: Transformation & Aggregation

### HFM Processing
- Direct Accounts: Individual HFM account aggregations
- PNL Lines: Grouped account calculations
- Trade Terms: Category-based groupings
- Special Cases: COGS subtractions for specific accounts
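The grouped-account and special-case logic above can be modeled as a signed sum over account members. The account codes and the helper name are illustrative, not the pipeline's actual mapping:

```python
def pnl_line(accounts: dict[str, float], members: list[str],
             subtract: tuple[str, ...] = ()) -> float:
    """Sum HFM account values into one PNL line; 'subtract' models the
    COGS-subtraction special cases. Account codes are hypothetical."""
    total = sum(accounts.get(a, 0.0) for a in members)
    total -= sum(accounts.get(a, 0.0) for a in subtract)
    return total
```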
### Aggregation Levels

#### House Level

- Dimensions: Brand × Retailer × Reporting Unit × House
- Metrics: Gross sales, on/off invoice, freight
- Special handling for multiline brands
#### Product Level
- Dimensions: All house dimensions + Material key
- Metrics: Individual product performance
- Percentage allocations from house level
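The percentage allocation works by giving each product its share of the house-level total. A minimal sketch (the allocation metric, here sales, is an assumption):

```python
def product_shares(product_sales: dict[str, float]) -> dict[str, float]:
    """Share of the house total per product; zero-total houses get 0.0
    rather than a division-by-zero."""
    house_total = sum(product_sales.values())
    if house_total == 0:
        return {k: 0.0 for k in product_sales}
    return {k: v / house_total for k, v in product_sales.items()}
```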
## Stage 4: Ventilation Process

### Ventilation Flow
```mermaid
flowchart TB
    A[Gold Sell-In Data]
    B[Compute Derived Columns]
    C[House Level Aggregation]
    D[Product Level Aggregation]
    E[Percentage Calculations]
    F[Allocation Distribution]
    G[KPI Calculation]
    H[Final Ventilated Data]
    A --> B
    B --> C
    B --> D
    C --> E
    D --> E
    E --> F
    F --> G
    G --> H
```
### Key Ventilation Rules

- Regular Items Only: filter `article_type = "Regular Item"`
- Quality Filters: apply `cleaning_total = 1`
- Not Assigned Handling: Special allocation logic
- Exception Cases: Douglas, ICI Paris XL, BENELUX
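The first two rules are simple row filters; expressed as a predicate (column names as given in the text, the exception and not-assigned logic is omitted here):

```python
def in_ventilation_scope(row: dict) -> bool:
    """Quality filters from the ventilation rules: regular items that
    passed cleaning. Exception retailers need separate handling."""
    return row["article_type"] == "Regular Item" and row["cleaning_total"] == 1
```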
## Stage 5: Consolidation

### Master Table Creation

Key joins:

- Date alignment
- EAN/UPC matching
- Reporting unit consistency
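The date-alignment and EAN-matching joins amount to a composite-key lookup. A plain-Python sketch of a left join on `(date, ean)` (field names are assumptions; unmatched rows keep a null price):

```python
def consolidate(sellout: list[dict], rsp: list[dict]) -> list[dict]:
    """Left-join sell-out rows to daily RSP prices on (date, ean)."""
    price_by_key = {(r["date"], r["ean"]): r["price_eur"] for r in rsp}
    return [
        {**s, "price_eur": price_by_key.get((s["date"], s["ean"]))}
        for s in sellout
    ]
```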
## Stage 6: Output & Archive

### Output Flow
- Gold Tables: Business-ready data
- Processing Markers: Completion indicators
- Power Automate: Trigger downstream processes
- Archival: Move processed files
### Archive Structure

## Data Quality Checkpoints

### Validation Points
- Input Validation
    - File format checks
    - Required columns verification
    - Data type validation
- Join Validation
    - No data inflation
    - Key matching rates
    - Null handling
- Business Rules
    - Threshold checks
    - Exception handling
    - Consistency validation
- Output Validation
    - Row count verification
    - KPI reasonability
    - Completeness checks
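The "no data inflation" check deserves a concrete shape: a join that adds rows usually means a duplicate-key fan-out on the lookup side. A minimal guard (function name is illustrative):

```python
def assert_no_inflation(rows_before: int, rows_after: int, step: str) -> None:
    """Join-validation guard: fail fast if a join grew the row count."""
    if rows_after > rows_before:
        raise ValueError(
            f"{step}: row count grew from {rows_before} to {rows_after}"
        )
```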
## Performance Optimization

### Caching Strategy
- Cache frequently accessed DataFrames
- Clear cache after major operations
- Monitor memory usage
### Join Optimization
- Broadcast small lookup tables
- Use composite keys for efficiency
- Pre-filter data before joins
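Broadcasting a small lookup table means every worker gets a full copy and the join becomes a local hash lookup, avoiding a shuffle. The same idea in plain Python (a sketch of the mechanism, not the pipeline's Spark code):

```python
def map_side_join(fact: list[dict], lookup: list[dict], key: str) -> list[dict]:
    """Broadcast-join analogue: hash the small table once, then enrich
    the large side row by row with no sort or shuffle."""
    by_key = {row[key]: row for row in lookup}
    return [{**f, **by_key.get(f[key], {})} for f in fact]
```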
### Aggregation Optimization
- Single pass aggregations
- Minimize shuffles
- Use window functions efficiently
## Error Handling Flow

```mermaid
flowchart LR
    A[Process Step] --> B{Success?}
    B -->|Yes| C[Next Step]
    B -->|No| D[Log Error]
    D --> E{Recoverable?}
    E -->|Yes| F[Retry Logic]
    E -->|No| G[Graceful Failure]
    F --> A
    G --> H[Alert & Exit]
```
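The retry branch of the flow above, as a small wrapper. This is a generic sketch, assuming a linear backoff; the pipeline's actual retry policy may differ:

```python
import time

def with_retry(step, max_attempts: int = 3, base_delay: float = 0.0):
    """Run `step` up to max_attempts times, sleeping between attempts.
    On the final failure, re-raise so the alert-and-exit path can fire."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * attempt)
```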
## Data Lineage

### Traceability
- Processing timestamps at each stage
- Source file tracking
- Transformation audit trail
- User/job identification
### Versioning
- Pipeline version in metadata
- Schema evolution support
- Backward compatibility