
Analytics repo

The Analytics team's repository (repo-analytics/) demonstrates a traditional data engineering workflow focused on data ingestion, transformation, and analytics reporting. This repository serves as the foundation layer, providing clean, processed data that other teams can build upon.

Repository Structure

The Analytics repository follows standard Dagster project organization patterns:

```
repo-analytics/
├── dagster_cloud.yaml            # Dagster+ deployment config
├── pyproject.toml                # Python dependencies
└── src/analytics/
    ├── definitions.py            # Main definitions entry point
    └── defs/
        ├── raw_data.py           # Raw data ingestion assets
        ├── analytics_models.py   # Analytics transformations
        └── io_managers.py        # Resource configurations
```

The repository uses the defs/ folder pattern, allowing the team to organize their assets logically while ensuring automatic discovery by the definitions entry point.

Raw Data Ingestion

The foundation of the analytics pipeline starts with raw data ingestion assets that simulate connecting to various business systems:

raw_data.py - Customer Data Asset
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="raw_data")
def customer_data() -> pd.DataFrame:
    """Raw customer data from the CRM system."""
    # Simulated customer data
    data = {
        "customer_id": [1, 2, 3, 4, 5],
        "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "email": [
            "alice@example.com",
            "bob@example.com",
            "charlie@example.com",
            "diana@example.com",
            "eve@example.com",
        ],
        "signup_date": ["2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05", "2023-05-12"],
        "tier": ["premium", "basic", "premium", "basic", "premium"],
    }
    df = pd.DataFrame(data)
    df["signup_date"] = pd.to_datetime(df["signup_date"])

    return df
```

The raw data assets (customer_data, order_data, product_catalog) simulate typical business data sources. In production, these would connect to actual systems like CRMs, e-commerce platforms, or inventory management systems using appropriate I/O managers and resources.
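For instance, a production version of customer_data might query a warehouse table instead of building an in-memory dict. A minimal sketch using sqlite3 (the customers table, its schema, and the sample rows are illustrative assumptions, and the @dg.asset decorator is omitted for brevity):

```python
import sqlite3

import pandas as pd

# Illustrative stand-in for a real CRM database (schema is an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, tier TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "premium"), (2, "Bob", "basic")],
)
conn.commit()


def customer_data() -> pd.DataFrame:
    """Raw customer data pulled from a SQL source instead of hard-coded dicts."""
    return pd.read_sql_query("SELECT * FROM customers", conn)


df = customer_data()
```

In a real deployment the connection would come from a Dagster resource rather than module-level code, so credentials and environments stay configurable.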

Each raw data asset is grouped under "raw_data" for clear organization in the Dagster UI, making it easy for users to understand the data lineage and identify source systems.

Analytics Transformations

The Analytics team creates business-ready datasets through a series of transformation assets:

analytics_models.py - Customer Order Summary
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="analytics")
def customer_order_summary(customer_data: pd.DataFrame, order_data: pd.DataFrame) -> pd.DataFrame:
    """Summary of customer orders for analytics."""
    # Aggregate order data per customer
    summary = (
        order_data.groupby("customer_id")
        .agg({"order_id": "count", "total_amount": ["sum", "mean"], "order_date": ["min", "max"]})
        .round(2)
    )

    # Flatten the MultiIndex column names produced by .agg
    summary.columns = [
        "total_orders",
        "total_spent",
        "avg_order_value",
        "first_order",
        "last_order",
    ]
    summary = summary.reset_index()

    # Join in customer information
    summary = summary.merge(
        customer_data[["customer_id", "name", "tier"]], on="customer_id", how="left"
    )

    return summary
```

Note that the typed function parameters (customer_data, order_data) are what declare the upstream dependencies here; listing the same assets again in deps= would conflict with the inputs.

The customer_order_summary asset demonstrates a typical analytics transformation pattern:

  • Data Joining: Combines customer and order data to create a unified view
  • Aggregation: Calculates key metrics like total orders, spending, and order patterns
  • Business Logic: Transforms raw transactional data into analytics-ready formats

This asset serves as a foundation for both internal analytics reporting and as a dependency for the ML team's feature engineering pipeline.
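Run against a tiny hand-made sample (the values below are made up for illustration), the same aggregate-flatten-merge steps look like this:

```python
import pandas as pd

# Tiny illustrative sample (values are assumptions for this sketch)
order_data = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_id": [101, 102, 103],
    "total_amount": [20.0, 30.0, 50.0],
    "order_date": pd.to_datetime(["2023-06-01", "2023-06-15", "2023-07-01"]),
})
customer_data = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "tier": ["premium", "basic"],
})

# Same steps as customer_order_summary: aggregate, flatten columns, merge
summary = (
    order_data.groupby("customer_id")
    .agg({"order_id": "count", "total_amount": ["sum", "mean"], "order_date": ["min", "max"]})
    .round(2)
)
summary.columns = ["total_orders", "total_spent", "avg_order_value", "first_order", "last_order"]
summary = summary.reset_index().merge(
    customer_data[["customer_id", "name", "tier"]], on="customer_id", how="left"
)
```

Alice's two orders collapse into one row with total_orders of 2 and total_spent of 50.0, showing how transactional rows become one analytics-ready row per customer.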

Cross-Team Asset Exposure

The Analytics repository produces assets that are designed to be consumed by other teams:

analytics_models.py - Product Performance Asset
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="analytics")
def product_performance(order_data: pd.DataFrame, product_catalog: pd.DataFrame) -> pd.DataFrame:
    """Product performance metrics for the analytics dashboard."""
    # Calculate per-product sales metrics
    product_stats = (
        order_data.groupby("product")
        .agg({"order_id": "count", "quantity": "sum", "total_amount": "sum"})
        .reset_index()
    )

    product_stats.columns = ["product", "orders", "units_sold", "revenue"]

    # Add product catalog information
    performance = product_stats.merge(product_catalog, on="product", how="left")

    # Calculate profit and margin
    performance["total_cost"] = performance["units_sold"] * performance["cost"]
    performance["profit"] = performance["revenue"] - performance["total_cost"]
    performance["profit_margin"] = (performance["profit"] / performance["revenue"] * 100).round(2)

    return performance
```

The product_performance asset creates comprehensive product metrics that include:

  • Sales Metrics: Orders, units sold, and revenue
  • Profitability Analysis: Cost calculations and profit margins
  • Catalog Integration: Links sales data with product catalog information

This asset is specifically designed to be consumed by the ML team for product recommendation and demand forecasting models.
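The profitability arithmetic itself is simple; with assumed sample figures for a single product (the numbers below are illustrative, not from the dataset):

```python
# Assumed sample figures for one product
units_sold = 30
revenue = 600.0   # total_amount summed over that product's orders
unit_cost = 12.0  # unit cost from the product catalog

# Same arithmetic as the product_performance columns
total_cost = units_sold * unit_cost               # 360.0
profit = revenue - total_cost                     # 240.0
profit_margin = round(profit / revenue * 100, 2)  # 40.0
```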

Asset Organization and Groups

The Analytics repository uses logical asset groups to organize functionality:

  • raw_data group: Source data from external systems
  • analytics group: Transformed business metrics and reports

This grouping strategy makes it easy for both the Analytics team and downstream consumers to understand the purpose and maturity level of different assets. The clear separation between raw and processed data also helps establish data governance boundaries.

Shared Resource Configuration

All assets in the Analytics repository use the shared FilesystemIOManager configured in the definitions file:

definitions.py - Shared I/O Manager
```python
# Add shared I/O manager for cross-repository access
resources={
    **defs_from_folder.resources,
    "io_manager": dg.FilesystemIOManager(base_dir="~/Documents/dagster_shared_assets"),
},
```

This shared storage configuration ensures that assets materialized by the Analytics team can be accessed by other code locations, particularly the ML team that depends on analytics outputs for their feature engineering pipeline.
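Conceptually, the filesystem I/O manager persists each materialized value to a file under base_dir keyed by the asset name, and a downstream code location reads the same file back. A simplified stdlib sketch of that handoff (the flat one-file-per-key layout is an approximation for illustration, not the manager's exact implementation):

```python
import os
import pickle
import tempfile

base_dir = tempfile.mkdtemp()  # stand-in for ~/Documents/dagster_shared_assets


def store_asset(asset_key: str, value) -> None:
    # Writer side (Analytics repo): persist the materialized value to shared storage
    with open(os.path.join(base_dir, asset_key), "wb") as f:
        pickle.dump(value, f)


def load_asset(asset_key: str):
    # Reader side (ML repo): load the same value from shared storage
    with open(os.path.join(base_dir, asset_key), "rb") as f:
        return pickle.load(f)


store_asset("customer_order_summary", {"total_orders": 3, "total_spent": 120.0})
restored = load_asset("customer_order_summary")
```

Because both code locations point at the same base_dir, the ML team's assets can declare dependencies on Analytics assets and have their outputs loaded transparently.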

Next steps

Now that we understand the Analytics repository's data foundation layer, we can explore how the ML Platform team builds sophisticated machine learning workflows on top of this analytics data.