
Analytics repo

The Analytics team's repository (repo-analytics/) demonstrates a traditional data engineering workflow focused on data ingestion, transformation, and analytics reporting. This repository serves as the foundation layer, providing clean, processed data that other teams can build upon.

Repository Structure

The Analytics repository follows standard Dagster project organization patterns:

```
repo-analytics/
├── dagster_cloud.yaml            # Dagster+ deployment config
├── pyproject.toml                # Python dependencies
└── src/analytics/
    ├── definitions.py            # Main definitions entry point
    └── defs/
        ├── raw_data.py           # Raw data ingestion assets
        ├── analytics_models.py   # Analytics transformations
        └── io_managers.py        # Resource configurations
```

The repository uses the defs/ folder pattern, allowing the team to organize their assets logically while ensuring automatic discovery by the definitions entry point.

Raw Data Ingestion

The foundation of the analytics pipeline starts with raw data ingestion assets that simulate connecting to various business systems:

raw_data.py - Customer Data Asset
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="raw_data")
def customer_data() -> pd.DataFrame:
    """Raw customer data from the CRM system."""
    # Simulated customer data
    data = {
        "customer_id": [1, 2, 3, 4, 5],
        "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
        "email": [
            "alice@example.com",
            "bob@example.com",
            "charlie@example.com",
            "diana@example.com",
            "eve@example.com",
        ],
        "signup_date": ["2023-01-15", "2023-02-20", "2023-03-10", "2023-04-05", "2023-05-12"],
        "tier": ["premium", "basic", "premium", "basic", "premium"],
    }
    df = pd.DataFrame(data)
    df["signup_date"] = pd.to_datetime(df["signup_date"])

    return df
```

The raw data assets (customer_data, order_data, product_catalog) simulate typical business data sources. In production, these would connect to actual systems like CRMs, e-commerce platforms, or inventory management systems using appropriate I/O managers and resources.
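For instance, a production version of customer_data might query a warehouse table instead of building an in-memory dict. A minimal sketch using sqlite3 (the customers table, its schema, and the sample rows are illustrative assumptions, and the @dg.asset decorator is omitted for brevity):

```python
import sqlite3

import pandas as pd

# Illustrative stand-in for a real CRM database (schema is an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, tier TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "premium"), (2, "Bob", "basic")],
)
conn.commit()


def customer_data() -> pd.DataFrame:
    """Raw customer data pulled from a SQL source instead of hard-coded dicts."""
    return pd.read_sql_query("SELECT * FROM customers", conn)


df = customer_data()
```

In a real deployment the connection would come from a Dagster resource rather than module-level code, so credentials and environments stay configurable.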

Each raw data asset is grouped under "raw_data" for clear organization in the Dagster UI, making it easy for users to understand the data lineage and identify source systems.

Analytics Transformations

The Analytics team creates business-ready datasets through a series of transformation assets:

analytics_models.py - Customer Order Summary
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="analytics")
def customer_order_summary(customer_data: pd.DataFrame, order_data: pd.DataFrame) -> pd.DataFrame:
    """Summary of customer orders for analytics."""
    # Aggregate order data per customer
    summary = (
        order_data.groupby("customer_id")
        .agg({"order_id": "count", "total_amount": ["sum", "mean"], "order_date": ["min", "max"]})
        .round(2)
    )

    # Flatten the MultiIndex column names produced by .agg
    summary.columns = [
        "total_orders",
        "total_spent",
        "avg_order_value",
        "first_order",
        "last_order",
    ]
    summary = summary.reset_index()

    # Join in customer information
    summary = summary.merge(
        customer_data[["customer_id", "name", "tier"]], on="customer_id", how="left"
    )

    return summary
```

Note that the typed function parameters (customer_data, order_data) are what declare the upstream dependencies here; listing the same assets again in deps= would conflict with the inputs.

The customer_order_summary asset demonstrates a typical analytics transformation pattern:

  • Data Joining: Combines customer and order data to create a unified view
  • Aggregation: Calculates key metrics like total orders, spending, and order patterns
  • Business Logic: Transforms raw transactional data into analytics-ready formats

This asset serves as a foundation for both internal analytics reporting and as a dependency for the ML team's feature engineering pipeline.
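Run against a tiny hand-made sample (the values below are made up for illustration), the same aggregate-flatten-merge steps look like this:

```python
import pandas as pd

# Tiny illustrative sample (values are assumptions for this sketch)
order_data = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_id": [101, 102, 103],
    "total_amount": [20.0, 30.0, 50.0],
    "order_date": pd.to_datetime(["2023-06-01", "2023-06-15", "2023-07-01"]),
})
customer_data = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Alice", "Bob"],
    "tier": ["premium", "basic"],
})

# Same steps as customer_order_summary: aggregate, flatten columns, merge
summary = (
    order_data.groupby("customer_id")
    .agg({"order_id": "count", "total_amount": ["sum", "mean"], "order_date": ["min", "max"]})
    .round(2)
)
summary.columns = ["total_orders", "total_spent", "avg_order_value", "first_order", "last_order"]
summary = summary.reset_index().merge(
    customer_data[["customer_id", "name", "tier"]], on="customer_id", how="left"
)
```

Alice's two orders collapse into one row with total_orders of 2 and total_spent of 50.0, showing how transactional rows become one analytics-ready row per customer.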

Cross-Team Asset Exposure

The Analytics repository produces assets that are designed to be consumed by other teams:

analytics_models.py - Product Performance Asset
```python
import dagster as dg
import pandas as pd


@dg.asset(group_name="analytics")
def product_performance(order_data: pd.DataFrame, product_catalog: pd.DataFrame) -> pd.DataFrame:
    """Product performance metrics for the analytics dashboard."""
    # Calculate per-product sales metrics
    product_stats = (
        order_data.groupby("product")
        .agg({"order_id": "count", "quantity": "sum", "total_amount": "sum"})
        .reset_index()
    )

    product_stats.columns = ["product", "orders", "units_sold", "revenue"]

    # Add product catalog information
    performance = product_stats.merge(product_catalog, on="product", how="left")

    # Calculate profit and margin
    performance["total_cost"] = performance["units_sold"] * performance["cost"]
    performance["profit"] = performance["revenue"] - performance["total_cost"]
    performance["profit_margin"] = (performance["profit"] / performance["revenue"] * 100).round(2)

    return performance
```

The product_performance asset creates comprehensive product metrics that include:

  • Sales Metrics: Orders, units sold, and revenue
  • Profitability Analysis: Cost calculations and profit margins
  • Catalog Integration: Links sales data with product catalog information

This asset is specifically designed to be consumed by the ML team for product recommendation and demand forecasting models.
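The profitability arithmetic itself is simple; with assumed sample figures for a single product (the numbers below are illustrative, not from the dataset):

```python
# Assumed sample figures for one product
units_sold = 30
revenue = 600.0   # total_amount summed over that product's orders
unit_cost = 12.0  # unit cost from the product catalog

# Same arithmetic as the product_performance columns
total_cost = units_sold * unit_cost               # 360.0
profit = revenue - total_cost                     # 240.0
profit_margin = round(profit / revenue * 100, 2)  # 40.0
```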

Asset Organization and Groups

The Analytics repository uses logical asset groups to organize functionality:

  • raw_data group: Source data from external systems
  • analytics group: Transformed business metrics and reports

This grouping strategy makes it easy for both the Analytics team and downstream consumers to understand the purpose and maturity level of different assets. The clear separation between raw and processed data also helps establish data governance boundaries.

Shared Resource Configuration

All assets in the Analytics repository use the shared FilesystemIOManager configured in the definitions file:

definitions.py - Shared I/O Manager
```python
# Add shared I/O manager for cross-repository access
resources={
    **defs_from_folder.resources,
    "io_manager": dg.FilesystemIOManager(base_dir="~/Documents/dagster_shared_assets"),
},
```

This shared storage configuration ensures that assets materialized by the Analytics team can be accessed by other code locations, particularly the ML team that depends on analytics outputs for their feature engineering pipeline.
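Conceptually, the filesystem I/O manager persists each materialized value to a file under base_dir keyed by the asset name, and a downstream code location reads the same file back. A simplified stdlib sketch of that handoff (the flat one-file-per-key layout is an approximation for illustration, not the manager's exact implementation):

```python
import os
import pickle
import tempfile

base_dir = tempfile.mkdtemp()  # stand-in for ~/Documents/dagster_shared_assets


def store_asset(asset_key: str, value) -> None:
    # Writer side (Analytics repo): persist the materialized value to shared storage
    with open(os.path.join(base_dir, asset_key), "wb") as f:
        pickle.dump(value, f)


def load_asset(asset_key: str):
    # Reader side (ML repo): load the same value from shared storage
    with open(os.path.join(base_dir, asset_key), "rb") as f:
        return pickle.load(f)


store_asset("customer_order_summary", {"total_orders": 3, "total_spent": 120.0})
restored = load_asset("customer_order_summary")
```

Because both code locations point at the same base_dir, the ML team's assets can declare dependencies on Analytics assets and have their outputs loaded transparently.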

Next steps

Now that we understand the Analytics repository's data foundation layer, we can explore how the ML Platform team builds sophisticated machine learning workflows on top of this analytics data.