Multi-Repository Code Locations

In this tutorial, you'll build a multi-repository data platform with Dagster+ that:

  • Separates Analytics and ML teams into independent repositories
  • Enables cross-repository asset dependencies and data sharing
  • Implements shared resource configurations for seamless data flow
  • Demonstrates independent deployment cycles for different teams
  • Shows how to coordinate production deployments across repositories

You will learn to:

  • Set up multiple code locations with independent repositories
  • Configure shared storage for cross-repository asset access
  • Declare and manage cross-repository asset dependencies
  • Organize teams with different development and deployment schedules
  • Deploy multiple code locations to Dagster+ with proper coordination
  • Monitor and maintain cross-repository data pipelines

Prerequisites

To follow the steps in this tutorial, you'll need:

  • Python 3.9+ installed. For more information, see the Installation guide.
  • A Dagster+ account for deployment examples.
  • Familiarity with Python, data pipelines, and basic machine learning concepts.
  • Understanding of Git workflows and repository management.

Architecture Overview

This example demonstrates a realistic multi-team scenario with two separate repositories:

  • Analytics Team Repository (repo-analytics/): Handles data ingestion, transformation, and business reporting
  • ML Platform Team Repository (repo-ml/): Manages feature engineering, model training, and predictions

Despite being in separate repositories, assets in one code location can depend on assets from another code location, enabling cross-team collaboration while maintaining clear organizational boundaries.

Multi-Repository Structure

Each repository is structured as an independent Dagster project with its own configuration:

workspace.yaml
load_from:
  - python_package:
      package_name: analytics.definitions
      location_name: analytics-team
      working_directory: ./repo-analytics
  - python_package:
      package_name: ml_platform.definitions
      location_name: ml-platform
      working_directory: ./repo-ml

The workspace configuration defines two separate code locations, each pointing to a different Python package and working directory. This allows both repositories to be loaded simultaneously in a single Dagster instance while maintaining clear separation.

Cross-Repository Dependencies

The example demonstrates how assets in the ML repository can depend on assets from the Analytics repository:

  • customer_features depends on customer_order_summary (from analytics)
  • product_features depends on product_performance (from analytics)

These dependencies are handled through explicit asset key references and shared storage, enabling cross-team data collaboration while maintaining repository independence.

Shared Resource Configuration

Both repositories use a shared I/O manager configuration to enable cross-repository asset dependencies:

repo-analytics/src/analytics/definitions.py
        # Add shared I/O manager for cross-repository access
        resources={
            **defs_from_folder.resources,
            "io_manager": dg.FilesystemIOManager(base_dir="~/Documents/dagster_shared_assets"),
        },

Because both repositories point FilesystemIOManager at the same base directory, assets materialized in one repository can be loaded by assets in the other. In production, this would typically be replaced with an I/O manager backed by cloud storage such as S3 or GCS.
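The idea behind the shared base directory can be modeled with the standard library alone. The sketch below is not Dagster's implementation. It is a simplified illustration, with hypothetical helper names store_asset and load_asset, of how a writer and a reader that agree on a base directory and an asset key resolve to the same file.

```python
import pickle
import tempfile
from pathlib import Path


def store_asset(base_dir: Path, asset_key: str, value) -> Path:
    """Pickle a value to a path derived from the base directory and asset key."""
    path = base_dir / asset_key
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(value, f)
    return path


def load_asset(base_dir: Path, asset_key: str):
    """Load a previously stored value using the same base directory and key."""
    with (base_dir / asset_key).open("rb") as f:
        return pickle.load(f)


# A temporary directory stands in for the shared location.
shared_dir = Path(tempfile.mkdtemp())

# "Analytics" side: write the upstream asset.
store_asset(shared_dir, "customer_order_summary", [{"customer_id": 1, "order_count": 4}])

# "ML" side: read it back knowing only the base directory and the asset key.
summary = load_asset(shared_dir, "customer_order_summary")
```

If the two sides were configured with different base directories, the reader would look for a file that was never written, which is why the I/O manager configuration has to match in both repositories.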

Next steps