
Deploy production-grade Metaflow cloud infrastructure on AWS

Metaflow Terraform module

Terraform module that provisions AWS resources to run Metaflow in production.

This module consists of submodules that can be used separately as well:

  • an AWS Batch cluster to run Metaflow steps (`metaflow-computation`)
  • blob storage and a metadata database (`metaflow-datastore`)
  • a service providing an API to record and query past executions (`metaflow-metadata-service`)
  • resources to deploy Metaflow flows on AWS Step Functions (`metaflow-step-functions`)
  • the Metaflow UI (`metaflow-ui`)

[modules diagram]

You can use either this high-level module or the submodules individually. See each submodule's README.md for details.
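For instance, a single submodule can be consumed on its own via Terraform's double-slash sub-path syntax. The sketch below is illustrative only: the `//modules/datastore` addressing is standard registry syntax, but the submodule's actual input variables are documented in its own README.

```hcl
# Hypothetical sketch: pulling in just the datastore submodule.
# Consult modules/datastore/README.md for the real input variables.
module "metaflow_datastore" {
  source  = "outerbounds/metaflow/aws//modules/datastore"
  version = "0.3.0"

  # ... submodule-specific inputs go here ...
}
```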

Here's a minimal end-to-end example of using this module together with a VPC:

```hcl
# Random suffix for this deployment
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

locals {
  resource_prefix = "metaflow"
  resource_suffix = random_string.suffix.result
}

data "aws_availability_zones" "available" {}

# VPC infra using https://github.com/terraform-aws-modules/terraform-aws-vpc
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.13.0"

  name = "${local.resource_prefix}-${local.resource_suffix}"
  cidr = "10.10.0.0/16"

  azs             = data.aws_availability_zones.available.names
  private_subnets = ["10.10.8.0/21", "10.10.16.0/21", "10.10.24.0/21"]
  public_subnets  = ["10.10.128.0/21", "10.10.136.0/21", "10.10.144.0/21"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = false
  subnet1_id            = module.vpc.public_subnets[0]
  subnet2_id            = module.vpc.public_subnets[1]
  vpc_cidr_blocks       = module.vpc.vpc_cidr_blocks
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = true

  tags = {
    "managedBy" = "terraform"
  }
}

# Export all outputs from the metaflow module
output "metaflow" {
  value = module.metaflow
}

# The module generates a Metaflow config in JSON format; write it to a file
resource "local_file" "metaflow_config" {
  content  = module.metaflow.metaflow_profile_json
  filename = "./metaflow_profile.json"
}
```
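Once `terraform apply` has written `metaflow_profile.json`, the file can be installed as a named Metaflow profile. A minimal sketch follows — the profile contents here are made-up sample values (real ones come from the module's outputs), and a temporary directory stands in for `$HOME`:

```python
import json
import os
import tempfile

# Made-up sample of what metaflow_profile_json might contain;
# the real values come from `terraform output`.
sample_profile = {
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://metaflow-abc123/metaflow",
    "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:us-west-2:123456789012:job-queue/metaflow",
}

# Metaflow looks up named profiles in ~/.metaflowconfig as
# config_<profile>.json, selected via METAFLOW_PROFILE=<profile>.
home = tempfile.mkdtemp()  # stand-in for the real $HOME
config_dir = os.path.join(home, ".metaflowconfig")
os.makedirs(config_dir, exist_ok=True)

path = os.path.join(config_dir, "config_prod.json")
with open(path, "w") as f:
    json.dump(sample_profile, f, indent=2)

# Then: export METAFLOW_PROFILE=prod (with config_dir under the real $HOME)
print(path)
```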

Note: You can find a more complete example in this repo; it uses this module and also sets up SageMaker notebooks and other non-Metaflow-specific parts of the infrastructure.

Modules

| Name | Source | Version |
|------|--------|---------|
| metaflow-common | ./modules/common | n/a |
| metaflow-computation | ./modules/computation | n/a |
| metaflow-datastore | ./modules/datastore | n/a |
| metaflow-metadata-service | ./modules/metadata-service | n/a |
| metaflow-step-functions | ./modules/step-functions | n/a |
| metaflow-ui | ./modules/ui | n/a |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| access_list_cidr_blocks | List of CIDRs we want to grant access to our Metaflow Metadata Service. Usually this is our VPN's CIDR blocks. | `list(string)` | `[]` | no |
| batch_type | AWS Batch compute type (`'ec2'` or `'fargate'`) | `string` | `"ec2"` | no |
| compute_environment_desired_vcpus | Desired starting vCPUs [0-16] for the EC2 Batch Compute Environment (ignored for Fargate) | `number` | `8` | no |
| compute_environment_egress_cidr_blocks | CIDR blocks to which egress is allowed from the Batch Compute Environment's security group | `list(string)` | `["0.0.0.0/0"]` | no |
| compute_environment_instance_types | The instance types for the compute environment | `list(string)` | `["c4.large", "c4.xlarge", "c4.2xlarge", "c4.4xlarge", "c4.8xlarge"]` | no |
| compute_environment_max_vcpus | Maximum vCPUs [16-96] for the Batch Compute Environment | `number` | `64` | no |
| compute_environment_min_vcpus | Minimum vCPUs [0-16] for the EC2 Batch Compute Environment (ignored for Fargate) | `number` | `8` | no |
| db_engine_version | Version of the RDS database engine | `string` | `"11"` | no |
| db_instance_type | RDS instance type to launch for the PostgreSQL database | `string` | `"db.t2.small"` | no |
| db_migrate_lambda_zip_file | Output path for the zip file containing the DB migrate lambda | `string` | `null` | no |
| enable_custom_batch_container_registry | Provisions infrastructure for a custom Amazon ECR container registry if enabled | `bool` | `false` | no |
| enable_key_rotation | Enable key rotation for KMS keys | `bool` | `false` | no |
| enable_step_functions | Provisions infrastructure for Step Functions if enabled | `bool` | n/a | yes |
| extra_ui_backend_env_vars | Additional environment variables for the UI backend container | `map(string)` | `{}` | no |
| extra_ui_static_env_vars | Additional environment variables for the UI static app | `map(string)` | `{}` | no |
| force_destroy_s3_bucket | Empty the S3 bucket before destroying it via `terraform destroy` | `bool` | `false` | no |
| iam_partition | IAM partition (select `aws-us-gov` for AWS GovCloud, otherwise leave as is) | `string` | `"aws"` | no |
| launch_template_http_endpoint | Whether the instance metadata service is available. Can be `'enabled'` or `'disabled'` | `string` | `"enabled"` | no |
| launch_template_http_put_response_hop_limit | The desired HTTP PUT response hop limit for instance metadata requests. Can be an integer from 1 to 64 | `number` | `2` | no |
| launch_template_http_tokens | Whether the instance metadata service requires session tokens, also referred to as Instance Metadata Service Version 2 (IMDSv2). Can be `'optional'` or `'required'` | `string` | `"optional"` | no |
| metadata_service_container_image | Container image for the metadata service | `string` | `""` | no |
| metadata_service_enable_api_basic_auth | Enable basic auth for API Gateway? (requires key export) | `bool` | `true` | no |
| metadata_service_enable_api_gateway | Enable API Gateway for a public metadata service endpoint | `bool` | `true` | no |
| resource_prefix | String prefix for all resources | `string` | `"metaflow"` | no |
| resource_suffix | String suffix for all resources | `string` | `""` | no |
| subnet1_id | First subnet used for availability zone redundancy | `string` | n/a | yes |
| subnet2_id | Second subnet used for availability zone redundancy | `string` | n/a | yes |
| tags | AWS tags | `map(string)` | n/a | yes |
| ui_alb_internal | Defines whether the ALB for the UI is internal | `bool` | `false` | no |
| ui_allow_list | List of CIDRs we want to grant access to our Metaflow UI service. Usually this is our VPN's CIDR blocks. | `list(string)` | `[]` | no |
| ui_certificate_arn | SSL certificate ARN for the UI. If set to the empty string, the UI is disabled. | `string` | `""` | no |
| ui_static_container_image | Container image for the UI frontend app | `string` | `""` | no |
| vpc_cidr_blocks | The VPC CIDR blocks that we'll access-list on our Metadata Service API to allow all internal communications | `list(string)` | n/a | yes |
| vpc_id | The ID of the single VPC we stood up for all Metaflow resources to exist in | `string` | n/a | yes |
| with_public_ip | Enable public IP assignment for the Metadata Service. If the subnets specified for subnet1_id and subnet2_id are public subnets, you will NEED to set this to `true` to allow pulling container images from public registries. Otherwise this should be set to `false`. | `bool` | n/a | yes |
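As an illustration of the inputs above, a hypothetical configuration that tunes the Batch compute defaults might look like the following. The subnet and VPC IDs are placeholders, and all argument names come from the table above:

```hcl
module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = "metaflow"
  resource_suffix = "prod"

  # Placeholder network IDs -- substitute your own.
  enable_step_functions = true
  subnet1_id            = "subnet-aaaa1111"
  subnet2_id            = "subnet-bbbb2222"
  vpc_cidr_blocks       = ["10.20.0.0/16"]
  vpc_id                = "vpc-cccc3333"
  with_public_ip        = false

  # Batch compute tuning, using the documented inputs.
  batch_type                         = "ec2"
  compute_environment_min_vcpus      = 0
  compute_environment_desired_vcpus  = 0
  compute_environment_max_vcpus      = 96
  compute_environment_instance_types = ["m5.xlarge", "m5.2xlarge"]

  enable_key_rotation = true

  tags = {
    "managedBy" = "terraform"
  }
}
```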

Outputs

| Name | Description |
|------|-------------|
| METAFLOW_BATCH_JOB_QUEUE | AWS Batch Job Queue ARN for Metaflow |
| METAFLOW_DATASTORE_SYSROOT_S3 | Amazon S3 URL for the Metaflow DataStore |
| METAFLOW_DATATOOLS_S3ROOT | Amazon S3 URL for Metaflow DataTools |
| METAFLOW_ECS_S3_ACCESS_IAM_ROLE | Role for AWS Batch to access Amazon S3 |
| METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE | IAM role for Amazon EventBridge to access AWS Step Functions |
| METAFLOW_SERVICE_INTERNAL_URL | URL for the Metadata Service (accessible within the VPC) |
| METAFLOW_SERVICE_URL | URL for the Metadata Service (accessible over the public internet) |
| METAFLOW_SFN_DYNAMO_DB_TABLE | AWS DynamoDB table name for tracking AWS Step Functions execution metadata |
| METAFLOW_SFN_IAM_ROLE | IAM role for AWS Step Functions to access AWS resources (AWS Batch, AWS DynamoDB) |
| api_gateway_rest_api_id_key_id | API Gateway key ID for the Metadata Service; fetch the key itself from the AWS Console [METAFLOW_SERVICE_AUTH_KEY] |
| batch_compute_environment_security_group_id | The ID of the security group attached to the Batch Compute Environment |
| datastore_s3_bucket_kms_key_arn | The ARN of the KMS key used to encrypt the Metaflow datastore S3 bucket |
| metadata_svc_ecs_task_role_arn | ARN of the IAM task role assumed by the metadata service's ECS task |
| metaflow_api_gateway_rest_api_id | The ID of the API Gateway REST API we'll use to accept Metadata Service requests and forward them to the Fargate API instance |
| metaflow_batch_container_image | The ECR repo containing the Metaflow Batch image |
| metaflow_profile_json | Metaflow profile JSON object that can be used to communicate with this Metaflow stack. Store this in `~/.metaflowconfig/config_[stack-name].json` and select it with `export METAFLOW_PROFILE=[stack-name]`. |
| metaflow_s3_bucket_arn | The ARN of the bucket we'll be using as blob storage |
| metaflow_s3_bucket_name | The name of the bucket we'll be using as blob storage |
| migration_function_arn | ARN of the DB migration function |
| ui_alb_arn | UI ALB ARN |
| ui_alb_dns_name | UI ALB DNS name |