terraform-aws-metaflow
Deploy production-grade Metaflow cloud infrastructure on AWS
Metaflow Terraform module
Terraform module that provisions AWS resources to run Metaflow in production.
This module consists of submodules that can be used separately as well:
- AWS Batch cluster to run Metaflow steps (`metaflow-computation`)
- blob storage and metadata database (`metaflow-datastore`)
- a service providing an API to record and query past executions (`metaflow-metadata-service`)
- resources to deploy Metaflow flows on Step Functions (`metaflow-step-functions`)
- Metaflow UI (`metaflow-ui`)
You can use either this high-level module or the submodules individually. See each submodule's README.md for more details.
Here's a minimal end-to-end example of using this module with a VPC:

    # Random suffix for this deployment
    resource "random_string" "suffix" {
      length  = 8
      special = false
      upper   = false
    }

    locals {
      resource_prefix = "metaflow"
      resource_suffix = random_string.suffix.result
    }

    data "aws_availability_zones" "available" {
    }

    # VPC infra using https://github.com/terraform-aws-modules/terraform-aws-vpc
    module "vpc" {
      source  = "terraform-aws-modules/vpc/aws"
      version = "3.13.0"

      name = "${local.resource_prefix}-${local.resource_suffix}"
      cidr = "10.10.0.0/16"

      azs             = data.aws_availability_zones.available.names
      private_subnets = ["10.10.8.0/21", "10.10.16.0/21", "10.10.24.0/21"]
      public_subnets  = ["10.10.128.0/21", "10.10.136.0/21", "10.10.144.0/21"]

      enable_nat_gateway   = true
      single_nat_gateway   = true
      enable_dns_hostnames = true
    }

    module "metaflow" {
      source  = "outerbounds/metaflow/aws"
      version = "0.3.0"

      resource_prefix = local.resource_prefix
      resource_suffix = local.resource_suffix

      enable_step_functions = false
      subnet1_id            = module.vpc.public_subnets[0]
      subnet2_id            = module.vpc.public_subnets[1]
      # The VPC module exposes the primary CIDR as a single string
      # (vpc_cidr_block), so wrap it in a list.
      vpc_cidr_blocks       = [module.vpc.vpc_cidr_block]
      vpc_id                = module.vpc.vpc_id
      with_public_ip        = true

      tags = {
        "managedBy" = "terraform"
      }
    }

    # Export all outputs from the metaflow module
    output "metaflow" {
      value = module.metaflow
    }

    # The module generates a Metaflow config in JSON format; write it to a file
    resource "local_file" "metaflow_config" {
      content  = module.metaflow.metaflow_profile_json
      filename = "./metaflow_profile.json"
    }

Note: You can find a more complete example that uses this module but also sets up SageMaker notebooks and other non-Metaflow-specific parts of the infrastructure in this repo.
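The minimal example uses `random_string` and `local_file` in addition to AWS resources, so the root configuration needs the `random` and `local` providers as well as `aws`. The following is a sketch of the provider setup it relies on. The region is a placeholder assumption, and version constraints are deliberately left out because the source does not state which provider versions the module supports.

```hcl
# Provider requirements for the minimal example (a sketch, not the
# module's official requirements).
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    random = {
      source = "hashicorp/random" # provides random_string
    }
    local = {
      source = "hashicorp/local" # provides local_file
    }
  }
}

# Assumption: "us-west-2" is only a placeholder; use the region you deploy to.
provider "aws" {
  region = "us-west-2"
}
```

With this in place, `terraform init` downloads the providers and both modules, after which `terraform apply` creates the stack.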
Modules
Name | Source | Version |
---|---|---|
metaflow-common | ./modules/common | n/a |
metaflow-computation | ./modules/computation | n/a |
metaflow-datastore | ./modules/datastore | n/a |
metaflow-metadata-service | ./modules/metadata-service | n/a |
metaflow-step-functions | ./modules/step-functions | n/a |
metaflow-ui | ./modules/ui | n/a |
Inputs
Name | Description | Type | Default | Required |
---|---|---|---|---|
access_list_cidr_blocks | List of CIDRs we want to grant access to our Metaflow Metadata Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
batch_type | AWS Batch Compute Type ('ec2', 'fargate') | string | "ec2" | no |
compute_environment_desired_vcpus | Desired Starting VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | number | 8 | no |
compute_environment_egress_cidr_blocks | CIDR blocks to which egress is allowed from the Batch Compute environment's security group | list(string) | [ | no |
compute_environment_instance_types | The instance types for the compute environment | list(string) | [ | no |
compute_environment_max_vcpus | Maximum VCPUs for Batch Compute Environment [16-96] | number | 64 | no |
compute_environment_min_vcpus | Minimum VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | number | 8 | no |
db_engine_version | n/a | string | "11" | no |
db_instance_type | RDS instance type to launch for PostgreSQL database. | string | "db.t2.small" | no |
db_migrate_lambda_zip_file | Output path for the zip file containing the DB migrate lambda | string | null | no |
enable_custom_batch_container_registry | Provisions infrastructure for custom Amazon ECR container registry if enabled | bool | false | no |
enable_key_rotation | Enable key rotation for KMS keys | bool | false | no |
enable_step_functions | Provisions infrastructure for step functions if enabled | bool | n/a | yes |
extra_ui_backend_env_vars | Additional environment variables for UI backend container | map(string) | {} | no |
extra_ui_static_env_vars | Additional environment variables for UI static app | map(string) | {} | no |
force_destroy_s3_bucket | Empty S3 bucket before destroying via terraform destroy | bool | false | no |
iam_partition | IAM Partition (Select aws-us-gov for AWS GovCloud, otherwise leave as is) | string | "aws" | no |
launch_template_http_endpoint | Whether the instance metadata service is available. Can be 'enabled' or 'disabled' | string | "enabled" | no |
launch_template_http_put_response_hop_limit | The desired HTTP PUT response hop limit for instance metadata requests. Can be an integer from 1 to 64 | number | 2 | no |
launch_template_http_tokens | Whether or not the metadata service requires session tokens, also referred to as Instance Metadata Service Version 2 (IMDSv2). Can be 'optional' or 'required' | string | "optional" | no |
metadata_service_container_image | Container image for metadata service | string | "" | no |
metadata_service_enable_api_basic_auth | Enable basic auth for API Gateway? (requires key export) | bool | true | no |
metadata_service_enable_api_gateway | Enable API Gateway for public metadata service endpoint | bool | true | no |
resource_prefix | String prefix for all resources | string | "metaflow" | no |
resource_suffix | String suffix for all resources | string | "" | no |
subnet1_id | First subnet used for availability zone redundancy | string | n/a | yes |
subnet2_id | Second subnet used for availability zone redundancy | string | n/a | yes |
tags | AWS tags | map(string) | n/a | yes |
ui_alb_internal | Defines whether the ALB for the UI is internal | bool | false | no |
ui_allow_list | List of CIDRs we want to grant access to our Metaflow UI Service. Usually this is our VPN's CIDR blocks. | list(string) | [] | no |
ui_certificate_arn | SSL certificate for UI. If set to empty string, UI is disabled. | string | "" | no |
ui_static_container_image | Container image for the UI frontend app | string | "" | no |
vpc_cidr_blocks | The VPC CIDR blocks to add to the access list of our Metadata Service API so that all internal communication is allowed | list(string) | n/a | yes |
vpc_id | The id of the single VPC we stood up for all Metaflow resources to exist in. | string | n/a | yes |
with_public_ip | Enable public IP assignment for the Metadata Service. If the subnets specified for subnet1_id and subnet2_id are public subnets, you will NEED to set this to true to allow pulling container images from public registries. Otherwise this should be set to false. | bool | n/a | yes |
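As an illustration of how the inputs above combine, here is a sketch of a more locked-down variant of the `module "metaflow"` block from the minimal example. It is meant to replace that block, not sit alongside it, and it assumes the `vpc` module and `locals` defined there. The access-list CIDR is a hypothetical placeholder taken from a documentation-only address range, and the particular combination of settings is an illustrative choice, not a recommendation from the module authors.

```hcl
module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = true

  # Private subnets: per the with_public_ip description, public IP
  # assignment should then be off. Image pulls go through the NAT gateway
  # that the example VPC enables.
  subnet1_id     = module.vpc.private_subnets[0]
  subnet2_id     = module.vpc.private_subnets[1]
  with_public_ip = false

  # Same CIDR as the `cidr` argument of the example VPC.
  vpc_cidr_blocks = ["10.10.0.0/16"]
  vpc_id          = module.vpc.vpc_id

  # Hypothetical VPN range (203.0.113.0/24 is reserved for documentation).
  access_list_cidr_blocks = ["203.0.113.0/24"]

  # Require IMDSv2 session tokens on Batch instances.
  launch_template_http_tokens = "required"

  # Rotate the KMS keys created by the module.
  enable_key_rotation = true

  # Let the EC2 compute environment scale down to zero when idle.
  compute_environment_min_vcpus     = 0
  compute_environment_desired_vcpus = 0

  tags = {
    "managedBy" = "terraform"
  }
}
```

Every argument not listed here keeps the default shown in the table.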
Outputs
Name | Description |
---|---|
METAFLOW_BATCH_JOB_QUEUE | AWS Batch Job Queue ARN for Metaflow |
METAFLOW_DATASTORE_SYSROOT_S3 | Amazon S3 URL for Metaflow DataStore |
METAFLOW_DATATOOLS_S3ROOT | Amazon S3 URL for Metaflow DataTools |
METAFLOW_ECS_S3_ACCESS_IAM_ROLE | Role for AWS Batch to Access Amazon S3 |
METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE | IAM role for Amazon EventBridge to access AWS Step Functions. |
METAFLOW_SERVICE_INTERNAL_URL | URL for Metadata Service (Accessible in VPC) |
METAFLOW_SERVICE_URL | URL for Metadata Service (public endpoint through API Gateway, if enabled) |
METAFLOW_SFN_DYNAMO_DB_TABLE | AWS DynamoDB table name for tracking AWS Step Functions execution metadata. |
METAFLOW_SFN_IAM_ROLE | IAM role for AWS Step Functions to access AWS resources (AWS Batch, AWS DynamoDB). |
api_gateway_rest_api_id_key_id | API Gateway Key ID for Metadata Service. Fetch Key from AWS Console [METAFLOW_SERVICE_AUTH_KEY] |
batch_compute_environment_security_group_id | The ID of the security group attached to the Batch Compute environment. |
datastore_s3_bucket_kms_key_arn | The ARN of the KMS key used to encrypt the Metaflow datastore S3 bucket |
metadata_svc_ecs_task_role_arn | n/a |
metaflow_api_gateway_rest_api_id | The ID of the API Gateway REST API we'll use to accept Metadata Service requests to forward to the Fargate API instance |
metaflow_batch_container_image | The ECR repo containing the metaflow batch image |
metaflow_profile_json | Metaflow profile JSON object that can be used to communicate with this Metaflow Stack. Store this in ~/.metaflowconfig/config_[stack-name].json and select with $ export METAFLOW_PROFILE=[stack-name]. |
metaflow_s3_bucket_arn | The ARN of the bucket we'll be using as blob storage |
metaflow_s3_bucket_name | The name of the bucket we'll be using as blob storage |
migration_function_arn | ARN of DB Migration Function |
ui_alb_arn | UI ALB ARN |
ui_alb_dns_name | UI ALB DNS name |
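The outputs above can be consumed from the root configuration like any other module outputs. The sketch below writes `metaflow_profile_json` straight into Metaflow's default configuration directory (`~/.metaflowconfig`, where a profile is stored as `config_<profile>.json`) and re-exports two individual values. The profile name `my-stack`, the resource name, and the output names are hypothetical, and the snippet assumes a `module "metaflow"` block like the one in the minimal example.

```hcl
locals {
  # Hypothetical profile name; pick any name you like.
  metaflow_profile_name = "my-stack"
}

# Write the generated profile where the Metaflow client looks for it.
resource "local_file" "metaflow_profile" {
  content  = module.metaflow.metaflow_profile_json
  filename = pathexpand("~/.metaflowconfig/config_${local.metaflow_profile_name}.json")
}

# Expose selected values instead of the whole module object.
output "metaflow_s3_bucket_name" {
  value = module.metaflow.metaflow_s3_bucket_name
}

output "metaflow_datastore_sysroot_s3" {
  value = module.metaflow.METAFLOW_DATASTORE_SYSROOT_S3
}
```

After `terraform apply`, running `export METAFLOW_PROFILE=my-stack` in your shell makes the Metaflow client use this stack.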