Feature request: @batch_sagemaker
We're looking to migrate from a custom Step Functions orchestration that spins up SageMaker training jobs to Metaflow, since Metaflow code is much easier to read and change.
We like SageMaker training for some of the features it provides (spot interruption handling, artifact handling, metrics, etc.), and reimplementing those ourselves would be too costly for now.
AWS now supports triggering SageMaker jobs from an AWS Batch queue. However, these jobs must be submitted through a separate SubmitServiceJob API (rather than the SubmitJob API used for Fargate/EC2 jobs). The payload that API expects is essentially the usual input to SageMaker's CreateTrainingJob.
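To make the difference from SubmitJob concrete, here is a rough sketch of what a submission might look like from Python. It assumes boto3 exposes the API as `submit_service_job` with `jobName`, `jobQueue`, `serviceJobType` and `serviceRequestPayload` parameters (please check against the current boto3 docs). The queue name, role ARN, image URI and S3 path are all placeholders, and `build_service_job_request` / `submit` are hypothetical helpers, not anything that exists in Metaflow today.

```python
import json


def build_service_job_request(job_name, job_queue, training_job):
    """Wrap a CreateTrainingJob-style dict in a SubmitServiceJob request.

    The SageMaker payload is passed as a JSON string, not a nested dict.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        # Assumed service job type for SageMaker training jobs.
        "serviceJobType": "SAGEMAKER_TRAINING",
        "serviceRequestPayload": json.dumps(training_job),
    }


def submit(request):
    """Send the request to AWS Batch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("batch").submit_service_job(**request)


# Placeholder CreateTrainingJob input; every value here is illustrative.
training_job = {
    "TrainingJobName": "metaflow-train-step",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

request = build_service_job_request(
    "metaflow-train-step", "example-sagemaker-queue", training_job
)
# submit(request)  # uncomment to actually submit the job
```

A hypothetical `@batch_sagemaker` decorator would presumably build this payload from the step's resource requirements, in the same way `@batch` builds a SubmitJob request.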
@straygar can you help us with more details on this integration? are you hoping for symmetric functionality to @batch? if so, what is the additional benefit that running the job on sagemaker offers (there is a 20-40% increase in instance cost)? if it's something else, i would love to understand the use case better.