zenml
Restrict AWS Sagemaker Instance Type Selection to Orchestrator Configuration
Open Source Contributors Welcomed!
Please comment below if you would like to work on this issue!
Contact Details [Optional]
What happened?
The configuration of the instance_type for AWS Sagemaker Orchestrator is currently determined by the developer/data scientist/ML engineer at the time of running the pipeline via the SagemakerOrchestratorSettings in code. This setup does not allow a DevOps Engineer or ML Engineer with an admin role to control or restrict the choice of instance types. This could lead to potential misuse, such as selecting excessively high-resource instances for trivial tasks or intentionally creating resource-intensive loops.
Task Description
Move the instance_type attribute from the SagemakerOrchestratorSettings in the code to the SagemakerOrchestrator config, which is set up during the component registration. This change will allow better control and governance over the resources used for running pipelines in AWS Sagemaker.
Expected Outcome
- The `instance_type` should be configurable at the component registration level by an admin or a DevOps engineer.
- Developers or data scientists should not be able to override the `instance_type` at the pipeline execution level.
- The change should ensure better resource management and prevent potential misuse of AWS resources.
Steps to Implement
- Update the `SagemakerOrchestrator` configuration to include the `instance_type` attribute.
- Remove the `instance_type` option from the `SagemakerOrchestratorSettings`.
- Ensure that the orchestrator respects the `instance_type` set during the component registration and does not allow overrides at runtime.
- Update the documentation to reflect these changes.
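For anyone picking this up, here is a rough sketch of the intended split, not the actual ZenML code. `OrchestratorConfig`, `OrchestratorSettings`, and `resolve_instance_type` are hypothetical stand-ins written as plain dataclasses, and the field values are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrchestratorConfig:
    """Hypothetical stand-in for the registration-time orchestrator config."""

    execution_role: str
    # Set once by an admin at component registration (illustrative default).
    instance_type: str = "ml.t3.medium"


@dataclass(frozen=True)
class OrchestratorSettings:
    """Hypothetical stand-in for the per-run settings.

    There is deliberately no instance_type field, so a pipeline author
    cannot pass one at runtime.
    """

    volume_size_in_gb: int = 30


def resolve_instance_type(
    config: OrchestratorConfig, settings: OrchestratorSettings
) -> str:
    """Return the instance type to launch; only the config is consulted."""
    return config.instance_type


# Example: the run always uses whatever was registered.
config = OrchestratorConfig(
    execution_role="arn:aws:iam::123456789012:role/example",
    instance_type="ml.m5.large",
)
chosen = resolve_instance_type(config, OrchestratorSettings())
```

With this shape, trying to construct the settings with an `instance_type` argument fails immediately with a `TypeError`, which is one simple way to make the "no overrides at runtime" requirement explicit.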
Additional Context
This change is prompted by the need to enhance governance and control over resource utilization in cloud environments, particularly in team settings where multiple individuals have access to deploy pipelines.
Code of Conduct
- [ ] I agree to follow this project's Code of Conduct
Is there room for discussion regarding this idea?
We are currently benefiting greatly from the fact that it's allowed to provide instance_type via SagemakerOrchestratorSettings. Internally, we've defined a procedure where any AWS-based training run is discussed first, where at least four eyes give the approval that it can actually run in the cloud. Any other run is done with local resources. We're using a variety of instance types. In fact, if I am not incorrect, we will now need to define multiple stacks with multiple SageMaker Orchestrator components in order to use multiple instance types, which is quite cumbersome for us.
I do however understand the rationale for this issue very well especially for larger organizations, but could there be some middle ground? For example, something like this:
- Use the `SagemakerOrchestrator` config to define a default `instance_type`.
- Also add an `other_instance_type_allowed: bool` or similarly named option to the same configuration, which allows DevOps engineers / admins to decide whether people running ZenML pipelines can manually provide `instance_type` in `SagemakerOrchestratorSettings`.
- If it's allowed, everything will keep working as is: use the default `instance_type` if none is provided; use the pipeline-specified one if another one is provided.
- If it's disallowed, raise an error in case someone attempts to configure a disallowed `instance_type` (i.e. the case where `SagemakerOrchestratorSettings.instance_type != SagemakerOrchestratorConfig.instance_type`).
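To make the proposal concrete, the resolution logic could look roughly like the sketch below. This is only an illustration with hypothetical stand-in dataclasses (`OrchestratorConfig`, `OrchestratorSettings`) and a hypothetical helper `resolve_instance_type`; it does not use the real ZenML classes, and the default values are assumptions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrchestratorConfig:
    """Hypothetical stand-in for the registration-time config."""

    instance_type: str = "ml.t3.medium"  # admin-defined default (illustrative)
    other_instance_type_allowed: bool = True  # admin-controlled switch


@dataclass(frozen=True)
class OrchestratorSettings:
    """Hypothetical stand-in for the per-run settings."""

    instance_type: Optional[str] = None  # None means "use the default"


def resolve_instance_type(
    config: OrchestratorConfig, settings: OrchestratorSettings
) -> str:
    """Pick the instance type for a run under the proposed rules."""
    requested = settings.instance_type
    # Nothing requested, or the request matches the default: use the default.
    if requested is None or requested == config.instance_type:
        return config.instance_type
    # A different type was requested but overrides are disabled.
    if not config.other_instance_type_allowed:
        raise ValueError(
            f"instance_type '{requested}' is not allowed; this orchestrator "
            f"is restricted to '{config.instance_type}'."
        )
    # Overrides are enabled: honor the pipeline-specified type.
    return requested


open_config = OrchestratorConfig(other_instance_type_allowed=True)
locked_config = OrchestratorConfig(other_instance_type_allowed=False)
override = OrchestratorSettings(instance_type="ml.m5.xlarge")

allowed_choice = resolve_instance_type(open_config, override)
default_choice = resolve_instance_type(locked_config, OrchestratorSettings())
```

Calling `resolve_instance_type(locked_config, override)` would raise the `ValueError`, while explicitly requesting the default type on a locked orchestrator still passes, matching the `!=` comparison in the last bullet.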
WDYT?
I think I like the suggestion! It's a nice middle ground between flexibility + control over resource usage. It would be a new approach we haven't taken so far in how we allow components to be configured, and I'd be interested in @schustmi's thoughts on the approach particularly in the light of RBAC / permissions work he's been doing recently. I'm wondering if we should / should not consider this scenario with that in mind?
I also like the suggestion, seems like a good compromise 👍
RBAC will control which users have permissions to update the stack component configuration but will not affect the Settings right now, so nothing important to consider there.
I would like to work on this issue, if possible
@strickvl Sorry, I picked up the other 2 issues, so I won't have time for this one.