Restrict AWS Sagemaker Instance Type Selection to Orchestrator Configuration

Open strickvl opened this issue 1 year ago • 9 comments

Open Source Contributors Welcomed!

Please comment below if you would like to work on this issue!

What happened?

The configuration of the instance_type for AWS Sagemaker Orchestrator is currently determined by the developer/data scientist/ML engineer at the time of running the pipeline via the SagemakerOrchestratorSettings in code. This setup does not allow a DevOps Engineer or ML Engineer with an admin role to control or restrict the choice of instance types. This could lead to potential misuse, such as selecting excessively high-resource instances for trivial tasks or intentionally creating resource-intensive loops.

Task Description

Move the instance_type attribute from the SagemakerOrchestratorSettings in the code to the SagemakerOrchestrator config, which is set up during the component registration. This change will allow better control and governance over the resources used for running pipelines in AWS Sagemaker.

Expected Outcome

  • The instance_type should be configurable at the component registration level by an admin or a DevOps engineer.
  • Developers or data scientists should not be able to override the instance_type at the pipeline execution level.
  • The change should ensure better resource management and prevent potential misuse of AWS resources.

Steps to Implement

  • Update the SagemakerOrchestrator configuration to include the instance_type attribute.
  • Remove the instance_type option from the SagemakerOrchestratorSettings.
  • Ensure that the orchestrator respects the instance_type set during the component registration and does not allow overrides at runtime.
  • Update the documentation to reflect these changes.
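
A minimal sketch of the split these steps describe, using plain dataclasses rather than ZenML's actual classes. The names `OrchestratorConfig`, `OrchestratorSettings`, `resolve_instance_type`, the `max_runtime_in_seconds` field and the default instance type are all hypothetical stand-ins chosen for illustration; the real component classes may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrchestratorConfig:
    """Hypothetical stand-in for the registration-time component config."""

    # Set once by an admin when the component is registered.
    instance_type: str = "ml.m5.large"  # illustrative default, an assumption


@dataclass(frozen=True)
class OrchestratorSettings:
    """Hypothetical stand-in for the per-run settings.

    Deliberately has no instance_type field, so a pipeline author
    has nothing to override.
    """

    max_runtime_in_seconds: int = 3600  # illustrative field, an assumption


def resolve_instance_type(
    config: OrchestratorConfig, settings: OrchestratorSettings
) -> str:
    """Return the instance type to launch; only the config is consulted."""
    return config.instance_type
```

With this shape, passing `instance_type` to the settings object fails at construction time, which is one way to make the "no runtime override" rule hard to bypass by accident.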

Additional Context

This change is prompted by the need to enhance governance and control over resource utilization in cloud environments, particularly in team settings where multiple individuals have access to deploy pipelines.

strickvl · Jan 03 '24 11:01

Is there room for discussion regarding this idea?

We currently benefit greatly from being able to provide instance_type via SagemakerOrchestratorSettings. Internally, we've defined a procedure where any AWS-based training run is discussed first, and at least four eyes must approve it before it can actually run in the cloud. Any other run is done with local resources. We use a variety of instance types. In fact, if I'm not mistaken, this change would require us to define multiple stacks with multiple SageMaker Orchestrator components in order to use multiple instance types, which is quite cumbersome for us.

I do, however, understand the rationale for this issue very well, especially for larger organizations, but could there be some middle ground? For example, something like this:

  • Use the SagemakerOrchestrator config to define a default instance_type.
  • Also add an other_instance_type_allowed: bool or similarly named option to the same configuration, which allows DevOps engineers / admins to decide whether people running ZenML pipelines can manually provide instance_type in SagemakerOrchestratorSettings.
  • If it's allowed, everything will keep working as is: use the default instance_type if none is provided; use the pipeline-specified one if another one is provided.
  • If it's disallowed, raise an error in case someone attempts to configure a disallowed instance_type (i.e. the case where SagemakerOrchestratorSettings.instance_type != SagemakerOrchestratorConfig.instance_type).
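
To make the proposal concrete, here is a rough sketch of the resolution logic in plain Python. It does not use ZenML's real classes: `OrchestratorConfig`, `OrchestratorSettings` and `resolve_instance_type` are hypothetical names, and `other_instance_type_allowed` is just the placeholder option name from the bullets above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrchestratorConfig:
    """Hypothetical registration-time config, controlled by an admin."""

    instance_type: str  # the default instance type
    other_instance_type_allowed: bool = False  # placeholder option name


@dataclass(frozen=True)
class OrchestratorSettings:
    """Hypothetical per-run settings supplied by the pipeline author."""

    instance_type: Optional[str] = None


def resolve_instance_type(
    config: OrchestratorConfig, settings: OrchestratorSettings
) -> str:
    """Pick the instance type for a run, enforcing the admin's choice."""
    requested = settings.instance_type
    # Nothing requested, or the request matches the default: use the default.
    if requested is None or requested == config.instance_type:
        return config.instance_type
    # A different type was requested: honour it only if overrides are allowed.
    if config.other_instance_type_allowed:
        return requested
    raise ValueError(
        f"instance_type {requested!r} is not allowed; this orchestrator "
        f"is restricted to {config.instance_type!r}"
    )
```

Defaulting the flag to `False` here is a choice made for the sketch; defaulting it to `True` would instead keep today's behaviour for existing stacks.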

WDYT?

christianversloot · Jan 03 '24 14:01

I think I like the suggestion! It's a nice middle ground between flexibility + control over resource usage. It would be a new approach we haven't taken so far in how we allow components to be configured, and I'd be interested in @schustmi's thoughts on the approach particularly in the light of RBAC / permissions work he's been doing recently. I'm wondering if we should / should not consider this scenario with that in mind?

strickvl · Jan 03 '24 15:01

I also like the suggestion, seems like a good compromise 👍

RBAC will control which users have permissions to update the stack component configuration but will not affect the Settings right now, so nothing important to consider there.

schustmi · Jan 05 '24 08:01

I would like to work on this issue, if possible

AryaMoghaddam · Mar 27 '24 22:03

@strickvl Sorry, I picked up the other 2 issues and won't have time for this one.

AryaMoghaddam · Apr 03 '24 15:04