Generate unique name for training job resources

Open dfarr opened this issue 2 years ago • 4 comments

If we create the following SageMaker resource, it silently fails when a training job named test-training-job already exists:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  generateName: test-training-job
spec:
  trainingJobName: test-training-job

It would be nice if the controller had generateName-like functionality to ensure a unique name. We create TrainingJobs through an Argo Workflow (the workflow submits the k8s manifest), and on failure we retry n times. However, we cannot update the trainingJobName field in the manifest on each retry, so every attempt after the first is guaranteed to fail because a training job with the same name already exists.

dfarr, May 18 '22 23:05

@dfarr The problem with controller-generated names is that they essentially make the resource impossible to manage using a fully declarative, GitOps-style configuration management system... Because the Spec for the resource with a generated name doesn't actually represent the desired state of the resource, but rather a template of a desired state of a resource (or multiple resources). So, for example, the GitOps controller cannot tell when the desired state of a resource has changed because it doesn't know what the actual name of the resource is, only the generated name template for an instance of that resource type...

This is the problem with imperative-style APIs like SageMaker conflicting with declarative-style APIs like Kubernetes. It's almost like we need to create a separate TrainingJobTemplate resource with support for generated names and have the SageMaker ACK controller treat those resources in the same way that the Kubernetes built-in Deployment controller treats the Spec.Template for Pods in the Deployment...

jaypipes, May 21 '22 04:05

A generateName-like name could be resolved by a mutating webhook before the resource reaches the SageMaker ACK controller. Once the name has been set it does not change, so the ACK controller would only ever see the static name field.
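To make the webhook idea concrete, here is a minimal sketch of the mutation logic only, assuming a webhook registered for CREATE on TrainingJob resources. The function names, the 8-character hex suffix, and the 63-character limit on trainingJobName are assumptions for illustration; HTTP serving, TLS, and the MutatingWebhookConfiguration are left out. It is not something the ACK controller provides today.

```python
import base64
import json
import secrets

# Assumption: SageMaker training job names may be at most 63 characters.
MAX_NAME_LENGTH = 63
SUFFIX_LENGTH = 8  # secrets.token_hex(4) yields 8 hex characters


def unique_job_name(base: str) -> str:
    """Append a random hex suffix to base, trimming base so the result fits."""
    suffix = secrets.token_hex(SUFFIX_LENGTH // 2)
    trimmed = base[: MAX_NAME_LENGTH - SUFFIX_LENGTH - 1].rstrip("-")
    return f"{trimmed}-{suffix}"


def mutate(admission_review: dict) -> dict:
    """Build an AdmissionReview response that rewrites spec.trainingJobName.

    Only CREATE requests are patched, so the name is fixed once and the
    controller never sees it change afterwards.
    """
    request = admission_review["request"]
    response = {"uid": request["uid"], "allowed": True}

    if request.get("operation") == "CREATE":
        base = request["object"]["spec"]["trainingJobName"]
        patch = [
            {
                "op": "replace",
                "path": "/spec/trainingJobName",
                "value": unique_job_name(base),
            }
        ]
        # The API server expects the JSON Patch document base64-encoded.
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With something like this in place, each Argo retry would submit the same manifest and still end up with a distinct trainingJobName. The trade-off raised above still applies: the name stored in the cluster no longer matches the one in the submitted manifest.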

Kubernetes itself does this with resources that support metadata.generateName. For example, if I create the following pod:

apiVersion: v1
kind: Pod
metadata:
  generateName: test-
spec:
  containers:
  - name: main
    image: python:3.7
    command: [sleep, '999']

And then run kubectl get -o yaml pod/test-8x6m2, I will see the fully resolved metadata.name field in the returned manifest.
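For reference, the suffix in test-8x6m2 comes from the API server's name generator. The sketch below mirrors that behaviour as I understand it from the apimachinery source (a 5-character suffix drawn from an alphabet without vowels or easily confused characters, with the prefix truncated so the result stays within 63 characters). The function and constant names here are my own, and the generated name is random, so a collision is unlikely but still possible.

```python
import random

# Alphabet used for generated suffixes: no vowels and no 0/1/3, which avoids
# accidental words and look-alike characters.
ALPHANUMS = "bcdfghjklmnpqrstvwxz2456789"
RANDOM_LENGTH = 5
MAX_NAME_LENGTH = 63
MAX_PREFIX_LENGTH = MAX_NAME_LENGTH - RANDOM_LENGTH


def generate_name(prefix: str, rng=random) -> str:
    """Return prefix plus a random 5-character suffix, at most 63 characters."""
    if len(prefix) > MAX_PREFIX_LENGTH:
        prefix = prefix[:MAX_PREFIX_LENGTH]
    suffix = "".join(rng.choice(ALPHANUMS) for _ in range(RANDOM_LENGTH))
    return prefix + suffix


if __name__ == "__main__":
    # Prints something of the form test-xxxxx; the suffix differs on each run.
    print(generate_name("test-"))
```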

dfarr, Jun 08 '22 19:06

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

eks-bot, Sep 06 '22 22:09

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle rotten

eks-bot, Oct 06 '22 22:10

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Provide feedback via https://github.com/aws-controllers-k8s/community. /close

eks-bot, Nov 05 '22 22:11

@eks-bot: Closing this issue.

ack-bot, Nov 05 '22 22:11