amazon-sagemaker-operator-for-k8s
Why are HPOJobs blocked for "not having enough combinations of hyperparameter ranges"?
We saw HPOJobs stuck in the `Reconciling` state because of not having enough combinations of hyperparameter ranges:

> Unable to create HyperParameter Tuning Job: ValidationException: You don't have enough combinations of hyperparameter ranges. The total number of hyperparameter combinations for the provided ranges [3.0] must be equal or greater than the value of MaxNumberOfJobs, [10]. Provide additional ranges.
It is a bit confusing that the semantics of `MaxNumberOfJobs` enforce a lower bound on the number of combinations, and I feel this is not a situation where the job should be held in the `Reconciling` state. May I ask what the reason is for blocking such jobs?
Hi,

The ValidationException is coming from SageMaker when using `parameterRanges` in the `hyperParameterTuningJobConfig`.

> the semantics of `MaxNumberOfJobs` enforce a lower bound on the number of combinations

What type of parameter ranges, and what scale, are you using when you see this error?

> this is not a situation where the job should be held in the `Reconciling` state

Jobs getting stuck in the `Reconciling` state after a validation error occurs does look like an issue. I will try to reproduce it on our end. In the meantime, please provide us with a minimal reproducible sample input.

Thanks
I was able to replicate the job getting stuck in the `ReconcilingTuningJob` status with the job definition below:
```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: kmeans-mnist-hpo-3
spec:
  region: us-east-1
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: test:msd
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      categoricalParameterRanges:
        - name: init_method
          values:
            - 'random'
            - 'kmeans++'
  trainingJobDefinition:
    staticHyperParameters:
      - name: k
        value: '10'
      - name: feature_dim
        value: '784'
    algorithmSpecification:
      trainingImage: 382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1
      trainingInputMode: File
    roleArn: <REPLACE_ME>
    inputDataConfig:
      - channelName: train
        dataSource:
          s3DataSource:
            s3DataType: S3Prefix
            s3Uri: s3://<REPLACE_ME>/mnist_kmeans_example/train_data/
            s3DataDistributionType: FullyReplicated
        compressionType: None
        recordWrapperType: None
        inputMode: File
    outputDataConfig:
      s3OutputPath: s3://<REPLACE_ME>/mnist_kmeans_example/output
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600
```
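As a side note, this particular spec would presumably pass SageMaker's validation if `resourceLimits` were capped at the number of possible combinations (two `init_method` values). A sketch of that change:

```yaml
# Sketch of a workaround for this spec: with only 2 possible combinations
# (init_method = 'random' or 'kmeans++'), cap the training job count at 2
# so the combination check passes.
resourceLimits:
  maxNumberOfTrainingJobs: 2
  maxParallelTrainingJobs: 2
```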
Hi @surajkota! Sorry for the late reply.
> What type of parameter ranges, and what scale, are you using when you see this error?
So in my case, the validation failed when I set `maxNumberOfTrainingJobs` to `10` and only had, say, an integer parameter range (e.g., `num_round`) from 1 to 3. The `ScalingType` was `Linear`.
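For concreteness, here is a sketch of that failing `parameterRanges` section (field names assumed to follow the operator's camelCase CRD convention):

```yaml
# Sketch: num_round can only take 3 values (1, 2, 3), which is fewer than
# maxNumberOfTrainingJobs (10), so SageMaker rejects the tuning job with
# the ValidationException quoted above.
resourceLimits:
  maxNumberOfTrainingJobs: 10
parameterRanges:
  integerParameterRanges:
    - name: num_round
      minValue: '1'
      maxValue: '3'
      scalingType: Linear
```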
And since you mentioned:

> I was able to replicate the job getting stuck in the `ReconcilingTuningJob` status with the job definition below
In your example, it seems that you are encountering `ReconcilingTuningJob` because there are only two possible training job configurations (`init_method = ['random', 'kmeans++']`) while you have `maxNumberOfTrainingJobs=10`. This aligns with my experience.
What I am trying to get at in this issue is that, even if the number of possible configurations is smaller than `maxNumberOfTrainingJobs`, SageMaker should still let the job proceed. `maxNumberOfTrainingJobs` should enforce only an upper limit on the number of training jobs that get launched when the total number of possibilities is larger; its semantics should not require the HPO job to run at least `maxNumberOfTrainingJobs` training jobs.
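To make the semantics I would expect concrete, a hypothetical sketch (the comments describe desired behavior, not what SageMaker does today):

```yaml
# Desired behavior: with only 2 possible combinations, the tuner would simply
# launch min(2, maxNumberOfTrainingJobs) = 2 training jobs instead of
# rejecting the job with a ValidationException.
resourceLimits:
  maxNumberOfTrainingJobs: 10   # treated purely as an upper bound
```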
I hope that makes sense :)
Please use the latest version of the SageMaker Operator - https://github.com/aws/amazon-sagemaker-operator-for-k8s#migrate-resources-to-the-new-sagemaker-operators-for-kubernetes