amazon-sagemaker-operator-for-k8s

Why are HPOJobs blocked for "not having enough combinations of hyperparameter ranges"?

Open bnsblue opened this issue 3 years ago • 3 comments

We saw HPOJobs stuck in the Reconciling state because there were not enough combinations of hyperparameter ranges:

Unable to create HyperParameter Tuning Job: ValidationException: You don't have enough combinations of hyperparameter ranges. The total number of hyperparameter combinations for the provided ranges [3.0] must be equal or greater than the value of MaxNumberOfJobs, [10]. Provide additional ranges." 

It is a bit confusing that the semantic of MaxNumberOfJob enforces a lower bound on the number of combinations, and I feel this doesn't seem like a status where the job should be held in Reconciling state. May I ask what the reason is for blocking such jobs?

bnsblue avatar Jul 15 '20 22:07 bnsblue

Hi,

The ValidationException is coming from SageMaker when using parameterRanges in the hyperParameterTuningJobConfig.

> semantic of MaxNumberOfJob enforces a lower bound on the number of combinations

What types of parameter ranges, and which scaling types, are you using when you see this error?

> doesn't seem like a status where the job should be held in Reconciling state

A job that stays stuck in the Reconciling state after a validation error looks like a bug. I will try to reproduce it on our end. Please provide us with a minimal reproducible sample input.

Thanks

surajkota avatar Jul 16 '20 00:07 surajkota

I was able to replicate the job stuck in ReconcilingTuningJob status with the below job definition

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: kmeans-mnist-hpo-3
spec:
  region: us-east-1
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    hyperParameterTuningJobObjective:
      type: Minimize
      metricName: test:msd
    resourceLimits:
      maxNumberOfTrainingJobs: 10
      maxParallelTrainingJobs: 5
    parameterRanges:
      categoricalParameterRanges:
      - name: init_method
        values:
        - 'random'
        - 'kmeans++'
  trainingJobDefinition:
    staticHyperParameters:
      - name: k
        value: '10'
      - name: feature_dim
        value: '784'
    algorithmSpecification:
      trainingImage: 382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1
      trainingInputMode: File
    roleArn: <REPLACE_ME>
    inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://<REPLACE_ME>/mnist_kmeans_example/train_data/
          s3DataDistributionType: FullyReplicated
      compressionType: None
      recordWrapperType: None
      inputMode: File
    outputDataConfig:
      s3OutputPath: s3://<REPLACE_ME>/mnist_kmeans_example/output
    resourceConfig:
      instanceType: ml.m4.xlarge
      instanceCount: 1
      volumeSizeInGB: 25
    stoppingCondition:
      maxRuntimeInSeconds: 3600

surajkota avatar Jul 22 '20 01:07 surajkota

Hi @surajkota! Sorry for the late reply.

> What types of parameter ranges, and which scaling types, are you using when you see this error?

So in my case, the validation failed when I set maxNumberOfTrainingJobs to 10 and had only, say, one integer parameter range (e.g., num_round) from 1 to 3. The ScalingType was Linear. Since you

> I was able to replicate the job stuck in ReconcilingTuningJob status with the below job definition

In your example it seems that you are encountering ReconcilingTuningJob because there are only two possible training job configurations: init_method = ['random', 'kmeans++'] and you have maxNumberOfTrainingJobs=10. This aligns with my experience.
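To make the arithmetic behind both cases concrete, here is a rough Python sketch that counts the discrete combinations in a parameterRanges block and compares the result with maxNumberOfTrainingJobs. It is not how SageMaker or the operator actually computes the number. The helper name `count_combinations` is made up, the `integerParameterRanges` / `minValue` / `maxValue` field names are assumed to mirror the CRD, and treating any continuous range as infinite is my own assumption.

```python
from math import prod


def count_combinations(parameter_ranges: dict) -> float:
    """Count discrete hyperparameter combinations in a parameterRanges-style dict.

    Assumptions (not confirmed in this thread): an integer range contributes
    maxValue - minValue + 1 values, and any continuous range makes the
    count effectively infinite.
    """
    if parameter_ranges.get("continuousParameterRanges"):
        return float("inf")
    # One size per range: number of categorical values, or width of the integer range.
    sizes = [
        len(r["values"])
        for r in parameter_ranges.get("categoricalParameterRanges", [])
    ]
    sizes += [
        int(r["maxValue"]) - int(r["minValue"]) + 1
        for r in parameter_ranges.get("integerParameterRanges", [])
    ]
    # No ranges at all means nothing to tune.
    return float(prod(sizes)) if sizes else 0.0


# The kmeans example from this thread: one categorical range with two values.
kmeans_ranges = {
    "categoricalParameterRanges": [
        {"name": "init_method", "values": ["random", "kmeans++"]}
    ]
}

# The num_round case: a single integer range from 1 to 3.
num_round_ranges = {
    "integerParameterRanges": [
        {"name": "num_round", "minValue": "1", "maxValue": "3"}
    ]
}

max_jobs = 10
for label, ranges in [("kmeans", kmeans_ranges), ("num_round", num_round_ranges)]:
    combos = count_combinations(ranges)
    print(f"{label}: {combos} combinations, enough for {max_jobs} jobs: {combos >= max_jobs}")
```

Under these assumptions the kmeans example yields 2 combinations and the num_round case yields 3, which matches the `[3.0]` reported in the ValidationException; both fall short of 10.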

What I am trying to get at in this issue is that, even if the number of possible configurations is smaller than maxNumberOfTrainingJobs, SageMaker should still let the job proceed. maxNumberOfTrainingJobs should enforce only an upper limit on the number of training jobs that are launched when the total number of possibilities is larger; its semantics should not require the HPO job to have at least maxNumberOfTrainingJobs training jobs.
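Until something like that is supported, one possible client-side workaround is to clamp the limits before submitting the job. The sketch below is only an illustration and does not describe anything the operator or SageMaker does today. `clamp_resource_limits` is a hypothetical helper, the dict is assumed to mirror the spec section of the manifest, and also lowering maxParallelTrainingJobs so it does not exceed the new total is a conservative assumption on my part.

```python
import copy


def clamp_resource_limits(spec: dict, combinations: int) -> dict:
    """Return a copy of spec whose job limits do not exceed `combinations`.

    Hypothetical helper: `spec` is assumed to mirror the manifest's spec
    section, and `combinations` is the number of distinct hyperparameter
    configurations, counted by the caller.
    """
    clamped = copy.deepcopy(spec)  # leave the caller's dict untouched
    limits = clamped["hyperParameterTuningJobConfig"]["resourceLimits"]
    # Treat maxNumberOfTrainingJobs purely as an upper bound.
    limits["maxNumberOfTrainingJobs"] = min(
        limits["maxNumberOfTrainingJobs"], combinations
    )
    # Assumption: keep the parallel limit at or below the total limit.
    limits["maxParallelTrainingJobs"] = min(
        limits["maxParallelTrainingJobs"], limits["maxNumberOfTrainingJobs"]
    )
    return clamped


# Limits from the kmeans manifest in this thread, which has 2 combinations.
original_spec = {
    "hyperParameterTuningJobConfig": {
        "resourceLimits": {
            "maxNumberOfTrainingJobs": 10,
            "maxParallelTrainingJobs": 5,
        }
    }
}
clamped_spec = clamp_resource_limits(original_spec, combinations=2)
print(clamped_spec["hyperParameterTuningJobConfig"]["resourceLimits"])
```

For the kmeans manifest this would lower both limits to 2, which should satisfy the check quoted in the ValidationException.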

I hope that makes sense :)

bnsblue avatar Aug 09 '20 06:08 bnsblue

Please use the latest version of SageMaker Operator - https://github.com/aws/amazon-sagemaker-operator-for-k8s#migrate-resources-to-the-new-sagemaker-operators-for-kubernetes

surajkota avatar Jul 19 '23 21:07 surajkota