
SageMaker SDK Bug Report: HyperparameterTuner Missing Container Mode Support

Open · josh-gree opened this issue 7 months ago · 5 comments

Describe the bug

HyperparameterTuner does not preserve the container mode parameters (container_entry_point and container_arguments) when creating training jobs, causing tuning jobs to fail. Individual training jobs work correctly with container mode, but hyperparameter tuning jobs lose the container configuration and fall back to script mode logic, resulting in failures.

To reproduce

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Create estimator with container mode
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    container_entry_point=["python", "-m", "my_module"],
    container_arguments=["train", "model1"],
)

# Create hyperparameter tuner
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.001, 0.1)
    },
    max_jobs=2,
    max_parallel_jobs=1,
)

# This will fail - individual training jobs missing container parameters
tuner.fit()

Expected behavior

The hyperparameter tuning job should preserve the container mode configuration and set ContainerEntrypoint and ContainerArguments in the AlgorithmSpecification of individual training jobs, just like when calling estimator.fit() directly.
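
If the parameters were preserved, the specification for each child training job would look something like the following (values taken from the reproduction above, mirroring what session.train() emits for standalone jobs):

"AlgorithmSpecification": {
    "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    "TrainingInputMode": "File",
    "ContainerEntrypoint": ["python", "-m", "my_module"],
    "ContainerArguments": ["train", "model1"]
}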

Screenshots or logs

An individual training job within the tuning job shows the missing container parameters:

"AlgorithmSpecification": {
    "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    "TrainingInputMode": "File",
    "MetricDefinitions": [...],
    "EnableSageMakerMetricsTimeSeries": false
    // Missing: ContainerEntrypoint and ContainerArguments
}

Training jobs fail with:

AlgorithmError: Framework Error: 
AttributeError: 'NoneType' object has no attribute 'endswith'

System information

  • SageMaker Python SDK version: 2.244.2
  • Framework name: Custom container (Estimator class)
  • Framework version: N/A
  • Python version: 3.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

Root cause analysis

The issue is in two locations in the SDK:

1. sagemaker/job.py - Missing container parameter extraction

The _Job._load_config() method (lines 117-124) only extracts basic configuration and ignores the container mode parameters:

return {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
    # Missing: container_entry_point, container_arguments
}
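
A quick way to observe the drop (this pokes at a private SDK helper, so the exact signature may vary by version; expand_role=False skips the IAM call, but a configured AWS region is still required):

from sagemaker.estimator import Estimator
from sagemaker.job import _Job

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    container_entry_point=["python", "-m", "my_module"],
)

config = _Job._load_config("s3://my-bucket/train", estimator, expand_role=False)
print("container_entry_point" in config)  # False: the parameter was dropped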

2. sagemaker/session.py - Missing container parameter handling

The _map_training_config() method (line 3584+) doesn't accept container parameters in its signature and doesn't include them in the AlgorithmSpecification (lines 3685-3694).

The method signature is missing container_entry_point and container_arguments parameters, and the AlgorithmSpecification construction only includes:

algorithm_spec = {"TrainingInputMode": input_mode}
if metric_definitions is not None:
    algorithm_spec["MetricDefinitions"] = metric_definitions

if algorithm_arn:
    algorithm_spec["AlgorithmName"] = algorithm_arn
else:
    algorithm_spec["TrainingImage"] = image_uri

# Missing: ContainerEntrypoint and ContainerArguments

Comparison with working code

Individual training jobs work because session.train() correctly handles container parameters (lines 1266-1270):

if container_entry_point is not None:
    train_request["AlgorithmSpecification"]["ContainerEntrypoint"] = container_entry_point

if container_arguments is not None:
    train_request["AlgorithmSpecification"]["ContainerArguments"] = container_arguments

Code path analysis

Working path (individual training jobs):

  1. estimator.fit() → session.train() → ✅ Includes container parameters

Broken path (hyperparameter tuning):

  1. tuner.fit() → _TuningJob._prepare_training_config()
  2. _Job._load_config() → ❌ Drops container parameters
  3. session._map_training_config() → ❌ Doesn't handle container parameters

Verification

  • ✅ Container mode works with estimator.fit() (individual training jobs)
  • ❌ Container mode fails with tuner.fit() (hyperparameter tuning)
  • ✅ Script mode works with tuner.fit()

Impact

This prevents users from using container mode with hyperparameter tuning, forcing them to use script mode for tuning jobs even when their training logic is containerized.

Suggested fix

  1. Update _Job._load_config() to extract container parameters from the estimator:
# Add to the return dict:
config = {
    "input_config": input_config,
    "role": role,
    "output_config": output_config,
    "resource_config": resource_config,
    "stop_condition": stop_condition,
    "vpc_config": vpc_config,
}

# Add container mode parameters
if hasattr(estimator, 'container_entry_point') and estimator.container_entry_point:
    config['container_entry_point'] = estimator.container_entry_point
    
if hasattr(estimator, 'container_arguments') and estimator.container_arguments:
    config['container_arguments'] = estimator.container_arguments

return config
  2. Update _map_training_config() signature to accept container parameters and include them in AlgorithmSpecification:
def _map_training_config(
    cls,
    static_hyperparameters,
    input_mode,
    role,
    output_config,
    stop_condition,
    # ... existing params ...
    container_entry_point=None,  # Add this
    container_arguments=None,    # Add this
):
    # ... existing code ...
    
    # Add to AlgorithmSpecification:
    if container_entry_point is not None:
        algorithm_spec["ContainerEntrypoint"] = container_entry_point
        
    if container_arguments is not None:
        algorithm_spec["ContainerArguments"] = container_arguments

This would align the hyperparameter tuning code path with the working individual training job implementation.

josh-gree · May 25 '25 18:05

Hi @josh-gree, have you explored the new ModelTrainer interface, which is an upgrade to the Estimator?

https://sagemaker.readthedocs.io/en/stable/api/training/model_trainer.html https://aws.amazon.com/blogs/machine-learning/accelerate-your-ml-lifecycle-using-the-new-and-improved-amazon-sagemaker-python-sdk-part-1-modeltrainer/

The container entrypoint information is provided through the source_code parameter.
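
For reference, a rough ModelTrainer equivalent of the reproduction above (a sketch based on the linked docs; treat the exact parameter names, in particular SourceCode(command=...), as assumptions to verify against your SDK version):

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

# The command field stands in for container_entry_point + container_arguments.
trainer = ModelTrainer(
    training_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    compute=Compute(instance_type="ml.m5.large", instance_count=1),
    source_code=SourceCode(command="python -m my_module train model1"),
)
trainer.train()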

Please take a look at these and let us know if this is still a gap.

nargokul · May 28 '25 18:05

@nargokul I was not aware this existed - should I take this to mean that the Estimator approach is basically no longer supported? What does this mean for HyperparameterTuner, which takes an estimator as input?

josh-gree · May 28 '25 20:05

We'd recommend using ModelTrainer. ModelTrainer is the next-generation interface for the SageMaker Python SDK, and it's designed to address many of the pain points you may have experienced with the traditional Estimator approach.

Key benefits you'll experience:

  • Simplified API design: a more intuitive and consistent interface that reduces boilerplate code
  • Enhanced flexibility: better support for modern ML frameworks and custom training scenarios
  • Improved performance: optimized for faster training job setup and execution
  • Future-proof: all new SageMaker features and optimizations will be built on ModelTrainer first

Migration path: while Estimator remains fully supported in SDK v2, ModelTrainer represents AWS's strategic direction. By adopting it now, you'll:

  • Stay ahead of the curve with the latest capabilities
  • Benefit from ongoing performance improvements and new features
  • Ensure your codebase aligns with AWS best practices
  • Reduce technical debt as the ecosystem evolves

Along with this, we also have the sagemaker-core library, which is a lower-level SDK. https://sagemaker-core.readthedocs.io/en/stable/ https://aws.amazon.com/blogs/machine-learning/introducing-sagemaker-core-a-new-object-oriented-python-sdk-for-amazon-sagemaker/

For your case of hyperparameter tuning job creation, HyperParameterTuningJob.create() should replace the implementation that is currently in HyperparameterTuner.

https://github.com/aws/sagemaker-core/blob/main/src/sagemaker_core/main/resources.py#L13454

nargokul · Jun 02 '25 23:06

As for actual support for these parameters, it looks like container_entrypoint and container_arguments are not supported at the API level: https://boto3.amazonaws.com/v1/documentation/api/1.26.85/reference/services/sagemaker/client/create_hyper_parameter_tuning_job.html

Will check with the API team and keep this thread posted

nargokul · Jun 02 '25 23:06

Are there any updates on this? The parameters container_entrypoint and container_arguments are supported in AlgorithmSpecification for CreateTrainingJob, but are not supported in the equivalent HyperParameterAlgorithmSpecification for CreateHyperParameterTuningJob. This makes hyperparameter tuning impossible when using container mode.
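
One way to confirm the gap locally, without calling the service, is to inspect the botocore service model (shape names as in the API docs):

import botocore.session

# Compare the training-job and tuning-job algorithm specification shapes.
model = botocore.session.get_session().get_service_model("sagemaker")
train_spec = model.shape_for("AlgorithmSpecification")
tuning_spec = model.shape_for("HyperParameterAlgorithmSpecification")

print("ContainerEntrypoint" in train_spec.members)   # True
print("ContainerEntrypoint" in tuning_spec.members)  # False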

jonathanTaiv · Dec 05 '25 20:12