
debuggerHook is not saving tensors in s3

Open · tiru1930 opened this issue Sep 18 '20 · 4 comments

Describe the bug

Exception during rule evaluation: Customer Error: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded

To reproduce: Train framework-mode XGBoost with the debugger hook as below

# Imports needed to run the snippet below; bucket, prefix, le, sm_sess,
# destination_prediction_experiment and the s3_input_* channels are defined elsewhere in the notebook.
import time

import boto3
import sagemaker
from sagemaker.xgboost import XGBoost
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
from smexperiments.trial import Trial

hyperparams = {"max_depth":5,
               "subsample":0.8,
               "num_round":600,
               "eta":0.2,
               "gamma":4,
               "min_child_weight":6,
               "silent":0,
               "objective":'multi:softmax',
               "num_class":len(le.classes_),
               "smdebug_path":f"s3://{bucket}/{prefix}/debug",
               "smdebug_collections":"metrics,feature_importance"
              }
save_interval = 5

entry_point_script = "xgboost_dest_prediction.py"

trial = Trial.create(trial_name="framework-mode-trial-{}".format(time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())), 
                     experiment_name=destination_prediction_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

framework_xgb = XGBoost(
                      entry_point=entry_point_script,
                      role=sagemaker.get_execution_role(),
                      framework_version='0.90-2',
                      py_version="py3",
                      hyperparameters=hyperparams,
                      instance_count=1, 
                      instance_type='ml.m4.xlarge',
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      base_job_name="demo-xgboost-destination-prediction",
                      sagemaker_session=sm_sess,
#                       rules=debug_rules,
                      use_spot_instances = True,
                      max_run = 3600,
                      max_wait = 3600,
                      input_mode = 'File',
                      debugger_hook_config=DebuggerHookConfig(
                            s3_output_path=f"s3://{bucket}/{prefix}/debug",  # Required
                            collection_configs=[
                                CollectionConfig(
                                    name="metrics",
                                    parameters={
                                        "save_interval": str(save_interval)
                                    }
                                )
                            ],
                        ),

                      rules=[
                            Rule.sagemaker(
                                rule_configs.loss_not_decreasing(),
                                rule_parameters={
                                    "collection_names": "metrics",
                                    "num_steps": str(save_interval * 2),
                                },
                            ),
                        ],

                    )

framework_xgb.fit({'train': s3_input_train,
                   'validation': s3_input_validation}, 
                  experiment_config={
                      "ExperimentName": destination_prediction_experiment.experiment_name, 
                      "TrialName": trial.trial_name,
                      "TrialComponentDisplayName": "Training",
                  })
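
For reference, the entry point xgboost_dest_prediction.py is not shown here. In framework (script) mode the smdebug hook typically has to be created inside the training script and passed to xgboost.train as a callback; a rough sketch of that pattern (an illustration only, not the actual contents of the script):

import xgboost as xgb
from smdebug.xgboost import Hook

# Build the hook from the JSON config that SageMaker writes into the container
# when debugger_hook_config is set on the estimator.
hook = Hook.create_from_json_file()

# params, num_round, dtrain and dvalid stand in for whatever the real script
# builds from its hyperparameters and input channels.
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=num_round,
    evals=[(dtrain, "train"), (dvalid, "validation")],
    callbacks=[hook],  # script mode typically emits no tensors unless the hook is registered
)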

Expected behavior: Tensors should be saved in S3.

Screenshots or logs

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:990360540682:processing-job/demo-xgboost-destination-p-lossnotdecreasing-abb2296f',
  'RuleEvaluationStatus': 'Error',
  'StatusDetails': 'ClientError: No debugging data was saved by the training job. Check that the debugger hook was configured correctly before starting the training job. Exception: Training job has ended. All the collection files could not be loaded\nTraceback (most recent call last):\n  File "evaluate.py", line 112, in _create_trials\n    range_steps=(self.start_step, self.end_step))\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/utils.py", line 20, in create_trial\n    return LocalTrial(name=name, dirname=path, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/local_trial.py", line 36, in __init__\n    self._load_collections()\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 168, in _load_collections\n    _wait_for_collection_files(1)  # wait for the first collection file\n  File "/usr/local/lib/python3.7/site-packages/smdebug/trials/trial.py", line 165, in _wait_for_collection_files\n    raise MissingCollectionFiles\nsmdebug.exceptions.MissingCollectionFiles: Trainin',
  'LastModifiedTime': datetime.datetime(2020, 9, 18, 11, 6, 27, 290000, tzinfo=tzlocal())}]

System information

  • SageMaker Python SDK version: 2.6
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): XGBoost (framework mode)
  • Framework version: 0.90-2
  • Python version: 3.8
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

tiru1930 commented Sep 18 '20 11:09

@tiru1930 Sorry that you ran into this problem.

  • Does the IAM role you used have the proper permissions to access the debug S3 bucket s3://{bucket}/{prefix}/debug? (A quick way to check whether anything was written there is sketched below.)
  • Could you show me the entire training job if possible?
  • How long did the training job run? Did it successfully generate a model?
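
For reference, a quick way to check whether the hook wrote anything under that prefix (a rough sketch; bucket and prefix are assumed to be the same values used in the estimator above):

import boto3

# List whatever the debugger hook wrote under the debug prefix.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/debug")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
# An empty listing means the hook did not write any event or collection files.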

icywang86rui commented Sep 30 '20 16:09

Does the IAM role you used have the proper permissions to access the debug S3 bucket s3://{bucket}/{prefix}/debug? Yes. Could you show me the entire training job if possible? Do you mean logs?

How long did the training job run? Did it successfully generate a model? It ran for a few minutes, and yes, it generated a model and stored it in S3; I was able to deploy it.

tiru1930 commented Oct 01 '20 13:10

Could you show me the entire training job if possible? Do you mean logs?

Yes. Sorry for the delay. Are you still experiencing this problem?

icywang86rui commented Oct 21 '20 16:10

Hi, I'm having this issue as well. The model will train successfully and I can deploy it to an endpoint, but the "training_job_end.ts" file is empty.

Here's the estimator object I'm using:

# Debugger-related imports (container, role, s3_xgb_output_location, sagemaker_session,
# hyperparameters and save_interval are defined earlier in my notebook).
import sagemaker
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_xgb_output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=s3_xgb_output_location,
        collection_configs=[
            CollectionConfig(name="metrics", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(
                name="feature_importance", parameters={"save_interval": str(save_interval)}
            ),
            CollectionConfig(name="full_shap", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)}),
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],                          
)

Then, I try to get the smdebug trial artifacts:

import smdebug.trials as smd  # assuming `smd` refers to smdebug's trials module

s3_output_path = xgb.latest_job_debugger_artifacts_path()
trial = smd.create_trial(s3_output_path)

Exception: MissingCollectionFiles: Training job has ended. All the collection files could not be loaded
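
To see whether the job wrote anything at all under the debugger artifacts path, I also list that location before creating the trial (a rough sketch using the SDK's S3Downloader helper):

from sagemaker.s3 import S3Downloader

# Inspect the debugger artifacts location of the last training job.
artifacts_path = xgb.latest_job_debugger_artifacts_path()
print(artifacts_path)
print(S3Downloader.list(artifacts_path))
# MissingCollectionFiles generally means no collection files appear in this listing.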

craigbosco commented Jan 09 '23 16:01