
Amazon SageMaker Debugger provides functionality to save tensors during the training of machine learning jobs and to analyze those tensors.

Results 92 sagemaker-debugger issues

### Description of changes: Refactored all exception types under the `smdebug/mxnet`, `smdebug/pytorch`, and `smdebug/tensorflow` directories to raise the `SMDebugError` exception type. Also changed the logic of the end of training...
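As a rough illustration of the pattern this change describes, the sketch below funnels arbitrary exceptions into a single error type. The `SMDebugError` class here is a local stand-in defined only for the example, and `wrap_errors` and `save_tensor` are hypothetical names. None of this is taken from the smdebug codebase.

```python
import functools


class SMDebugError(Exception):
    """Local stand-in for a library-wide error type (not smdebug's own class)."""


def wrap_errors(func):
    """Re-raise any exception from func as SMDebugError, keeping the cause."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except SMDebugError:
            # Already the common type, so pass it through unchanged.
            raise
        except Exception as exc:
            # Chain the original exception so the root cause stays visible.
            raise SMDebugError(f"{func.__name__} failed: {exc}") from exc

    return wrapper


@wrap_errors
def save_tensor(name, value):
    # Hypothetical hook function used only to exercise the decorator.
    if value is None:
        raise ValueError(f"tensor {name!r} has no value")
    return (name, value)
```

Callers then only need to handle one exception type, while `__cause__` still carries the original error for debugging.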

### Description of changes: readthedocs build log: https://readthedocs.org/projects/sagemaker-debugger/builds/14082688/ pre-launched doc: https://sagemaker-debugger.readthedocs.io/en/website/ #### Style and formatting: I have run `pre-commit install` to ensure that auto-formatting happens with every commit. #### Issue...

### Description of changes: Extended smdebug's reductions to check for NaN and Inf values and to compute quantiles for PyTorch tensors. Tensors are now also written out in TensorBoard format such...
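To make the idea of such reductions concrete, here is a minimal sketch using only the Python standard library on a flat list of floats. It is not smdebug's implementation and does not operate on PyTorch tensors. The function name `summarize` and the returned dictionary keys are assumptions made for the example.

```python
import math
import statistics


def summarize(values, n_quantiles=4):
    """Return NaN/Inf flags and quantile cut points for a flat list of floats."""
    has_nan = any(math.isnan(v) for v in values)
    has_inf = any(math.isinf(v) for v in values)
    # Quantiles are only meaningful on finite values, so filter first.
    finite = [v for v in values if math.isfinite(v)]
    # statistics.quantiles requires at least two data points.
    cuts = statistics.quantiles(finite, n=n_quantiles) if len(finite) >= 2 else []
    return {"has_nan": has_nan, "has_inf": has_inf, "quantiles": cuts}
```

Filtering non-finite values before computing quantiles is a design choice of this sketch. A real reduction might instead report them separately or fail fast.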

After countless hours of trying to get an `Estimator()` to run on a custom `image_uri` in smdistributed/dataparallel mode (it was failing on trying to import any non-sagemaker-DLC library), I finally...

### Description of changes: I reproduce the error by running
```
nvidia-docker run -it http://763104351884.dkr.ecr.us-east-1.amazonaws.com/autogluon-training:0.4.2-gpu-py38-cu112-ubuntu20.04
python3
import horovod.torch
```
It gives me the following warning.
```
Extension horovod.torch has not...
```

`import horovod.torch` does not raise an exception even if the import is not fully successful. Add `init()` function to catch it.

### Description of changes: Resolves `SyntaxWarning: "is not" with a literal. Did you mean "!="?` warning introduced in Python 3.8. Identity string comparison replaced with equality comparison. This warning appears...
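The difference between the two comparisons can be sketched as follows. The snippet compiles a small piece of source that uses `is not` with a string literal and records the warnings the compiler emits, then shows the equality-based replacement. The names `check`, `mode`, and the `"train"` literal are hypothetical and not taken from the smdebug change itself.

```python
import warnings

# Source that uses identity comparison against a string literal.
BAD_SNIPPET = 'def check(mode):\n    return mode is not "train"\n'
# The same check rewritten with equality comparison.
GOOD_SNIPPET = 'def check(mode):\n    return mode != "train"\n'


def compile_warnings(source):
    """Compile source and return (category name, message) for each warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [(w.category.__name__, str(w.message)) for w in caught]


def check(mode):
    # Equality compares string contents, which is the intended behavior;
    # identity would depend on whether two strings share one object.
    return mode != "train"
```

On Python 3.8 and later, calling `compile_warnings(BAD_SNIPPET)` should report a `SyntaxWarning`, while `GOOD_SNIPPET` should compile without one. The exact message text varies between Python versions.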

The [test_pytorch[False-False]](https://github.com/awslabs/sagemaker-debugger/blob/v1.0.13/tests/zero_code_change/test_pytorch_integration.py#L29) test fails for PyTorch >= 1.7. The test does, however, pass for the [True-False] combination. It is integrated into the [DLC test suite](https://github.com/aws/deep-learning-containers/blob/master/test/dlc_tests/container_tests/bin/testSmprofiler#L114), which reports a failure for the [False-False] combination.

Hi, I really wish I could use `TensorBoardOutputConfig` when my SageMaker Estimator instance type is `local` or `local_gpu`. This could be accomplished by either or both of: - Enabling streaming...

When retrieving `full_shap` values, debugger returns a matrix in the shape of `number of training samples, number of features` e.g.
```
for index,i in enumerate(trial.tensor_names(regex='full_shap')):
    tensor = trial.tensor(i).value(step_num=50)
    print(i, tensor.shape)...
```