sagemaker-debugger

Amazon SageMaker Debugger provides functionality to save tensors during the training of machine learning jobs and to analyze those tensors.

92 sagemaker-debugger issues

Q: When creating a custom collection, is there a way to define the EVAL/TRAIN save_interval directly in the SageMaker Estimator? A: Yes, it can be provided; for details see this section...

documentation
wontfix
FAQ
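
To illustrate what mode-specific intervals mean, here is a minimal sketch. It assumes the "<mode>.save_interval" collection-parameter keys described in the smdebug documentation; verify them against your smdebug and SageMaker SDK versions. The resulting dictionary is the kind of string-valued parameters one would pass to CollectionConfig from sagemaker.debugger. The helpers collection_params and should_save are hypothetical and only show how a per-mode interval selects the saved steps.

```python
# Hypothetical helper: build the string-valued parameters dict for a custom
# collection. The "<mode>.save_interval" keys are an assumption based on the
# smdebug collection-parameter convention; check your installed version.
def collection_params(train_interval: int, eval_interval: int) -> dict:
    return {
        "train.save_interval": str(train_interval),
        "eval.save_interval": str(eval_interval),
    }


# Hypothetical illustration of the semantics: a step is saved when it is a
# multiple of the interval configured for the current mode.
def should_save(step: int, mode: str, params: dict) -> bool:
    interval = int(params[f"{mode}.save_interval"])
    return step % interval == 0


params = collection_params(train_interval=100, eval_interval=10)
# With the SageMaker Python SDK this dict would be passed as, e.g.,
# CollectionConfig(name="my_collection", parameters=params)
print([s for s in range(50) if should_save(s, "eval", params)])
```

With these settings, evaluation steps 0, 10, 20, 30 and 40 would be saved, while training saves only every 100th step.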

Running the following script with tensorflow==1.15.0:

```
import tensorflow.compat.v2 as tf
import smdebug.tensorflow as smd
from tempfile import TemporaryDirectory

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test...
```

In CI: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=DO-NOT-DELETE-smdebug_rules-LOGS-ONE-REPO;stream=codebuild/c3bda538-9277-42db-931a-de5984013923;filter=%22Loaded%20Index%20Files:%20upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json%22 Why is this line repeated so many times: "Loaded Index Files: upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json"? Are we reloading the index files again and again? @NihalHarish, please check and confirm.

Come up with a way for CI to print the running time of each test. Find which integration tests take the longest and optimize them to run faster....
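
If the suite runs under pytest, its built-in `--durations=N` option already reports the N slowest tests. The standard-library sketch below shows the same idea for a hypothetical list of test callables: time each one with time.perf_counter and report the slowest first. The test functions are placeholders, not real smdebug tests.

```python
import time
from typing import Callable, List, Tuple


def time_tests(tests: List[Callable[[], None]]) -> List[Tuple[str, float]]:
    """Run each test once and return (name, seconds) pairs, slowest first."""
    results = []
    for test in tests:
        start = time.perf_counter()
        test()
        elapsed = time.perf_counter() - start
        results.append((test.__name__, elapsed))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Placeholder tests standing in for real integration tests.
def test_fast():
    time.sleep(0.01)


def test_slow():
    time.sleep(0.05)


for name, seconds in time_tests([test_fast, test_slow]):
    print(f"{name}: {seconds:.3f}s")
```

Sorting slowest first puts the best optimization candidates at the top of the CI log.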

If I use the script tf_simple.py with monitoredSession(hook), I see inconsistent behavior. Link to script: https://gist.github.com/Vikas-kum/a726aa05f70cbc22da55aac6f9f122d2 Repro: the command to run and reproduce the issue is provided in the script (link...

When parameters are created via tracing (at runtime), not all of them exist until after the first step. Thanks to Rahul H, I can confirm this works: def forward_hook(self, module, inputs,...
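
The workaround in the issue is a PyTorch forward hook, but its body is truncated above. The sketch below is only a framework-free illustration of the idea, assuming the goal is to pick up parameters that first appear during a forward pass. LazyModule, ParamTracker and their methods are hypothetical names, not smdebug or PyTorch APIs. The hook arguments mimic PyTorch's (module, inputs, output) convention.

```python
from typing import Dict, List


class LazyModule:
    """Hypothetical module whose parameters are created on the first forward call."""

    def __init__(self) -> None:
        self.params: Dict[str, List[float]] = {}
        self.hooks = []

    def register_forward_hook(self, fn) -> None:
        self.hooks.append(fn)

    def __call__(self, inputs: List[float]) -> List[float]:
        if "weight" not in self.params:
            # The parameter's size depends on the input, so it only exists at runtime.
            self.params["weight"] = [1.0] * len(inputs)
        output = [w * x for w, x in zip(self.params["weight"], inputs)]
        for fn in self.hooks:
            fn(self, inputs, output)
        return output


class ParamTracker:
    """Hypothetical hook object that registers parameters as they appear."""

    def __init__(self) -> None:
        self.known: set = set()

    def forward_hook(self, module, inputs, output) -> None:
        # Re-scan on every forward pass so lazily created parameters are
        # picked up after the first step instead of being missed at setup time.
        for name in module.params:
            self.known.add(name)


model = LazyModule()
tracker = ParamTracker()
model.register_forward_hook(tracker.forward_hook)
print(sorted(tracker.known))  # [] before the first step
model([1.0, 2.0])
print(sorted(tracker.known))  # ['weight'] after the first step
```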

Instead of histograms, save them as scalar summaries for the chosen reduction.
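
Here is a minimal sketch of the proposal, assuming a tensor arrives as a flat list of floats. It computes each chosen reduction and emits one scalar per reduction instead of a histogram. The reduction names mirror ones smdebug uses (e.g. mean, max, l2), but the "tensor_name/reduction" tag format and the scalar_summaries helper are hypothetical.

```python
import math
from typing import Dict, Iterable, List

# Reduction functions keyed by name; each maps a flat list of values to a scalar.
REDUCTIONS = {
    "min": min,
    "max": max,
    "mean": lambda v: sum(v) / len(v),
    "l1": lambda v: sum(abs(x) for x in v),
    "l2": lambda v: math.sqrt(sum(x * x for x in v)),
}


def scalar_summaries(name: str, values: List[float], reductions: Iterable[str]) -> Dict[str, float]:
    """Return one scalar summary per chosen reduction instead of a histogram."""
    return {f"{name}/{r}": float(REDUCTIONS[r](values)) for r in reductions}


print(scalar_summaries("dense/weights", [3.0, -4.0], ["mean", "max", "l2"]))
# {'dense/weights/mean': -0.5, 'dense/weights/max': 3.0, 'dense/weights/l2': 5.0}
```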

Current implementation: if n steps are present, the index files for all n steps are downloaded before the call finishes. We should allow random access to a single step. Use...

enhancement
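
One way to get random access can be sketched under two assumptions: there is one index file per step, and the available step numbers can be listed cheaply up front. A loader then fetches only the requested step's index on first access and caches it, so repeated reads neither re-download that file nor reload it. LazyIndex and fetch are hypothetical names. Here, fetch stands in for whatever reads an index file from S3 or local disk.

```python
from typing import Callable, Dict, List


class LazyIndex:
    """Hypothetical per-step index loader with random access and caching."""

    def __init__(self, steps: List[int], fetch: Callable[[int], dict]) -> None:
        self._steps = set(steps)        # step numbers from a cheap listing (assumption)
        self._fetch = fetch             # stand-in for reading one index file
        self._cache: Dict[int, dict] = {}

    def get(self, step: int) -> dict:
        if step not in self._steps:
            raise KeyError(f"step {step} not found")
        # Load only the requested step, once; later calls hit the cache.
        if step not in self._cache:
            self._cache[step] = self._fetch(step)
        return self._cache[step]


calls = []


def fake_fetch(step: int) -> dict:
    calls.append(step)
    return {"step": step, "tensors": ["loss"]}


index = LazyIndex(steps=list(range(0, 1000, 10)), fetch=fake_fetch)
index.get(70)
index.get(70)
print(calls)  # [70]: one load, and the other 99 steps are never fetched
```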

The current CI doesn't support GPU tests.

enhancement
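
Until CI has GPU runners, a common pattern is to mark GPU tests so they are skipped cleanly on CPU-only machines. The sketch below uses the standard library's unittest.skipUnless. As a heuristic, it treats a GPU as available when the nvidia-smi binary is on PATH, and the test class is a placeholder. With pytest, the equivalent would be pytest.mark.skipif.

```python
import shutil
import unittest


def has_gpu() -> bool:
    # Heuristic: treat the presence of nvidia-smi on PATH as "GPU available".
    return shutil.which("nvidia-smi") is not None


class GpuTests(unittest.TestCase):
    @unittest.skipUnless(has_gpu(), "no GPU detected; skipping GPU test")
    def test_gpu_placeholder(self):
        # Placeholder for a real GPU training/saving test.
        self.assertTrue(has_gpu())

    def test_cpu_always_runs(self):
        self.assertIsInstance(has_gpu(), bool)


suite = unittest.TestLoader().loadTestsFromTestCase(GpuTests)
unittest.TextTestRunner(verbosity=2).run(suite)
```

On a CPU-only CI host the GPU test is reported as skipped rather than failed, so the suite stays green.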