sagemaker-debugger
Amazon SageMaker Debugger provides functionality to save tensors during the training of machine learning jobs and to analyze those tensors.
Q. When creating a custom collection, is there a way to define the EVAL/TRAIN save_interval directly in the SageMaker Estimator? ANS: Yes, it can be provided; for details see this section...
Running the following script with tensorflow==1.15.0:
```
import tensorflow.compat.v2 as tf
import smdebug.tensorflow as smd
from tempfile import TemporaryDirectory

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test...
```
In CI: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=DO-NOT-DELETE-smdebug_rules-LOGS-ONE-REPO;stream=codebuild/c3bda538-9277-42db-931a-de5984013923;filter=%22Loaded%20Index%20Files:%20upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json%22 Why is this line repeated so many times: "Loaded Index Files: upload/20200106_221841/c33ae10/s3_trials/trial_loss_not_decreasing_tf_true_parallel_mode_1578351365.7939517/index/000000000/000000000070_worker_0.json"? Are we reloading index files again and again? @NihalHarish Please check and confirm.
Come up with a way for CI to print the running time of each test. Find which integration tests run the longest and optimize them so they run faster....
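A minimal sketch of per-test timing using only the standard library. The names `time_tests` and `report` are hypothetical and are not part of this repository; if the CI runs the tests through pytest, its built-in `--durations=N` option already reports the slowest tests without custom code.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_tests(tests: Dict[str, Callable[[], None]]) -> List[Tuple[str, float]]:
    """Run each test callable and return (name, seconds), slowest first."""
    timings = []
    for name, test in tests.items():
        start = time.perf_counter()
        test()
        timings.append((name, time.perf_counter() - start))
    # Sort so the longest-running tests appear at the top of the report.
    return sorted(timings, key=lambda item: item[1], reverse=True)


def report(timings: List[Tuple[str, float]]) -> None:
    """Print one line per test with its running time in seconds."""
    for name, seconds in timings:
        print(f"{seconds:8.3f}s  {name}")
```

Sorting slowest-first makes the candidates for optimization easy to spot in the CI log.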
If I use the script tf_simple.py with MonitoredSession(hook), I see inconsistent behavior. Link to script: https://gist.github.com/Vikas-kum/a726aa05f70cbc22da55aac6f9f122d2 Repro: the command to run and reproduce is provided in the script (link...
When parameters are created via tracing (at runtime), not all of them exist until after the first step. Can confirm this works, thanks to Rahul H: def forward_hook(self, module, inputs,...
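The issue text is truncated, so the following is only an illustrative sketch of the general idea: collecting parameters inside a forward hook, after the first forward pass has created them. `LazyLayer` and `ParamRecorder` are hypothetical toy classes, not the framework's or smdebug's API, and the third hook argument (`outputs`) is an assumption.

```python
class LazyLayer:
    """Toy layer (hypothetical) whose weights are created on the first call."""

    def __init__(self):
        self.params = {}
        self._hooks = []

    def register_forward_hook(self, hook):
        self._hooks.append(hook)

    def __call__(self, inputs):
        # Parameters are only materialized once the input shape is known.
        if not self.params:
            self.params["weight"] = [1.0] * len(inputs)
        outputs = [w * x for w, x in zip(self.params["weight"], inputs)]
        for hook in self._hooks:
            hook(self, inputs, outputs)
        return outputs


class ParamRecorder:
    """Collects parameter names lazily, from inside the forward hook."""

    def __init__(self):
        self.seen = set()

    def forward_hook(self, module, inputs, outputs):
        # By the time the hook fires, the first forward pass has run,
        # so the module's parameters exist and can be recorded.
        self.seen.update(module.params)
```

Recording inside the hook avoids inspecting the module before its first step, when the parameter dictionary is still empty.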
Instead of saving reductions as histograms, save them as scalar summaries for the chosen reduction.
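As a rough sketch of what this could look like, each requested reduction collapses a tensor to one number that can be logged as a scalar. The function `scalar_summaries`, the `REDUCTIONS` table, and the `name/reduction` key format are assumptions for illustration, not the library's implementation.

```python
import math
from typing import Callable, Dict, Iterable, List

# Illustrative subset of reductions; each maps a flat list of floats to one scalar.
REDUCTIONS: Dict[str, Callable[[List[float]], float]] = {
    "min": min,
    "max": max,
    "mean": lambda v: sum(v) / len(v),
    "l2": lambda v: math.sqrt(sum(x * x for x in v)),
}


def scalar_summaries(
    tensor_name: str, values: Iterable[float], reductions: Iterable[str]
) -> Dict[str, float]:
    """Return one scalar per requested reduction, keyed as '<tensor>/<reduction>'."""
    flat = [float(v) for v in values]
    return {f"{tensor_name}/{r}": REDUCTIONS[r](flat) for r in reductions}
```

For example, `scalar_summaries("loss", [3.0, 4.0], ["min", "mean", "l2"])` yields one scalar each for `loss/min`, `loss/mean`, and `loss/l2`.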
Current implementation: if n steps are present, the index files for all n steps are downloaded before the call finishes. We should provide a way to access a single step at random. Use...
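One possible shape for random access, sketched under the assumption that the index for a single step can be fetched independently. `LazyStepIndex` and the `fetch_step` callable are hypothetical names; the real download logic is not shown in the issue.

```python
from typing import Callable, Dict


class LazyStepIndex:
    """Fetch the index for a step only when it is first requested, then cache it."""

    def __init__(self, fetch_step: Callable[[int], dict]):
        # fetch_step is a stand-in for whatever downloads one step's index file.
        self._fetch_step = fetch_step
        self._cache: Dict[int, dict] = {}

    def __getitem__(self, step: int) -> dict:
        if step not in self._cache:
            self._cache[step] = self._fetch_step(step)
        return self._cache[step]

    def loaded_steps(self) -> list:
        """Steps whose index has been fetched so far, in ascending order."""
        return sorted(self._cache)
```

With this layout, asking for step 70 triggers a single fetch for that step rather than n downloads, and repeated lookups are served from the cache.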