
Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors
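The save-then-analyze workflow described above roughly looks like the following minimal sketch (assuming a TensorFlow/Keras training script; the output path and save interval are illustrative, not taken from the repository):

```
import smdebug.tensorflow as smd
from smdebug.trials import create_trial

out_dir = "/tmp/smdebug-demo"  # hypothetical output location

# Hook that saves tensors every 100 steps while the model trains
hook = smd.KerasHook(out_dir=out_dir, save_config=smd.SaveConfig(save_interval=100))
# model.fit(x, y, epochs=1, callbacks=[hook])  # pass the hook as a Keras callback

# After (or during) training, load the saved tensors for analysis
trial = create_trial(out_dir)
print(trial.tensor_names())
```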

92 sagemaker-debugger issues

Issue: https://github.com/awslabs/sagemaker-debugger/issues/321 ### Description of changes: Adds filtering logic to dest_names and ensures that we always ask for a subgraph that is present in the graph def. #### Style and formatting:...

### Description of changes: - The dataset must be prepared outside of the child processes rather than inside them, since preparing it inside the children leads to race conditions. #### Style and formatting:...
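A rough illustration of that change (a sketch only; `prepare_dataset`, `worker`, and the path are hypothetical names, not the repository's code): prepare the data once in the parent, then let the child processes only read it.

```
import multiprocessing as mp

def prepare_dataset(path):
    # download / extract / cache the dataset exactly once, in the parent
    ...

def worker(path):
    # child processes only read the already-prepared dataset
    ...

if __name__ == "__main__":
    data_path = "/tmp/dataset"      # hypothetical location
    prepare_dataset(data_path)      # done before spawning, so no race
    procs = [mp.Process(target=worker, args=(data_path,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```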

It seems that files get deleted between the time the list of checkpoint files is created (`checkpoint_files = self._get_checkpoint_files_in_dir(self._checkpoint_dir)` - https://github.com/awslabs/sagemaker-debugger/blob/master/smdebug/core/state_store.py#L92) and the time their modification times are read (`timestamps = [os.path.getmtime(file) for file in checkpoint_files]` - https://github.com/awslabs/sagemaker-debugger/blob/master/smdebug/core/state_store.py#L99)...
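One way to make the timestamp collection tolerant of checkpoints that disappear between the directory listing and the `getmtime` call is sketched below (an illustration only, not necessarily the fix adopted in the repository; `safe_mtimes` is a hypothetical helper):

```
import os

def safe_mtimes(checkpoint_files):
    timestamps = []
    for f in checkpoint_files:
        try:
            timestamps.append(os.path.getmtime(f))
        except FileNotFoundError:
            # the checkpoint was deleted after the directory listing; skip it
            continue
    return timestamps
```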

I have been looking for an example of using tensorflow_datasets in SageMaker and found the following source: https://github.com/awslabs/sagemaker-debugger/blob/master/tests/tensorflow/test_keras_to_estimator.py I believe this code - test_keras_to_estimator.py - works well for loading data...
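For context, loading data with tensorflow_datasets for a Keras model typically looks like the sketch below (the dataset name, preprocessing, and batch size are illustrative and not taken from test_keras_to_estimator.py):

```
import tensorflow as tf
import tensorflow_datasets as tfds

def input_fn():
    # load a supervised (image, label) dataset and build a tf.data pipeline
    ds = tfds.load("mnist", split="train", as_supervised=True)
    ds = ds.map(lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
    return ds.shuffle(1024).batch(32)
```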

In the section [AWS Deep Learning Containers with Zero Code Change](https://github.com/awslabs/sagemaker-debugger#aws-deep-learning-containers-with-zero-code-change) there is a link to 'SageMaker Debugger's Hook' which currently points to the [glossary](https://github.com/awslabs/sagemaker-debugger/blob/master/api.md#glossary), which no longer exists.

When I add a debugger callback:

```
try:
    debug_hook = smd.KerasHook.create_from_json_file()
    callbacks.append(debug_hook)
except FileNotFoundError:
    log_dir = "tensorboard-log/"
    callbacks.append(tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1))
```

Then I try to save the model after training: `model.save(args.model_dir)`. I...

```
out_dir = os.path.join(tmpdir, str(uuid.uuid4()))
hook = Hook(out_dir=out_dir, include_collections=['all'])
hook.get_collection("all").save_config = SaveConfig(save_interval=3)
```

The save config set for the "all" collection gets applied only to "all" instead of applying to all...
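A workaround sketch, assuming the behaviour reported above: pass the SaveConfig to the hook itself, or set it explicitly on each collection of interest (the Keras hook and the collection names below are illustrative, not the reporter's code):

```
import os, uuid
import smdebug.tensorflow as smd

out_dir = os.path.join("/tmp", str(uuid.uuid4()))
hook = smd.KerasHook(
    out_dir=out_dir,
    include_collections=["weights", "gradients"],
    save_config=smd.SaveConfig(save_interval=3),  # applied at the hook level
)
# or set the interval explicitly per collection:
for name in ["weights", "gradients"]:
    hook.get_collection(name).save_config = smd.SaveConfig(save_interval=3)
```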

```
from tests.tensorflow.keras.test_keras import train_model

train_model(
    out_dir,
    save_all=True,
    use_tf_keras=True,
    save_config=SaveConfig(save_steps=[0, 10]),
    eager=False,
    steps=["train", "eval", "predict", "train"],
)
print(create_trial_fast_refresh(out_dir).tensor_names(step=10))
```

### Description of changes: - This PR reduces the runtime of the ZCC unit tests. - We should ideally move the save_all tests to nightly. #### Style and formatting: I have...

Noticed this for TF2. `helper_keras_gradtape` and `helper_keras_fit` in tensorflow2/test_keras.py