
TensorBoardOutputConfig/Sagemaker Debugger does not behave as documented

Open ando-khachatryan opened this issue 4 years ago • 9 comments

I have a working pytorch training script, which runs on my local machine. I'm trying to set it up to run on Sagemaker. I want to save data to TensorBoard, get this uploaded into S3, and visualize via TensorBoard, like the documentation says here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html#capture-real-time-tensorboard-data-from-the-debugging-hook

As mentioned, this works perfectly when running on my dev machine. On Sagemaker, I do this:

    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=f's3://{args.s3_bucket}/tensorboard/{args.noise}',
        container_local_output_path='/tb-logs'
    )

then run it using: job = PyTorch(..., tensorboard_output_config=tensorboard_output_config, ...)

Sagemaker creates the TensorBoard folder on S3 and seems to be logging everything I need. However, it also tries to log a bunch of other things which I haven't specified. When the training starts, I see this in the log:

[2020-06-29 13:44:20.083 algo-1:52 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-06-29 13:44:20.084 algo-1:52 INFO hook.py:236] Saving to /opt/ml/output/tensors
[2020-06-29 13:44:20.084 algo-1:52 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2020-06-29 13:44:20.100 algo-1:52 INFO hook.py:376] Monitoring the collections: losses
[2020-06-29 13:44:20.100 algo-1:52 INFO hook.py:437] Hook is writing from the hook with pid: 52

Sagemaker is logging the losses from my networks: they are all added into TensorBoard, with each loss holding a single value. As a result, for every run TensorBoard shows multiple directories and files. One of them contains the information that I am logging myself; all the others are created by Sagemaker. They clog the UI and IMHO should not be there.
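The "Monitoring the collections: losses" line comes from the hook that Sagemaker builds from /opt/ml/input/config/debughookconfig.json (first line of the log above). As a small illustration of reading such a config, here is a sketch with made-up field names that only approximate the real schema:

```python
import json

# Hypothetical contents of /opt/ml/input/config/debughookconfig.json;
# the real schema may differ -- field names here are illustrative only.
sample_config = """
{
  "S3OutputPath": "s3://my-bucket/debug-output",
  "HookParameters": {"save_interval": "100"},
  "CollectionConfigurations": [
    {"CollectionName": "losses", "CollectionParameters": {}}
  ]
}
"""

def monitored_collections(config_text):
    """Return the collection names the debugger hook would monitor."""
    cfg = json.loads(config_text)
    return [c["CollectionName"] for c in cfg.get("CollectionConfigurations", [])]

print(monitored_collections(sample_config))  # ['losses']
```

With only the default "losses" collection enabled, every loss module's output gets captured and written out, which matches the extra runs showing up in TensorBoard.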

How can I turn this off?
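For reference, a minimal sketch of how the default hook can be suppressed at job launch, assuming the SageMaker Python SDK v2 estimator API (`debugger_hook_config=False` skips creating the default smdebug hook; the entry point, role, and versions below are placeholders, not values from this project):

```python
# Sketch only, not a verified fix: assumes SageMaker Python SDK v2 parameter names.
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://<bucket>/tensorboard/<run>",  # placeholder path
    container_local_output_path="/tb-logs",
)

job = PyTorch(
    entry_point="train.py",            # placeholder script name
    role="<execution-role-arn>",       # placeholder IAM role
    framework_version="1.5.0",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    tensorboard_output_config=tensorboard_output_config,
    debugger_hook_config=False,        # do not create the default debugger hook
)
```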

ando-khachatryan avatar Jun 29 '20 14:06 ando-khachatryan

On TensorBoard, for every run, there are multiple directories and files. One of them contains all the information that I am logging myself. All the others are created by Sagemaker; they are clogging the UI and IMHO should not be there.

@ando-khachatryan Could you list which directories are created, and maybe share a screenshot of the UI? It would also help if you could provide a way to reproduce the problem.

Thanks.

Vikas-kum avatar Jun 29 '20 16:06 Vikas-kum

Thanks for the prompt reply. S3, TensorBoard folder: [screenshot: Screen Shot 2020-06-29 at 21 35 04]

TensorBoard, showing the "runs": [screenshot: Screen Shot 2020-06-29 at 21 20 19]

TensorBoard, showing the "runs" and the scalars: [screenshot: Screen Shot 2020-06-29 at 21 32 33]

All the scalar groups except the last two (train and validation) are added by Sagemaker. So of the four runs shown in TensorBoard, only the first contains info added by me. On the S3 screenshot, these are the three extra folders.

Sorry for not providing a reproducible example. This is a not-too-small project, but I have a git repo; I can push it and give you a small zip of the dataset. Would that work?

ando-khachatryan avatar Jun 29 '20 17:06 ando-khachatryan

Thanks for providing the screenshots.

Sorry for not providing a reproducible example. This is not-too-small project, but I have a git repo, I can push it and give you a small zip of the dataset. Would that work?

Yes, that will help. Please provide the scripts, the data, and instructions for running the script.

Vikas-kum avatar Jun 29 '20 17:06 Vikas-kum

Git repo & branch: https://github.com/ando-khachatryan/HiDDeN/tree/sagemaker-debug-issue

Data: https://ando-public-sagemaker.s3.amazonaws.com/small.zip — extract it and put the files into your S3 bucket.

% cd wm 
% cp util/sagemaker_config_example.py util/sagemaker_config.py

Open util/sagemaker_config.py and fill in the Sagemaker execution ARN.

To run on Sagemaker, you need the boto3 and sagemaker packages installed. Then:

    wm % python sagemaker_start.py unet-conv --noise 'jpeg()+blur()' --data-prefix <data-on-s3> --epochs 50 --wait --s3-bucket <bucket-name>

For instance, my data sits in data/small on my bucket, so --data-prefix is 'data/small' for me.

Let me know if I have forgotten something. Thank you once again for your time.

ando-khachatryan avatar Jun 29 '20 19:06 ando-khachatryan

Update: I found an issue in my code, fixed it and pushed. However, the problem is not completely gone. [screenshot: Screen Shot 2020-07-02 at 01 29 11]

For diagnostics, I removed all statements that write into TensorBoard, so right now I am just creating a SummaryWriter object and nothing else. TensorBoard still shows a BCEWithLogitsLoss_output_0 run, and I have no clue where it is coming from.

ando-khachatryan avatar Jul 01 '20 21:07 ando-khachatryan

I feel that this is not related to smdebug. IIRC, SMDebug doesn't create those S3 folders.
Can you confirm that you don't see those folders created when you run your script on your local machine? SMDebug's S3 output would look like this: [screenshot: Screen Shot 2020-07-01 at 7 03 04 PM]

Vikas-kum avatar Jul 02 '20 02:07 Vikas-kum

Yes, I ran it on my local machine and there is only the file within the TensorBoard folder containing the info that I write. The GLOBAL folder is not there when I run it locally.

I also tried to remove ALL statements which write to TensorBoard, and run it on Sagemaker. This would just create the file (which does not contain anything, as expected), plus it created the GLOBAL folder.

I checked the folders you mentioned. They are created on S3:

[screenshot: Screen Shot 2020-07-02 at 22 54 37]

This is the screenshot inside the events subfolder: [screenshot: Screen Shot 2020-07-02 at 22 56 37]

And this is the GLOBAL folder inside TensorBoard folder, the one that should not be there:

[screenshot: Screen Shot 2020-07-02 at 22 55 27]

It seems the file names have some similarity.

ando-khachatryan avatar Jul 02 '20 19:07 ando-khachatryan

I'm having a similar issue. Did you find a fix for this @ando-khachatryan?

matthewmoss avatar Dec 08 '20 19:12 matthewmoss

@matthewmoss Nope, ended up just deleting those redundant files every time after the simulation.
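For anyone scripting that cleanup, a small helper that picks out the object keys written by the default hook (run folders such as GLOBAL/ or BCEWithLogitsLoss_output_0/) so they could be passed to, e.g., boto3's delete_objects. The key layout below is illustrative, not taken from this project:

```python
def stray_tensorboard_keys(keys, own_run_prefix):
    """Given all object keys under the TensorBoard output path, return the
    ones outside the user's own run folder, i.e. the ones written by the
    default debugger hook rather than by the user's SummaryWriter."""
    return [k for k in keys if not k.startswith(own_run_prefix)]

# Hypothetical key layout under the TensorBoard output path.
keys = [
    "tensorboard/run-1/events.out.tfevents.123",
    "tensorboard/GLOBAL/events.out.tfevents.456",
    "tensorboard/BCEWithLogitsLoss_output_0/events.out.tfevents.789",
]
print(stray_tensorboard_keys(keys, "tensorboard/run-1/"))
# -> the GLOBAL and BCEWithLogitsLoss_output_0 keys
```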

ando-khachatryan avatar Dec 09 '20 15:12 ando-khachatryan