sagemaker-debugger
TensorBoardOutputConfig/Sagemaker Debugger does not behave as documented
I have a working pytorch training script, which runs on my local machine. I'm trying to set it up to run on Sagemaker. I want to save data to TensorBoard, get this uploaded into S3, and visualize via TensorBoard, like the documentation says here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html#capture-real-time-tensorboard-data-from-the-debugging-hook
As mentioned, this works perfectly when running on my dev machine. On Sagemaker, I do this:
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=f's3://{args.s3_bucket}/tensorboard/{args.noise}',
    container_local_output_path='/tb-logs'
)
then run it using:
job = PyTorch(..., tensorboard_output_config=tensorboard_output_config, ...)
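For context, a fuller sketch of the estimator setup (the entry point, role ARN, bucket, and instance type below are placeholders rather than my actual values, and the argument names follow SageMaker Python SDK v2):

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig

# Placeholder bucket and prefix; the real values come from my script's arguments.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://my-bucket/tensorboard/jpeg',
    container_local_output_path='/tb-logs'
)

job = PyTorch(
    entry_point='main.py',                                  # placeholder entry point
    role='arn:aws:iam::123456789012:role/SageMakerRole',    # placeholder role ARN
    framework_version='1.5.0',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
)
job.fit({'training': 's3://my-bucket/data/small'})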
Sagemaker creates the TensorBoard folder on S3 and seems to be logging everything I need. However, it is also trying to log a bunch of other things which I haven't specified. When the training starts, I see this in the log:
[2020-06-29 13:44:20.083 algo-1:52 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2020-06-29 13:44:20.084 algo-1:52 INFO hook.py:236] Saving to /opt/ml/output/tensors
[2020-06-29 13:44:20.084 algo-1:52 INFO state_store.py:67] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2020-06-29 13:44:20.100 algo-1:52 INFO hook.py:376] Monitoring the collections: losses
[2020-06-29 13:44:20.100 algo-1:52 INFO hook.py:437] Hook is writing from the hook with pid: 52
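The "Creating hook from json_config" line suggests smdebug is building its own hook from the JSON config that SageMaker drops into the container. If I understand the smdebug README correctly, the script-mode equivalent would look roughly like the sketch below; the model and loss are stand-ins, and the registration calls are my illustration of what the library does, not the container's actual code:

import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 1)             # stand-in for the real network
criterion = nn.BCEWithLogitsLoss()   # same loss type my training uses

# Build a hook from /opt/ml/input/config/debughookconfig.json,
# the file mentioned in the log above.
hook = smd.Hook.create_from_json_file()

# Registering the model and the loss module is what makes smdebug
# save the loss outputs into its own "losses" collection.
hook.register_module(model)
hook.register_loss(criterion)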
Sagemaker is logging the losses from my networks: they are all added into TensorBoard, each loss as a single value. On TensorBoard, for every run, there are multiple directories and files. One of them contains all the information that I am logging myself. All the others are created by Sagemaker; they are clogging the UI and IMHO should not be there.
How can I turn this off?
@ando-khachatryan Would you be able to provide the directories that are created, and maybe a screenshot of the UI? Also, it would help if you could provide a way to reproduce the problem.
Thanks.
Thanks for the prompt reply.
Screenshots:
S3, TensorBoard folder.
TensorBoard, showing the "runs".
TensorBoard, showing the "runs" and the scalars.
All the scalar groups except the last two (train and validation) are added by Sagemaker. So out of the four runs shown in TB, only the first one contains the info added by me. On the S3 screenshot, these are the three folders.
Sorry for not providing a reproducible example. This is a not-too-small project, but I have a git repo; I can push it and give you a small zip of the dataset. Would that work?
Thanks for providing the screenshots.
Yes, that will help. Please provide the scripts, data, and instructions to run the script.
Git repo & branch: https://github.com/ando-khachatryan/HiDDeN/tree/sagemaker-debug-issue
Data: https://ando-public-sagemaker.s3.amazonaws.com/small.zip. Extract it and put the files into your S3 bucket.
% cd wm
% cp util/sagemaker_config_example.py util/sagemaker_config.py
Open and fill in the Sagemaker execution ARN.
To run on Sagemaker, you need boto3 and sagemaker packages installed.
wm % python sagemaker_start.py unet-conv --noise 'jpeg()+blur()' --data-prefix <data-on-s3> --epochs 50 --wait --s3-bucket <bucket-name>
For instance, my data sits in data/small on my bucket, so --data-prefix is 'data/small' for me.
Let me know if I have forgotten something. Thank you once again for your time.
Update:
I found an issue in my code, fixed it, and pushed. However, the problem is not completely gone.
For diagnostics, I removed all statements which write into TensorBoard. So right now I am just creating a SummaryWriter object, and that's it. It still shows a BCEWithLogitsLoss_output_0, and I have no clue where it is coming from.
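To be concrete, the TensorBoard side of the script is now reduced to roughly this (the log_dir is just the container_local_output_path from above; the real script builds the path from its config):

from torch.utils.tensorboard import SummaryWriter

# Bare-minimum setup: the writer is created but add_scalar/add_image/etc.
# are never called, so the event file it creates stays empty.
writer = SummaryWriter(log_dir='/tb-logs')
# ... training loop runs here without touching the writer ...
writer.close()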
I feel that this is not related to smdebug. IIRC, SMDebug doesn't create those S3 folders.
Can you confirm that you don't see those folders created when you run your script on your local machine?
SMDebug s3 output would look like this -
Yes, I ran it on the local machine and there is only the file within the TensorBoard folder, which contains the info that I write. The GLOBAL folder is not there when I run it locally.
I also tried removing ALL statements which write to TensorBoard and ran it on Sagemaker. This just created the file (which does not contain anything, as expected), plus the GLOBAL folder.
I checked the folders you mentioned. They are created on S3:


And this is the GLOBAL folder inside the TensorBoard folder, the one that should not be there:

I'm having a similar issue. Did you find a fix for this @ando-khachatryan?
@matthewmoss Nope, ended up just deleting those redundant files every time after the simulation.
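For anyone landing here later: assuming the extra output really comes from the default zero-script-change debugger hook, the SageMaker SDK exposes a debugger_hook_config parameter on the estimator that can be set to False to disable that hook entirely. I have not verified on this project whether tensorboard_output_config still uploads with the hook turned off, so treat this as a sketch:

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig

# Same placeholder values as the estimator sketch near the top of the thread.
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://my-bucket/tensorboard/jpeg',
    container_local_output_path='/tb-logs'
)

job = PyTorch(
    entry_point='main.py',                                  # placeholder
    role='arn:aws:iam::123456789012:role/SageMakerRole',    # placeholder
    framework_version='1.5.0',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    tensorboard_output_config=tensorboard_output_config,
    debugger_hook_config=False,   # turn off the default smdebug hook
)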