
Sagemaker arbitrarily stops copying checkpoints to S3

Open mnslarcher opened this issue 2 years ago • 11 comments

Describe the bug Sagemaker arbitrarily stops copying checkpoints to S3: it copies them up to epoch 9 (see the logs below) and then stops, even though the logs show that more checkpoints have been generated. Similarly, it stops updating the TensorBoard events file. I have run into this problem in two separate training jobs.

To reproduce

  • Use the PyTorch container: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
  • Create checkpoints with PyTorch Lightning (a minimal sketch follows below)
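
A minimal sketch of the Lightning setup that produces checkpoints like those in the logs below; the experiment name and metric keys are taken from the logs, while the model and data module are assumed and not shown:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# The logger writes under /opt/ml/checkpoints, which SageMaker is expected to sync to S3.
tb_logger = TensorBoardLogger(
    save_dir="/opt/ml/checkpoints/tb_logs",
    name="damaged-concrete-ita-1024-f16",
)
# Keep only the best checkpoint by validation loss. With the logger above and no
# explicit dirpath, checkpoints land under .../version_0/checkpoints/ by default.
# 'val_loss' and 'val_metric' are assumed to be logged by the LightningModule.
checkpoint_callback = ModelCheckpoint(
    monitor="loss/val",
    mode="min",
    save_top_k=1,
    filename="{epoch}-{step}-{val_loss:.3f}-{val_metric:.3f}",
)
trainer = pl.Trainer(
    max_epochs=100,
    logger=tb_logger,
    callbacks=[checkpoint_callback],
)
# trainer.fit(model, datamodule=dm)  # model and datamodule omitted here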

Expected behavior Every time a checkpoint is generated, it is copied to S3.

Screenshots or logs
2022-04-11T16:17:42.292+02:00 [2022-04-11 14:17:41,719][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 0, global step 2048: 'loss/val' reached 0.41547 (best 0.41547), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=0-step=2048-val_loss=0.415-val_metric=0.751.ckpt' as top 1
2022-04-11T16:36:41.905+02:00 [2022-04-11 14:36:41,197][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 1, global step 4096: 'loss/val' reached 0.40734 (best 0.40734), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=1-step=4096-val_loss=0.407-val_metric=0.757.ckpt' as top 1
2022-04-11T17:33:41.761+02:00 [2022-04-11 15:33:41,324][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 4, global step 10240: 'loss/val' reached 0.39793 (best 0.39793), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=4-step=10240-val_loss=0.398-val_metric=0.763.ckpt' as top 1
2022-04-11T17:52:42.372+02:00 [2022-04-11 15:52:41,687][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 5, global step 12288: 'loss/val' reached 0.36040 (best 0.36040), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=5-step=12288-val_loss=0.360-val_metric=0.799.ckpt' as top 1
2022-04-11T18:11:42.983+02:00 [2022-04-11 16:11:42,201][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 6, global step 14336: 'loss/val' reached 0.35563 (best 0.35563), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=6-step=14336-val_loss=0.356-val_metric=0.807.ckpt' as top 1
2022-04-11T18:33:23.635+02:00 [2022-04-11 16:33:22,824][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 7, global step 16384: 'loss/val' reached 0.34574 (best 0.34574), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=7-step=16384-val_loss=0.346-val_metric=0.812.ckpt' as top 1
2022-04-11T18:52:23.252+02:00 [2022-04-11 16:52:23,118][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 8, global step 18432: 'loss/val' reached 0.34121 (best 0.34121), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=8-step=18432-val_loss=0.341-val_metric=0.811.ckpt' as top 1
2022-04-11T19:11:24.870+02:00 [2022-04-11 17:11:24,002][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 9, global step 20480: 'loss/val' reached 0.33820 (best 0.33820), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=9-step=20480-val_loss=0.338-val_metric=0.816.ckpt' as top 1
2022-04-11T23:18:33.889+02:00 [2022-04-11 21:18:33,078][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 22, global step 47104: 'loss/val' reached 0.33061 (best 0.33061), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=22-step=47104-val_loss=0.331-val_metric=0.818.ckpt' as top 1
2022-04-12T00:15:35.726+02:00 [2022-04-11 22:15:34,962][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 25, global step 53248: 'loss/val' reached 0.32811 (best 0.32811), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=25-step=53248-val_loss=0.328-val_metric=0.823.ckpt' as top 1
2022-04-12T00:34:36.338+02:00 [2022-04-11 22:34:35,541][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 26, global step 55296: 'loss/val' reached 0.32683 (best 0.32683), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=26-step=55296-val_loss=0.327-val_metric=0.824.ckpt' as top 1
2022-04-12T00:53:36.952+02:00 [2022-04-11 22:53:36,407][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 27, global step 57344: 'loss/val' reached 0.32559 (best 0.32559), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=27-step=57344-val_loss=0.326-val_metric=0.823.ckpt' as top 1
2022-04-12T01:12:37.564+02:00 [2022-04-11 23:12:36,826][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 28, global step 59392: 'loss/val' reached 0.32446 (best 0.32446), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=28-step=59392-val_loss=0.324-val_metric=0.824.ckpt' as top 1
2022-04-12T08:48:49.344+02:00 [2022-04-12 06:48:48,489][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 52, global step 108544: 'loss/val' reached 0.32346 (best 0.32346), saving model to '/opt/ml/c

System information
  • PyTorch container: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
  • Instance type: ml.p3.8xlarge
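
For completeness, checkpoint upload is configured on the estimator via checkpoint_s3_uri / checkpoint_local_path; a minimal sketch of that configuration (bucket, role, and entry point are placeholders, not the actual job settings):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder
    source_dir="code",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    framework_version="1.9.1",
    py_version="py38",
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    # Everything written under checkpoint_local_path inside the container is
    # supposed to be synced continuously to checkpoint_s3_uri.
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
)
estimator.fit("s3://my-bucket/training-data/")  # placeholder input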

mnslarcher avatar Apr 12 '22 07:04 mnslarcher

Can confirm; this also happens with the TensorBoard output. This is our simplified script for reproducing it:

Sagemaker script:

from sagemaker.pytorch.estimator import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig

TENSORBOARD_S3_PATH = "s3://something/test-tb/"

def create_estimator():
    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=TENSORBOARD_S3_PATH,
        container_local_output_path="/opt/ml/output/tensorboard",
    )
    estimator = PyTorch(
        entry_point="run.py",
        framework_version="1.10.0",
        py_version="py38",
        role="role",
        instance_count=1,
        instance_type="ml.g4dn.2xlarge",
        output_path="s3://something/tb-test/",
        code_location="s3://something/tb-test",
        tensorboard_output_config=tensorboard_output_config,
        source_dir="code",
        base_job_name="dbg-tb2",
    )
    return estimator


if __name__ == "__main__":
    estimator = create_estimator()
    estimator.fit()

code/run.py

import os
import time
from torch.utils.tensorboard import SummaryWriter

TENSORBOARD_DIR = "/opt/ml/output/tensorboard"

for epoch in range(1000):
    time.sleep(2)
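    # A new SummaryWriter each iteration creates a new events file under train/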
    logger = SummaryWriter(f"{TENSORBOARD_DIR}/train")
    for k in range(6):
        logger.add_scalar(f"my_key_{k}", float(epoch*47*k), epoch) 
    logger.close()
    print("done")
    os.system("ls -ltr /opt/ml/output/tensorboard/train | tail")

code/requirements.txt

tensorboard==2.7.0

Note that we are not doing any training, just logging to TensorBoard. After around 200 epochs, the TB logs stop being copied. We can see this in the logs as the contents of the tensorboard directory in the container:

-rw-r--r-- 1 root root   0 Apr 12 08:40 events.out.tfevents.1649752818.algo-1.32.191.sagemaker-uploaded
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752820.algo-1.32.192
-rw-r--r-- 1 root root   0 Apr 12 08:40 events.out.tfevents.1649752820.algo-1.32.192.sagemaker-uploaded
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752822.algo-1.32.193
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752824.algo-1.32.194
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752826.algo-1.32.195
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752828.algo-1.32.196
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752830.algo-1.32.197
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752832.algo-1.32.198
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752834.algo-1.32.199

The same is visible in the S3 bucket. And the training job just reports failure a while after finishing.

Dumping the contents of the directory should not affect anything (that is just for our own debugging, to check whether the files are present). The bug is also present when we use only one file.

Small update: when the logger is created and closed outside of the for loop, this shows up in the logs (note the time difference):

-rw-r--r-- 1 root root      0 Apr 12 09:07 events.out.tfevents.1649753852.algo-1.32.0.sagemaker-uploaded
-rw-r--r-- 1 root root 140260 Apr 12 09:14 events.out.tfevents.1649753852.algo-1.32.0

And the file in S3 is stale.

Also, it does not matter whether you flush before closing or not.
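
For reference, the single-writer variant described above is essentially the following (same container paths as in run.py; the flush() call is the one that, as noted, makes no difference):

import time
from torch.utils.tensorboard import SummaryWriter

TENSORBOARD_DIR = "/opt/ml/output/tensorboard"

# A single writer for the whole run: everything goes into one events file,
# which keeps growing inside the container but stays stale in S3.
logger = SummaryWriter(f"{TENSORBOARD_DIR}/train")
for epoch in range(1000):
    time.sleep(2)
    for k in range(6):
        logger.add_scalar(f"my_key_{k}", float(epoch * 47 * k), epoch)
    logger.flush()  # optional; makes no difference either way
logger.close()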

usamec avatar Apr 12 '22 09:04 usamec

I'll also add that when I tried to stop the same job, it remained pending for a long time; it took more than one day to stop. Now I have a similar problem with another job, and I think the two problems are connected:

Stopping Apr 12, 2022 08:51 UTC Apr 12, 2022 08:54 UTC Stopping the training job
Uploading Apr 12, 2022 08:54 UTC - Uploading generated training model

mnslarcher avatar Apr 12 '22 12:04 mnslarcher

Dear maintainers. Given that your product is right now unusable for any longer training run on spot instances (checkpointing is broken) and real-time monitoring is impossible without external services (since TB logs are broken), I would expect at least some response from your team. Otherwise I really do not know why we are paying a premium for Sagemaker training compared to ordinary EC2 instances.

usamec avatar Apr 15 '22 19:04 usamec

Edit: the bug is still present. But there might be a difference between running on an m5.large instance (bug not present) and on a g4dn.2xlarge instance (bug present).

Edit 2: the bug is present on an m5.large instance as well, but manifests after a larger number of writes.

usamec avatar Apr 19 '22 08:04 usamec

As of today, the bug is still present.

usamec avatar Apr 26 '22 07:04 usamec

We have faced the same issue: in a 50-epoch training run, the last checkpoint that actually synced to S3 was from epoch 18. We lost several days of training and it cost us immensely (this was a 32 x A100 job). The training job hangs after finishing training, and is still pending after manually stopping it (costing us even more for nothing). It would be great to find a solution to this problem ASAP.

z-fabian avatar Apr 28 '22 18:04 z-fabian

As of today, the bug is still present.

usamec avatar May 05 '22 06:05 usamec

Hi folks - sorry you are running across this. Also sorry it took this long to come across my desk! I will see if I can reproduce your error and triage.

While I do this, you should know that at AWS we tend to handle service issues like this via support tickets.

https://docs.aws.amazon.com/awssupport/latest/user/case-management.html

So for every job that you weren't happy with, what you can actually do is create a support ticket with all of the details about that job, including the ARN. Then the support team can assist you in greater detail.

I will keep looking at this, but in the future, support tickets are the way to go.

EmilyWebber avatar May 10 '22 21:05 EmilyWebber

I am really happy that somebody is finally looking into this!

However, the linked website clearly states that "If you have Basic Support, you can't create a technical support case". Besides, we also created a ticket because of the possibility of being billed for 24 hours just for an instance stuck in the Stopping state. Also, other AWS teams can respond to and resolve issues on GitHub, e.g. here: https://github.com/aws/aws-neuron-sdk/issues/398

So, where can people on the Basic Support plan create relevant support tickets? @EmilyWebber

usamec avatar May 13 '22 08:05 usamec

Understood - I'm checking with the other AWS team reps now on this.

EmilyWebber avatar May 13 '22 16:05 EmilyWebber

It seems that the bug has disappeared, at least for the couple of test runs I did.

However, if it was fixed, we should be notified. Otherwise our trust in AWS goes down.

usamec avatar May 30 '22 15:05 usamec

Thank you for commenting again @usamec.

It seems that the bug has disappeared, at least for the couple of test runs I did.

Would you be able to confirm that you've not seen this issue again since?

This thread is getting old. We may close it now for triaging if it has been resolved.

martinRenou avatar Nov 16 '23 10:11 martinRenou

Closing for triaging; please feel free to open another issue if this comes up again.

martinRenou avatar Nov 22 '23 13:11 martinRenou