sagemaker-python-sdk
Sagemaker arbitrarily stops copying checkpoints to S3
Describe the bug Sagemaker arbitrarily stops copying checkpoints to S3: it copies them up to epoch 9 (see logs below) and then stops, even though the logs show that more checkpoints have been generated. Similarly, it stops updating the TensorBoard events file. I have seen this problem in two separate training jobs.
To reproduce
- Use the PyTorch container: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
- Create checkpoints with PyTorch Lightning (a minimal sketch of the assumed setup follows)
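For context, the setup assumed here is the usual SageMaker checkpoint wiring: the estimator's checkpoint_s3_uri points at an S3 prefix, checkpoint_local_path points at /opt/ml/checkpoints inside the container, and the training script saves checkpoints into that directory. The sketch below uses placeholder names (bucket, role, entry point) and is not the exact job configuration from this report:

from sagemaker.pytorch.estimator import PyTorch

# SageMaker continuously syncs checkpoint_local_path in the container
# to checkpoint_s3_uri while the job is running.
estimator = PyTorch(
    entry_point="train.py",                            # placeholder entry point
    source_dir="code",
    role="role",                                       # placeholder IAM role
    framework_version="1.9.1",
    py_version="py38",
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",   # placeholder bucket
    checkpoint_local_path="/opt/ml/checkpoints",
)
estimator.fit()

In the training script, a standard pytorch_lightning.callbacks.ModelCheckpoint with dirpath under /opt/ml/checkpoints, monitor="loss/val", and save_top_k=1 would produce the checkpoint files named in the logs below.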
Expected behavior Every time a checkpoint is generated, it is copied to S3.
Screenshots or logs
2022-04-11T16:17:42.292+02:00 [2022-04-11 14:17:41,719][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 0, global step 2048: 'loss/val' reached 0.41547 (best 0.41547), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=0-step=2048-val_loss=0.415-val_metric=0.751.ckpt' as top 1
2022-04-11T16:36:41.905+02:00 [2022-04-11 14:36:41,197][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 1, global step 4096: 'loss/val' reached 0.40734 (best 0.40734), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=1-step=4096-val_loss=0.407-val_metric=0.757.ckpt' as top 1
2022-04-11T17:33:41.761+02:00 [2022-04-11 15:33:41,324][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 4, global step 10240: 'loss/val' reached 0.39793 (best 0.39793), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=4-step=10240-val_loss=0.398-val_metric=0.763.ckpt' as top 1
2022-04-11T17:52:42.372+02:00 [2022-04-11 15:52:41,687][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 5, global step 12288: 'loss/val' reached 0.36040 (best 0.36040), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=5-step=12288-val_loss=0.360-val_metric=0.799.ckpt' as top 1
2022-04-11T18:11:42.983+02:00 [2022-04-11 16:11:42,201][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 6, global step 14336: 'loss/val' reached 0.35563 (best 0.35563), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=6-step=14336-val_loss=0.356-val_metric=0.807.ckpt' as top 1
2022-04-11T18:33:23.635+02:00 [2022-04-11 16:33:22,824][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 7, global step 16384: 'loss/val' reached 0.34574 (best 0.34574), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=7-step=16384-val_loss=0.346-val_metric=0.812.ckpt' as top 1
2022-04-11T18:52:23.252+02:00 [2022-04-11 16:52:23,118][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 8, global step 18432: 'loss/val' reached 0.34121 (best 0.34121), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=8-step=18432-val_loss=0.341-val_metric=0.811.ckpt' as top 1
2022-04-11T19:11:24.870+02:00 [2022-04-11 17:11:24,002][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 9, global step 20480: 'loss/val' reached 0.33820 (best 0.33820), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=9-step=20480-val_loss=0.338-val_metric=0.816.ckpt' as top 1
2022-04-11T23:18:33.889+02:00 [2022-04-11 21:18:33,078][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 22, global step 47104: 'loss/val' reached 0.33061 (best 0.33061), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=22-step=47104-val_loss=0.331-val_metric=0.818.ckpt' as top 1
2022-04-12T00:15:35.726+02:00 [2022-04-11 22:15:34,962][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 25, global step 53248: 'loss/val' reached 0.32811 (best 0.32811), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=25-step=53248-val_loss=0.328-val_metric=0.823.ckpt' as top 1
2022-04-12T00:34:36.338+02:00 [2022-04-11 22:34:35,541][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 26, global step 55296: 'loss/val' reached 0.32683 (best 0.32683), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=26-step=55296-val_loss=0.327-val_metric=0.824.ckpt' as top 1
2022-04-12T00:53:36.952+02:00 [2022-04-11 22:53:36,407][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 27, global step 57344: 'loss/val' reached 0.32559 (best 0.32559), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=27-step=57344-val_loss=0.326-val_metric=0.823.ckpt' as top 1
2022-04-12T01:12:37.564+02:00 [2022-04-11 23:12:36,826][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 28, global step 59392: 'loss/val' reached 0.32446 (best 0.32446), saving model to '/opt/ml/checkpoints/tb_logs/damaged-concrete-ita-1024-f16/version_0/checkpoints/epoch=28-step=59392-val_loss=0.324-val_metric=0.824.ckpt' as top 1
2022-04-12T08:48:49.344+02:00 [2022-04-12 06:48:48,489][pytorch_lightning.utilities.rank_zero][INFO] - Epoch 52, global step 108544: 'loss/val' reached 0.32346 (best 0.32346), saving model to '/opt/ml/c
System information
- PyTorch container: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
- Instance type: ml.p3.8xlarge
Can confirm, this also happens with TensorBoard logs. This is our simplified script for reproducibility:
Sagemaker script:
from sagemaker.pytorch.estimator import PyTorch
from sagemaker.debugger import TensorBoardOutputConfig

TENSORBOARD_S3_PATH = "s3://something/test-tb/"

def create_estimator():
    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=TENSORBOARD_S3_PATH,
        container_local_output_path="/opt/ml/output/tensorboard",
    )
    estimator = PyTorch(
        entry_point="run.py", framework_version="1.10.0", py_version="py38",
        role="role",
        instance_count=1, instance_type="ml.g4dn.2xlarge",
        output_path="s3://something/tb-test/",
        code_location="s3://something/tb-test",
        tensorboard_output_config=tensorboard_output_config,
        source_dir="code",
        base_job_name="dbg-tb2",
    )
    return estimator

if __name__ == "__main__":
    estimator = create_estimator()
    estimator.fit()
code/run.py
import os
import time

from torch.utils.tensorboard import SummaryWriter

TENSORBOARD_DIR = "/opt/ml/output/tensorboard"

for epoch in range(1000):
    time.sleep(2)
    logger = SummaryWriter(f"{TENSORBOARD_DIR}/train")
    for k in range(6):
        logger.add_scalar(f"my_key_{k}", float(epoch*47*k), epoch)
    logger.close()
    print("done")
    os.system("ls -ltr /opt/ml/output/tensorboard/train | tail")
code/requirements.txt
tensorboard==2.7.0
Note that we are not doing any training, just logging to TensorBoard. After around 200 epochs, TB logs stop being copied. We see this in the job logs as the contents of the tensorboard directory in the container:
-rw-r--r-- 1 root root 0 Apr 12 08:40 events.out.tfevents.1649752818.algo-1.32.191.sagemaker-uploaded
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752820.algo-1.32.192
-rw-r--r-- 1 root root 0 Apr 12 08:40 events.out.tfevents.1649752820.algo-1.32.192.sagemaker-uploaded
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752822.algo-1.32.193
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752824.algo-1.32.194
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752826.algo-1.32.195
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752828.algo-1.32.196
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752830.algo-1.32.197
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752832.algo-1.32.198
-rw-r--r-- 1 root root 322 Apr 12 08:40 events.out.tfevents.1649752834.algo-1.32.199
The same is visible in the S3 bucket. And the training job just reports failure a while after finishing.
Dumping the contents of the directory should not affect anything (that is just for our debugging, to check whether the files are present). The bug is also present when we use only one events file.
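For anyone double-checking the S3 side, here is a minimal sketch (the bucket and prefix are placeholders standing in for TENSORBOARD_S3_PATH above) that lists the uploaded event files with size and last-modified time, so stale uploads are easy to spot against the container listing:

import boto3

# Placeholders; substitute the bucket/prefix from TENSORBOARD_S3_PATH.
BUCKET = "something"
PREFIX = "test-tb/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Print every uploaded TensorBoard event file with its size and last-modified
# time, to compare against the ls output from inside the container.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if "tfevents" in obj["Key"]:
            print(obj["LastModified"], obj["Size"], obj["Key"])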
Small update: when creating and closing the logger outside of the for loop, this shows up in the logs (note the time difference):
-rw-r--r-- 1 root root 0 Apr 12 09:07 events.out.tfevents.1649753852.algo-1.32.0.sagemaker-uploaded
-rw-r--r-- 1 root root 140260 Apr 12 09:14 events.out.tfevents.1649753852.algo-1.32.0
And the file in S3 is stale.
Also it does not matter whether you flush before close or not.
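For completeness, a minimal sketch of that variant as we understand it, reusing the names from run.py above (one writer for the whole run instead of one per epoch); this is an illustration, not the exact script:

import time

from torch.utils.tensorboard import SummaryWriter

TENSORBOARD_DIR = "/opt/ml/output/tensorboard"

# Variant: a single writer kept open for the whole run.
logger = SummaryWriter(f"{TENSORBOARD_DIR}/train")
for epoch in range(1000):
    time.sleep(2)
    for k in range(6):
        logger.add_scalar(f"my_key_{k}", float(epoch * 47 * k), epoch)
    logger.flush()  # per the comment above, flushing or not makes no difference
logger.close()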
I will also add that the same job, once I tried to stop it, remained pending for a long time; it took more than one day to stop. Now I have a similar problem with another job, and I think the two problems are connected:
Status | Start time | End time | Description
---|---|---|---
Stopping | Apr 12, 2022 08:51 UTC | Apr 12, 2022 08:54 UTC | Stopping the training job
Uploading | Apr 12, 2022 08:54 UTC | - | Uploading generated training model
Dear maintainers, given that your product is right now unusable for any longer training on spot instances (checkpointing is broken) and real-time monitoring is impossible without external services (since TB logs are broken), I would expect at least some response from your team. Otherwise I really do not know why we are paying a premium for Sagemaker training compared to ordinary EC2 instances.
Edit: The bug is still present. But there might be a difference between running an m5.large instance (bug not present) and a g4dn.2xlarge instance (bug present).
Edit 2: The bug is present on an m5.large instance too, but it manifests after a larger number of writes.
As of today, the bug is still present.
We have faced the same issue: in a 50-epoch training run, the last checkpoint that actually synced to S3 is from epoch 18. We have lost several days of training and it cost us immensely (this was a 32 x A100 job). The training job hangs after finishing training, and is still pending after we manually stopped it (costing us even more for nothing). It would be great to find a solution to this problem ASAP.
As of today, the bug is still present.
Hi folks - sorry you are running across this. Also sorry it took this long to come across my desk! I will see if I can reproduce your error and triage.
While I do this, you should know that at AWS we tend to handle service issues like this via support tickets.
https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
So for every job that you weren't happy with, what you can do is create a support ticket with all of the details about that job, including its ARN. Then the support team can assist you in greater detail.
I will keep looking at this, but in the future, support tickets are the way to go.
I am really happy that somebody is finally looking into this!
However, the linked page clearly states that "If you have Basic Support, you can't create a technical support case". Besides, we also created a ticket due to the possibility of being billed for 24 hours just for an instance stuck in the stopping state. Also, other AWS teams do respond to and resolve issues on GitHub, e.g. here: https://github.com/aws/aws-neuron-sdk/issues/398
So, where can people on the Basic Support plan create the relevant support tickets? @EmilyWebber
Understood - I'm checking with the other AWS team reps now on this.
It seems that the bug has disappeared, at least for the couple of test runs I did.
However, if it was fixed, we should be notified. Otherwise our trust in AWS goes down.
Thank you for commenting again @usamec.
It seems that the bug has disappeared, at least for the couple of test runs I did.
Would you be able to confirm that you've not seen this issue again since?
This thread is getting old. We may close it now for triaging purposes if it has been resolved.
Closing for triaging; please feel free to open another issue if this comes up again.