
OSError: [Errno 28] No space left on device

Open rileyhun opened this issue 1 year ago • 1 comments

Hello all,

Unfortunately, I ran out of disk space while training the 3B model. I'm using a p3.16xlarge instance, and it ran out of space at around epoch 0.3. Any advice on how to resolve this?

Full error:

2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] 2023-04-29 06:01:12 ERROR [__main__] main failed
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] Traceback (most recent call last):
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160]   File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160]     _save(obj, opened_zipfile, pickle_module, pickle_protocol)
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160]   File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 604, in _save
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160]     zip_file.write_record(name, storage.data_ptr(), num_bytes)
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] OSError: [Errno 28] No space left on device

rileyhun avatar Apr 29 '23 06:04 rileyhun

You ran out of disk space, that's all. Where are you saving checkpoints and other output? Sometimes the local root volume is small, and most of the storage is in mounted EBS volumes.

srowen avatar Apr 29 '23 15:04 srowen
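
To see which volume is actually filling up, it can help to check free space on the candidate locations before training. The following is a minimal sketch using only the standard library; the helper name `free_space_report` and the example paths are hypothetical, so substitute your own checkpoint directory and mount points.

```python
import shutil


def free_space_report(paths):
    """Return a dict mapping each path to its free space in GiB."""
    report = {}
    for path in paths:
        # disk_usage returns a named tuple (total, used, free), all in bytes
        usage = shutil.disk_usage(path)
        report[path] = usage.free / 1024**3
    return report


if __name__ == "__main__":
    # Hypothetical locations: the root volume and the current working directory.
    for path, free_gib in free_space_report(["/", "."]).items():
        print(f"{path}: {free_gib:.1f} GiB free")
```

If the checkpoint directory sits on the small root volume, pointing the output directory at a larger mounted volume is one way to address the error.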

Yes, you're correct. I was saving checkpoints to a local folder, so I just set the save_steps argument to a larger number and load_best_model_at_end to False.

rileyhun avatar Apr 29 '23 19:04 rileyhun
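
As a rough illustration of why a larger save_steps helps, the sketch below estimates the disk space taken by retained checkpoints. The function name and all numbers (total steps, 30 GiB per checkpoint) are hypothetical assumptions, not measurements from this run. The `save_total_limit` parameter mirrors the Hugging Face `TrainingArguments` option of the same name, which was not mentioned in the thread but also caps how many checkpoints are kept.

```python
def checkpoint_disk_gib(total_steps, save_steps, checkpoint_gib, save_total_limit=None):
    """Estimate disk space (GiB) used by retained checkpoints.

    Assumes one checkpoint is written every `save_steps` steps and that
    every checkpoint has the same size (`checkpoint_gib`).
    """
    n_saved = total_steps // save_steps
    if save_total_limit is not None:
        # Only the most recent `save_total_limit` checkpoints stay on disk.
        n_saved = min(n_saved, save_total_limit)
    return n_saved * checkpoint_gib


if __name__ == "__main__":
    # Hypothetical run: 1000 steps, 30 GiB per checkpoint.
    print(checkpoint_disk_gib(1000, 200, 30))      # 150
    print(checkpoint_disk_gib(1000, 1000, 30))     # 30
    print(checkpoint_disk_gib(1000, 200, 30, 2))   # 60
```

Under these assumptions, saving five times less often cuts checkpoint storage by the same factor, and capping retained checkpoints bounds it regardless of run length.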