dolly
OSError: [Errno 28] No space left on device
Hello all,
Unfortunately, I ran out of disk space while training the 3B model. I'm using a p3.16xlarge
instance, and it ran out of space at epoch 0.3. Any advice on how to resolve this?
Full error:
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] 2023-04-29 06:01:12 ERROR [__main__] main failed
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] Traceback (most recent call last):
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 379, in save
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] _save(obj, opened_zipfile, pickle_module, pickle_protocol)
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 604, in _save
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] zip_file.write_record(name, storage.data_ptr(), num_bytes)
2023-04-29 06:01:12.977 [177/train/578 (pid 28936)] [09c561ee-a44c-4795-967c-7183a83b3160] OSError: [Errno 28] No space left on device
You ran out of disk space, that's all. Where are you saving checkpoints and other output? Sometimes the local root volume is small, and most of the storage is on mounted EBS volumes.
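To see which volume actually has room, something like the following may help. It is only a minimal sketch using Python's standard shutil.disk_usage; the paths in the list are placeholders (assumptions, not paths from this thread), so swap in your own checkpoint directory and mount points. On Linux, `df -h` reports the same information from the shell.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


# Placeholder locations (assumptions): replace with your checkpoint
# directory and any mounted volumes you want to compare.
for candidate in ["/", "."]:
    print(f"{candidate}: {free_gib(candidate):.1f} GiB free")
```

If the root volume turns out to be the small one, pointing the checkpoint directory at the larger mounted volume is usually the simplest fix.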
Yes, you're correct. I am saving checkpoints to a local folder. So I just set the save_steps argument to a larger number, so that checkpoints are written less often, and set load_best_model_at_end to False.
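For anyone sizing this up, here is a rough back-of-the-envelope sketch of why a larger save_steps helps. The step count and per-checkpoint size are made-up numbers for illustration, not measurements from this run, and the estimate assumes old checkpoints are never deleted.

```python
def checkpoints_written(total_steps: int, save_steps: int) -> int:
    """Number of checkpoints saved when one is written every `save_steps` steps."""
    return total_steps // save_steps


def checkpoint_disk_gib(total_steps: int, save_steps: int, checkpoint_gib: float) -> float:
    """Total disk used by checkpoints, assuming none are ever deleted."""
    return checkpoints_written(total_steps, save_steps) * checkpoint_gib


# Hypothetical values (assumptions): 10,000 training steps and 30 GiB per checkpoint.
TOTAL_STEPS = 10_000
CHECKPOINT_GIB = 30.0

for save_steps in (200, 2_000):
    count = checkpoints_written(TOTAL_STEPS, save_steps)
    size = checkpoint_disk_gib(TOTAL_STEPS, save_steps, CHECKPOINT_GIB)
    print(f"save_steps={save_steps}: {count} checkpoints, about {size:.0f} GiB")
```

With these assumed numbers, raising save_steps from 200 to 2,000 cuts the checkpoint count from 50 to 5 and the disk footprint from about 1,500 GiB to about 150 GiB.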