"Bus error (core dumped)" when saving recipe state after restarting training
(machine: Fedora Linux, 1x H100 GPU)
First, run: tune run lora_finetune_single_device --config llama3/8B_lora_single_device seed=0 epochs=4 max_steps_per_epoch=2
Interrupt the run after the first epoch's checkpoint has been successfully saved to disk.
Then resume with: tune run lora_finetune_single_device --config llama3/8B_lora_single_device seed=0 epochs=4 max_steps_per_epoch=2 checkpointer.checkpoint_dir=<out_dir> checkpointer.adapter_checkpoint=adapter_0.pt checkpointer.recipe_checkpoint=recipe_state.pt resume_from_checkpoint=True
When the resumed run writes its next checkpoint, it consistently crashes with Bus error (core dumped) after writing 804mb of the 11mb recipe state.
If I remove the optimizer state from the recipe state, it saves successfully. Weirdly, calling torch.save on just the optimizer state also succeeds.
Setting mmap=False when loading the recipe state fixes the issue.
Writing the recipe state to a new file instead of overwriting the old one also fixes the issue.
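A minimal sketch of the "write to a new file" workaround, using only the Python standard library. It rests on an unconfirmed guess: that the loaded state stays memory-mapped to recipe_state.pt, so overwriting that file in place invalidates pages that are still being read. The helper names save_via_temp_file and pickle_save are hypothetical. In the recipe the saver would presumably be torch.save, and this is not taken from the torchtune code.

```python
import os
import pickle
import tempfile


def save_via_temp_file(save_fn, obj, path):
    """Write obj to a temporary file next to path, then rename it over path.

    save_fn(obj, file_path) is any saver. In the recipe it would presumably
    be torch.save (an assumption, not checked against torchtune).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        # os.replace swaps the directory entry. The old file is never
        # truncated in place, so on Linux any existing mapping of it
        # keeps pointing at the old, intact data.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave the old checkpoint alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def pickle_save(obj, file_path):
    # Stand-in saver so the sketch runs without torch installed.
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)
```

Calling save_via_temp_file(pickle_save, state, "recipe_state.pt") twice leaves a single recipe_state.pt holding the latest state. A side benefit is that a crash mid-save leaves the previous checkpoint untouched.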
cc @kartikayk
Closed by #1027