torchtune "Bus error (core dumped)" when saving recipe state after restarting training

(machine: Fedora Linux, 1x H100 GPU)

Run tune run lora_finetune_single_device --config llama3/8B_lora_single_device seed=0 epochs=4 max_steps_per_epoch=2

Then interrupt the run after the checkpoint for one epoch has been successfully saved to disk.

Then run tune run lora_finetune_single_device --config llama3/8B_lora_single_device seed=0 epochs=4 max_steps_per_epoch=2 checkpointer.checkpoint_dir=<out_dir> checkpointer.adapter_checkpoint=adapter_0.pt checkpointer.recipe_checkpoint=recipe_state.pt resume_from_checkpoint=True

Then when writing the next checkpoint, I get Bus error (core dumped) consistently after writing 804mb of the 11mb recipe state.

If I remove the optimizer state from the recipe state, it saves successfully. Weirdly, if I call torch.save on just the optimizer state, that also saves successfully.

Setting mmap=False when loading the recipe state fixes the issue.
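For reference, a minimal sketch of that workaround, assuming the recipe state is read with torch.load. The helper name load_recipe_state and its loader argument are hypothetical (the argument only exists so the sketch can be exercised without PyTorch installed); this is not torchtune's actual loading code.

```python
from typing import Any, Callable, Optional


def load_recipe_state(path: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load a recipe state file fully into memory instead of memory-mapping it.

    `loader` is a hypothetical hook; when omitted, torch.load is used.
    """
    if loader is None:
        import torch  # imported lazily so the sketch does not require PyTorch up front

        loader = torch.load
    # mmap=False makes the loaded tensors own their memory, so they do not
    # stay backed by the checkpoint file that a later save will overwrite.
    return loader(path, map_location="cpu", mmap=False)
```

With PyTorch installed, load_recipe_state("<out_dir>/recipe_state.pt") would read the file from the repro above. A plausible explanation for why this helps, not confirmed in this thread, is that tensors loaded with mmap=True stay backed by recipe_state.pt, and overwriting that same file while they are still in use invalidates the mapping.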

Writing the recipe state to a new file instead of overwriting the old one also fixes the issue.
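A rough sketch of the "write to a new file" idea, assuming it is acceptable to write to a temporary file in the same directory and then rename it over the old path. The rename step is my addition and was not tested in this issue. The names save_via_new_file and _pickle_save are hypothetical; the pickle-based default only keeps the example self-contained, and torch.save could be passed as save_fn instead.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def _pickle_save(obj: Any, path: str) -> None:
    # Stand-in for torch.save(obj, path) so the sketch runs without PyTorch.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def save_via_new_file(
    obj: Any, path: str, save_fn: Callable[[Any, str], None] = _pickle_save
) -> None:
    """Write obj to a fresh file, then move it into place at path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on
    # the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(obj, tmp_path)
        # The old file is never opened for writing; its name is simply
        # pointed at the new file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling save_via_new_file(state, "<out_dir>/recipe_state.pt", save_fn=torch.save) would leave the previous recipe_state.pt untouched until the new file is completely written.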

May 24 '24 17:05 calvinpelletier

cc @kartikayk

May 24 '24 17:05 ebsmothers

Closed by #1027

Jun 10 '24 14:06 joecummings