fms-fsdp [speculator training] Support for loading different HF checkpoints for speculator training

Open • pavi2707 opened this issue 10 months ago • 1 comment

While training a speculator on the specu-train branch, I get an OOM error when loading a checkpoint in HuggingFace format whose model_type is "gpt_megatron". The script works fine for Llama checkpoints with model_type "llama".

Checkpoint folder structure: [screenshot, 2024-03-28 12:57:24 PM]

Observed error: [screenshot, 2024-03-28 12:58:55 PM]

Mar 28 '24 17:03 pavi2707
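
To confirm which model_type a given checkpoint declares, a minimal sketch like the one below may help. It assumes the checkpoint folder contains the usual HuggingFace `config.json` with a `model_type` key; the helper name `read_model_type` is hypothetical and not part of fms-fsdp.

```python
import json
from pathlib import Path


def read_model_type(ckpt_dir):
    """Return the "model_type" value from config.json in ckpt_dir, or None.

    Hypothetical helper: assumes a HuggingFace-style checkpoint folder
    with a config.json at its top level.
    """
    config_path = Path(ckpt_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open("r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("model_type")


# Example: print(read_model_type("/path/to/checkpoint"))
```

Comparing the value for the failing checkpoint against a working Llama checkpoint shows whether the two are being routed through different loading paths.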

What are the sizes of the files, especially pytorch_model.bin? Do we have a safetensors version? How are we loading it?

Mar 28 '24 22:03 nairbv
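
The first two questions above can be answered with a short script. This is only a sketch using the Python standard library; the helper names `summarize_checkpoint` and `format_size` are hypothetical, and it inspects only files directly inside the checkpoint folder.

```python
from pathlib import Path


def summarize_checkpoint(ckpt_dir):
    """Return (sizes, has_safetensors) for files directly inside ckpt_dir.

    sizes maps each file name to its size in bytes; has_safetensors is
    True if any file name ends with ".safetensors".
    """
    root = Path(ckpt_dir)
    sizes = {
        p.name: p.stat().st_size
        for p in sorted(root.iterdir())
        if p.is_file()
    }
    has_safetensors = any(name.endswith(".safetensors") for name in sizes)
    return sizes, has_safetensors


def format_size(num_bytes):
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 2**30:.2f} GiB"


# Example:
# sizes, has_st = summarize_checkpoint("/path/to/checkpoint")
# for name, size in sizes.items():
#     print(name, format_size(size))
# print("safetensors present:", has_st)
```

A single very large pytorch_model.bin with no safetensors counterpart would be worth reporting in the thread, since it bears directly on how the checkpoint is being loaded.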
