fms-fsdp
[speculator training] Support for loading different HF checkpoints for speculator training
While training a speculator on the specu-train branch, I get an OOM error when trying to load a checkpoint in HuggingFace format. The model_type is "gpt_megatron". The script works fine for other Llama checkpoints with model_type "llama".
Checkpoint folder structure
Observed Error
What are the sizes of the files, especially pytorch_model.bin? Do we have a safetensors version? How are we loading it?
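A quick way to answer these questions is to inspect the checkpoint folder directly. A minimal stdlib sketch (the `inspect_checkpoint` helper and the `./checkpoint` path are hypothetical, not part of the repo):

```python
from pathlib import Path

def inspect_checkpoint(folder):
    """Report per-file sizes (in GiB) for a HuggingFace-style checkpoint
    folder, and whether any safetensors shards are present."""
    folder = Path(folder)
    sizes = {
        p.name: p.stat().st_size / 2**30  # bytes -> GiB
        for p in sorted(folder.iterdir())
        if p.is_file()
    }
    has_safetensors = any(name.endswith(".safetensors") for name in sizes)
    return sizes, has_safetensors

if __name__ == "__main__":
    # Replace with the actual checkpoint folder from the issue.
    sizes, has_st = inspect_checkpoint("./checkpoint")
    for name, gib in sizes.items():
        print(f"{name}: {gib:.2f} GiB")
    print("safetensors present:", has_st)
```

This matters because a single multi-GB `pytorch_model.bin` is fully materialized in CPU RAM by `torch.load`, while safetensors shards can be memory-mapped lazily; if only a `.bin` exists, options such as `from_pretrained(..., low_cpu_mem_usage=True)` in transformers can reduce peak memory during loading.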