torchtitan

Converting to checkpoint.pt is not working

Open viai957 opened this issue 9 months ago • 4 comments

I followed all the instructions in checkpoint.md. The command below ran successfully, but the checkpoint.pt file was not created. I searched the whole directory and did not find it anywhere.

python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outputs/checkpoint/step-500 checkpoint.pt

Output:

Converting checkpoint from torchtitan/outputs/checkpoint/step-500 to checkpoint.pt using method: 'dcp_to_torch'

viai957 avatar May 04 '24 13:05 viai957

Can anybody help me with this?

viai957 avatar May 06 '24 00:05 viai957

It's a bug in torch.distributed.checkpoint.format_utils, and it has already been fixed in the main branch ( https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/format_utils.py#L265 ). The problem was caused by a missing .value in elif args.mode == FormatMode.DCP_TO_TORCH.value:. Without .value, the mode string from the command line is compared against the enum member itself, so the dcp_to_torch branch never matches.
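To illustrate why the missing .value matters, here is a minimal, self-contained sketch. It is not the PyTorch source: the FormatMode class below is a simplified stand-in, the dispatch_buggy and dispatch_fixed helpers are hypothetical, and the silent fall-through is an assumption made to keep the example short.

```python
from enum import Enum
from typing import Optional


class FormatMode(Enum):
    # Simplified stand-in for the enum in format_utils (assumption).
    TORCH_TO_DCP = "torch_to_dcp"
    DCP_TO_TORCH = "dcp_to_torch"


def dispatch_buggy(mode: str) -> Optional[str]:
    """Hypothetical dispatcher that reproduces the missing `.value`."""
    if mode == FormatMode.TORCH_TO_DCP.value:
        return "torch_to_dcp"
    elif mode == FormatMode.DCP_TO_TORCH:  # str vs. enum member: never equal
        return "dcp_to_torch"
    return None  # no branch matched, so no conversion would run


def dispatch_fixed(mode: str) -> Optional[str]:
    """Same dispatcher, comparing the string against the enum's value."""
    if mode == FormatMode.TORCH_TO_DCP.value:
        return "torch_to_dcp"
    elif mode == FormatMode.DCP_TO_TORCH.value:  # str vs. str
        return "dcp_to_torch"
    return None


print(dispatch_buggy("dcp_to_torch"))
print(dispatch_fixed("dcp_to_torch"))
```

A plain Enum member does not compare equal to its underlying string, which is why the buggy branch is skipped even though the mode name is spelled correctly.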

I use my own script for the conversion, which is a little more customized. You can find it here: https://github.com/chrisociepa/allamo/blob/fsdp2/scripts/convert_dcp.py
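If you only need the plain conversion and your installed PyTorch still has the command-line bug, one possible workaround is to call the conversion function directly from Python. The sketch below is not the linked script. It assumes your PyTorch version exposes dcp_to_torch_save in torch.distributed.checkpoint.format_utils, and the convert_dcp_to_torch wrapper name is hypothetical.

```python
from pathlib import Path


def convert_dcp_to_torch(dcp_dir: str, out_file: str) -> Path:
    """Convert a DCP checkpoint directory into a single torch.save file.

    Hypothetical wrapper: checks the input, runs the conversion, and
    verifies that the output file actually exists afterwards.
    """
    src = Path(dcp_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"DCP checkpoint directory not found: {src}")

    # Imported here so the path check above works even without PyTorch.
    # Assumes this function is available in the installed PyTorch version.
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

    dcp_to_torch_save(str(src), out_file)

    out = Path(out_file)
    if not out.is_file():
        raise RuntimeError(f"Conversion finished but {out} was not created")
    return out


if __name__ == "__main__":
    import sys

    # Example: python convert.py torchtitan/outputs/checkpoint/step-500 checkpoint.pt
    if len(sys.argv) == 3:
        print(convert_dcp_to_torch(sys.argv[1], sys.argv[2]))
```

The explicit check on the output file is there because the original symptom was a run that appeared to succeed without writing anything.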

chrisociepa avatar May 06 '24 16:05 chrisociepa

Oh, I see. Thank you, @chrisociepa.

viai957 avatar May 07 '24 01:05 viai957

I had the same issue here, and @chrisociepa's script was useful to me.

XinDongol avatar May 07 '24 15:05 XinDongol

As @chrisociepa mentioned, the fix ( https://github.com/pytorch/pytorch/pull/1234070 ) has landed in main. Therefore, I am closing the issue.

wz337 avatar May 15 '24 22:05 wz337