HF Conversion for On-policy Distillation Trained Models
I'm trying to convert a checkpoint created by `uv run python examples/run_distillation_math.py`.

I'm running the following command:

```shell
uv run python examples/converters/convert_dcp_to_hf.py \
    --config {dir_path}/config.yaml \
    --dcp-ckpt-path {dir_path}/policy/weights/ \
    --hf-ckpt-path {output_path}
```

It fails with the following error:
```
Traceback (most recent call last):
  File "/home/coder/Uygar/nemo-rl-policy/examples/converters/convert_dcp_to_hf.py", line 73, in <module>
    main()
  File "/home/coder/Uygar/nemo-rl-policy/examples/converters/convert_dcp_to_hf.py", line 62, in main
    hf_ckpt = convert_dcp_to_hf(
              ^^^^^^^^^^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/nemo_rl/utils/native_checkpoint.py", line 242, in convert_dcp_to_hf
    dcp_to_torch_save(dcp_ckpt_path, weights_path)
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/format_utils.py", line 212, in dcp_to_torch_save
    _load_state_dict(
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 245, in _load_state_dict
    _ = distW.all_gather("read", read_data)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/utils.py", line 284, in all_gather
    raise CheckpointException(step, node_failures)
torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([0])
Traceback (most recent call last): (RANK 0)
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/utils.py", line 276, in all_gather
    result = map_fun()
             ^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/logger.py", line 87, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 240, in read_data
    all_reads = storage_reader.read_data(final_local_plan, planner)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/filesystem.py", line 829, in read_data
    with self.fs.create_stream(new_path, "rb") as stream:
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/home/coder/Uygar/nemo-rl-policy/.venv/lib/python3.12/site-packages/torch/distributed/checkpoint/filesystem.py", line 511, in create_stream
    with path.open(mode) as stream:
         ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/pathlib.py", line 1015, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/coder/Uygar/nemo-rl-policy/temp/checkpoints_policy/step_440/policy/weights/__0_0.distcp'
```
**Steps/Code to reproduce the bug**

Run the following on an on-policy distilled model:

```shell
uv run python examples/converters/convert_dcp_to_hf.py \
    --config {dir_path}/config.yaml \
    --dcp-ckpt-path {dir_path}/policy/weights/ \
    --hf-ckpt-path {output_path}
```
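The `FileNotFoundError` above means the converter expected a sharded DCP file (`__0_0.distcp`) that isn't present in the weights directory. As a quick sanity check before running the converter, you can list which shard files actually exist. This is a minimal diagnostic sketch; the `list_dcp_shards` helper is hypothetical and not part of NeMo-RL:

```python
from pathlib import Path


def list_dcp_shards(dcp_dir: str) -> list[str]:
    """Return the sorted names of DCP shard files (*.distcp) in a checkpoint dir.

    Hypothetical helper: the traceback shows dcp_to_torch_save failing on a
    missing __0_0.distcp shard, so an empty result here predicts that failure.
    """
    p = Path(dcp_dir)
    if not p.is_dir():
        raise FileNotFoundError(f"DCP checkpoint directory not found: {dcp_dir}")
    return sorted(f.name for f in p.glob("*.distcp"))
```

If this returns an empty list for `{dir_path}/policy/weights/`, the checkpoint was saved in a different layout than the converter expects, which matches the error above.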
@sharathts can you please take a look and opine?

@sharonyu-115 @zpqiu can you also help?
It seems that this is not a bug unique to on-policy distillation, but rather a bug in the checkpointing of the DTensor V2 policy path; several similar issues (#1427, #1391) have been reported previously. @terrykong
@uygarmv sorry for the delay. The quick fix is to use the DTensor V1 path:

```shell
uv run examples/run_distillation_math.py checkpointing.model_save_format=null policy.dtensor_cfg._v2=false
```

I think our colleagues will address the DTensor V2 path issue to fully resolve this bug.