Fix some bugs
Bug Fixes Summary
- Bug 1: In the `_allreduce_word_embedding_grads` function, the `embedding` might be frozen, which can cause the code to crash.
- Bug 2: `_ParamAndGradBuffer` fails to update `param_start_index` during initialization, potentially causing a parameter to be placed into multiple buckets. This triggers the assertion error: `assert param not in param_gbuf_map, ("Param should not be in param_gbuf_map; each param only belongs " "to a single bucket.")`
- Bug 3: In `on_save_checkpoint_success`, wandb's `artifact.add_reference` requires absolute paths for `checkpoint_path`. Using relative paths will result in errors.
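
To illustrate Bug 3, here is a minimal sketch of normalizing a checkpoint path before it is passed to `artifact.add_reference`. It assumes the reference is built as a `file://` URI. The helper name `to_reference_uri` and the example path are hypothetical and are not taken from the Megatron-LM code.

```python
from pathlib import Path


def to_reference_uri(checkpoint_path: str) -> str:
    """Return a file:// URI for checkpoint_path (hypothetical helper).

    Path.as_uri() raises ValueError for a relative path, so the path is
    resolved against the current working directory first.
    """
    return Path(checkpoint_path).resolve().as_uri()


# Example with a made-up relative checkpoint directory.
uri = to_reference_uri("checkpoints/iter_0000100")
print(uri)

# Without resolve(), a relative path cannot be turned into a file URI.
try:
    Path("checkpoints/iter_0000100").as_uri()
    relative_failed = False
except ValueError:
    relative_failed = True
```

The resulting string could then be passed as the first argument to `artifact.add_reference`. Whether the actual fix resolves the path this way or with `os.path.abspath` is not stated here.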
Thank you for providing this bug fix. We will review it as soon as possible.
Merged in https://github.com/NVIDIA/Megatron-LM/commit/afb755f548b48151a4408b0e9caf674b8349b589

Many thanks to the Infrastructure Center of Tencent WeChat's Technical Architecture Department (微信技术架构部-基础架构中心) for their contributions.

cc @shifangx