
Fix some bugs

Open Thaurun opened this issue 8 months ago • 2 comments

Bug Fixes Summary

  1. Bug 1: In the _allreduce_word_embedding_grads function, the word embedding may be frozen (not trainable). The function does not account for this case, which can cause the code to crash.

  2. Bug 2: _ParamAndGradBuffer fails to update param_start_index during initialization, potentially causing a parameter to be placed into multiple buckets. This triggers the assertion error:

    assert param not in param_gbuf_map, (
        "Param should not be in param_gbuf_map; each param only belongs "
        "to a single bucket."
    )
    
  3. Bug 3: In on_save_checkpoint_success, wandb's artifact.add_reference requires checkpoint_path to be an absolute path. Passing a relative path results in an error.
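
The guards behind Bug 1 and Bug 3 can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the merged patch: the helper names `has_grad_to_reduce` and `to_reference_uri` are hypothetical, the `main_grad`/`grad` attribute lookup is an assumption about how the gradient is stored, and the checkpoint directory name is made up for the example.

```python
import os
from pathlib import Path
from types import SimpleNamespace


def has_grad_to_reduce(weight):
    """Return True only if the parameter is trainable and holds a gradient.

    Hypothetical helper: a frozen embedding has nothing to all-reduce,
    so the caller should skip it instead of touching a missing gradient.
    """
    if not getattr(weight, "requires_grad", False):
        return False
    # Assumption: the gradient lives in `main_grad`, falling back to `grad`.
    grad = getattr(weight, "main_grad", None)
    if grad is None:
        grad = getattr(weight, "grad", None)
    return grad is not None


def to_reference_uri(checkpoint_path):
    """Convert a possibly relative checkpoint path to an absolute file:// URI."""
    # os.path.abspath resolves against the current working directory,
    # and Path.as_uri() only accepts absolute paths.
    return Path(os.path.abspath(checkpoint_path)).as_uri()


# Stand-ins for parameters; real code would pass the embedding weight tensor.
frozen = SimpleNamespace(requires_grad=False, grad=None)
trainable = SimpleNamespace(requires_grad=True, grad=[0.1, 0.2])

print(has_grad_to_reduce(frozen), has_grad_to_reduce(trainable))
print(to_reference_uri("checkpoints/iter_0000100"))
```

In this sketch, the result of `to_reference_uri` is what would be handed to `artifact.add_reference`, and the all-reduce would run only when `has_grad_to_reduce` returns True.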

Thaurun avatar Apr 17 '25 05:04 Thaurun

Thank you for providing this bug fix. We will review it as soon as possible.

xuwchen avatar Apr 23 '25 04:04 xuwchen

Merged in https://github.com/NVIDIA/Megatron-LM/commit/afb755f548b48151a4408b0e9caf674b8349b589. Many thanks to the Infrastructure Center of Tencent WeChat's Technical Architecture Department (微信技术架构部-基础架构中心) for their contributions. cc @shifangx

yanring avatar May 05 '25 09:05 yanring