
Fix regression in shard checkpoint loading in the AutoTP path caused by the removal of qkv_copy(), and add a UT case for shard checkpoint loading in AutoTP

Open sywangyi opened this issue 1 year ago • 1 comments

  1. I added a UT for shard loading in the AutoTP path because this code was not exercised in CI, so errors such as "qkv_copy() is replaced by strided_copy()" were not caught at merge time.
  2. The UT covers "bigscience/bloom-560m", "EleutherAI/gpt-j-6B", "EleutherAI/gpt-neo-125M", and "facebook/opt-125m". I also fixed the problem found in gpt-neo-125m and opt-125m.
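
For readers unfamiliar with why a strided copy matters when sharding a fused QKV weight, here is a minimal, self-contained sketch. It is not DeepSpeed's `strided_copy()` implementation. The helper names `strided_shard` and `naive_shard` are hypothetical, and the layout (Q, K, and V stacked as three equal blocks along the first dimension) is an assumption that does not hold for every model family.

```python
# Illustrative sketch only; NOT DeepSpeed's actual strided_copy().
# Assumption: the fused weight stacks num_splits equal blocks
# (e.g. [Q; K; V]) along dim 0. Helper names are hypothetical.

def strided_shard(fused_rows, num_splits, rank, world_size):
    """Take this rank's slice from each of the num_splits blocks."""
    total = len(fused_rows)
    if total % num_splits != 0:
        raise ValueError("row count must be divisible by num_splits")
    block = total // num_splits
    if block % world_size != 0:
        raise ValueError("block size must be divisible by world_size")
    per_rank = block // world_size
    shard = []
    for s in range(num_splits):
        start = s * block + rank * per_rank
        shard.extend(fused_rows[start:start + per_rank])
    return shard


def naive_shard(fused_rows, rank, world_size):
    """Contiguous split that ignores the Q/K/V block boundaries."""
    per_rank = len(fused_rows) // world_size
    return fused_rows[rank * per_rank:(rank + 1) * per_rank]


# Toy fused weight: 4 Q rows, 4 K rows, 4 V rows, labelled by name.
fused = [f"{name}{i}" for name in "QKV" for i in range(4)]

strided_rank0 = strided_shard(fused, num_splits=3, rank=0, world_size=2)
naive_rank0 = naive_shard(fused, rank=0, world_size=2)
```

With two ranks, the strided split gives rank 0 a share of each of Q, K, and V, whereas the contiguous split gives it all of Q, half of K, and none of V.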

sywangyi • May 05 '23 11:05

@tjruwase @delock @yao-matrix

sywangyi • May 05 '23 11:05

@tjruwase I added another commit in response to https://github.com/microsoft/DeepSpeed/commit/db26f8b41325be2a7f7af8b386b4e8951a5a76c9. The latest merged code assumes that only the KI path supports shard loading, but I have already added that support to the AutoTP path; see the usage in the UT.

sywangyi • May 10 '23 07:05