
Optimizer state loading fix for bitsandbytes 8-bit optimizers.

Open · TimDettmers opened this issue 3 years ago • 8 comments

This is a fix for loading bitsandbytes 8-bit optimizers with optimizer sharding. By default, DeepSpeed automatically shards all tensors held in the optimizer state. However, 8-bit optimizers also hold the quantization data type (qmap1 and qmap2 in the case of Adam), without which the 8-bit state is undefined. The full data-type tensor is needed to (de)quantize and restore the optimizer state; if this tensor is sharded, the optimizer state cannot be recovered.

This is a quick fix for the BigScience team. See this PR for more info. A more general fix would be to allow specifying in the config which tensor keys should not be sharded. Alternatively, in this case it would also suffice to disable sharding for tensors of 256 elements or fewer (the size of the 8-bit data type).

TimDettmers · Nov 22 '21 06:11
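
A minimal sketch of the size-based exclusion proposed above, assuming a hypothetical `flatten_and_shard` helper standing in for DeepSpeed's actual partitioning logic: any state tensor at or below the 256-element quantization-map size is kept whole on every rank.

```python
import torch

QMAP_SIZE = 256  # number of values representable by the 8-bit data type

def partition_optimizer_state(per_param_state: dict, flatten_and_shard):
    """Shard large optimizer-state tensors, but replicate tiny ones such as qmap1/qmap2."""
    sharded, replicated = {}, {}
    for key, value in per_param_state.items():
        if not torch.is_tensor(value) or value.numel() <= QMAP_SIZE:
            # keep scalars and the full quantization maps on every rank, otherwise
            # the 8-bit state cannot be dequantized when the checkpoint is loaded
            replicated[key] = value
        else:
            sharded[key] = flatten_and_shard(value)
    return sharded, replicated
```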

To make this generic, then, we would need some sort of user-configurable black-list of keys not to shard. Exact matching would probably be safer.

additional info: This is ZeRO-1!

@tjruwase, @jeffra

stas00 · Nov 22 '21 17:11
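
Purely as an illustration of the black-list idea, such a knob might look like the sketch below; the `optimizer_state_no_shard_keys` entry is a hypothetical name and does not exist in DeepSpeed's config schema.

```python
# Hypothetical config sketch: "optimizer_state_no_shard_keys" is an invented key
# for the proposed black-list; DeepSpeed does not currently support it.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 1,  # the setup discussed in this thread is ZeRO-1
        # exact-match optimizer-state keys that must never be sharded
        "optimizer_state_no_shard_keys": ["qmap1", "qmap2"],
    },
}
```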

To add a bit more context: I also tried to solve this on the bnb side by casting the tensor to a list before the optimizer is saved, but I could not figure out which hook triggers the tensor->list conversion. I expected DeepSpeed to call the optimizer's __getstate__ or state_dict(), but neither worked for me. Which methods of the optimizer get called when a sharded optimizer is saved to disk?

TimDettmers · Nov 22 '21 18:11
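
A sketch of the tensor->list workaround described above, built on top of the bitsandbytes `Adam8bit` optimizer; whether a sharding engine actually invokes these hooks when saving is exactly the open question in this comment.

```python
import torch
import bitsandbytes as bnb

class Adam8bitListQmaps(bnb.optim.Adam8bit):
    """Serialize qmap1/qmap2 as plain lists so a sharding engine would skip them."""

    def state_dict(self):
        sd = super().state_dict()
        state = {}
        for idx, per_param in sd["state"].items():
            per_param = dict(per_param)  # shallow copy; leave the live state untouched
            for key in ("qmap1", "qmap2"):
                if key in per_param and torch.is_tensor(per_param[key]):
                    per_param[key] = per_param[key].tolist()
            state[idx] = per_param
        return {"state": state, "param_groups": sd["param_groups"]}

    def load_state_dict(self, sd):
        state = {}
        for idx, per_param in sd["state"].items():
            per_param = dict(per_param)
            for key in ("qmap1", "qmap2"):
                if key in per_param and isinstance(per_param[key], list):
                    per_param[key] = torch.tensor(per_param[key])
            state[idx] = per_param
        super().load_state_dict({"state": state, "param_groups": sd["param_groups"]})
```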

I expected DeepSpeed to call the optimizer's __getstate__ or state_dict(), but neither worked for me. Which methods of the optimizer get called when a sharded optimizer is saved to disk?

For ZeRO we should be calling the optimizer's state_dict() method; this pending PR fixes that: https://github.com/microsoft/DeepSpeed/pull/1525. We've seen issues in certain cases when the optimizer's state_dict is not called like this. However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data-parallel (DP) size than the one used to save it. The code currently in master saves the checkpoint in an "elastic" way that strips the padding from the optimizer states so it can be restored with new DP sizes.

jeffra · Dec 01 '21 01:12
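
A conceptual sketch (not DeepSpeed's actual implementation) of the "elastic" behaviour described above: strip each rank's alignment padding when saving, so the flat partitions can later be re-joined and re-split for a different DP size.

```python
import torch

def save_elastic(flat_partition: torch.Tensor, num_real_elements: int) -> torch.Tensor:
    # drop the alignment padding so the saved state is DP-size agnostic
    return flat_partition[:num_real_elements].clone()

def repartition(unpadded_shards: list, new_dp_size: int) -> list:
    # re-join the stripped shards and split them evenly for the new DP size
    full = torch.cat(unpadded_shards)
    chunk = -(-full.numel() // new_dp_size)  # ceiling division
    padded = torch.nn.functional.pad(full, (0, chunk * new_dp_size - full.numel()))
    return list(padded.view(new_dp_size, chunk))
```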

However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data-parallel (DP) size than the one used to save it.

That would be a problem.

And this is precisely the problem with using BNB and DeepSpeed together at the moment: once I start training with BNB, I'm stuck with the same DP size until the end of the training.

stas00 · Dec 01 '21 01:12

I have an idea. Instead of carrying the code complexity needed to save an elastic checkpoint, why not make the checkpoint elastic after the fact instead?

That is, optimize/simplify the code to work with a fixed PP/DP/TP checkpoint, and have a mechanism to reconvert the degrees in the checkpoint as "middle-ware". We already have the foundation tools in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint, and I have already asked @tjruwase for a new feature to re-mold the checkpoint to a desired PP/DP/TP config (albeit only for weights; here we need optimizer states as well).

So the main difference I propose is that instead of saving the checkpoint in a generic form, we save it exactly as it is laid out, and then have a tool to prepare it for a new PP/DP/TP layout before launching from that newly converted checkpoint.

Besides, we already have an issue with reconfiguring both TP and PP degrees anyway, albeit PP should be easier to fix, I think.

stas00 · Dec 01 '21 08:12
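
Roughly, the workflow being proposed; `reshape_checkpoint` is a purely hypothetical stand-in for the conversion tool being asked for, not something that exists today.

```python
from pathlib import Path

def reshape_checkpoint(src: Path, dst: Path, tp: int, pp: int, dp: int) -> None:
    """Hypothetical placeholder: re-split weights *and* optimizer states
    (including the unsharded qmap tensors) for the new TP/PP/DP topology."""
    raise NotImplementedError("illustrative stand-in, not a real tool")

# 1. train and checkpoint with the exact layout used for training, e.g. TP=4/PP=8/DP=12
# 2. re-mold the checkpoint offline for the new cluster:
#        reshape_checkpoint(Path("ckpt/global_step1000"), Path("ckpt-reshaped"), tp=2, pp=4, dp=48)
# 3. relaunch training from "ckpt-reshaped" with the new degrees
```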

I was also thinking about another potential problem: the small quantization statistics are also split and communicated, which seems to make training slower compared to 32-bit Adam. One way to solve both this issue and the communication overhead is to have a size-dependent "blocklist" for optimizer states. Stas told me that a blocklist already exists for weights. Would that be a reasonable solution? With it, all the issues with BNB optimizers could be fixed in one go.

TimDettmers · Dec 01 '21 15:12

Stas told me that a blocklist already exists for weights.

It's not an explicit list of names, but rather a threshold by param size:

https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training

stage3_param_persistence_threshold (integer): Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages).

stas00 · Dec 01 '21 18:12
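
For reference, here is that existing knob expressed as a config dict; note that it lives under ZeRO-3 and applies to parameters, whereas this thread is asking for an analogous size threshold (or key black-list) on the optimizer-state side.

```python
# The existing size-based threshold applies to *parameters* only; there is no
# equivalent setting for optimizer-state tensors such as the 8-bit qmaps.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # do not partition parameters smaller than this many elements
        "stage3_param_persistence_threshold": 256,
    },
}
```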

Can one of the admins verify this patch?

rocm-mici · Jun 09 '22 20:06

just curious as to the status of this...

Thomas-MMJ · Oct 09 '22 19:10

Hi @stas00 and @TimDettmers - any idea if this is still needed? If so, happy to fix the conflicts and review/merge. Otherwise closing out some older PRs here. Thanks for any context you have!

loadams · Sep 14 '23 20:09

Logan, did DeepSpeed add a feature to allow users to configure a list of param names not to be sharded? If I'm not mistaken, based on Tim's OP, this is (was?) the main thing bitsandbytes needs in order to work with DeepSpeed.

Unless @TimDettmers has already figured out a different solution. It has been quite a while since this Issue was encountered.

stas00 · Sep 14 '23 20:09