DeepSpeed
Optimizer state loading fix for bitsandbytes 8-bit optimizers.
This is a fix for loading a bitsandbytes 8-bit optimizer with optimizer sharding. By default, DeepSpeed automatically shards all tensors held by the optimizer state. However, 8-bit optimizers also hold the quantization data type (qmap1 and qmap2 in the case of Adam), without which the 8-bit state is undefined. The full data type is needed to perform quantization and restore the optimizer state, and if this tensor is sharded, the optimizer state cannot be recovered.
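For illustration only (not part of the fix), here is a rough sketch of what the per-parameter state of a bitsandbytes 8-bit Adam looks like; qmap1/qmap2 are the 256-entry quantization maps mentioned above. Exact key names can vary across bitsandbytes versions.

```python
# Sketch: inspect the per-parameter state of a bitsandbytes 8-bit Adam.
# Requires a CUDA device; key names may differ between bnb versions.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
opt = bnb.optim.Adam8bit(model.parameters())

model(torch.randn(4, 1024, device="cuda")).sum().backward()
opt.step()

# 'state1'/'state2' are the quantized 8-bit moments; 'qmap1'/'qmap2' are the
# 256-entry quantization maps. If the qmap tensors get sharded across ranks,
# the 8-bit state can no longer be dequantized on load.
state = opt.state[next(model.parameters())]
for key, value in state.items():
    if torch.is_tensor(value):
        print(key, tuple(value.shape), value.dtype)
```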
This is a quick fix for the BigScience team. See this PR for more info. A more general fix would be the ability to specify in the config which tensor keys should not be sharded. Alternatively, in this case, it would also suffice to disable sharding for tensors of 256 elements or fewer (the size of the 8-bit data type).
To make this generic, we would need some sort of user-configurable blocklist of keys not to shard. Exact key matching would probably be safer.
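A minimal sketch of the proposed behavior, combining both ideas; neither `SHARD_BLOCKLIST` nor `should_shard` exists in DeepSpeed, this only illustrates the idea:

```python
# Hypothetical: skip sharding for optimizer-state tensors whose key is
# blocklisted (exact match) or whose size is at or below a threshold.
SHARD_BLOCKLIST = {"qmap1", "qmap2"}   # user-configurable, exact-match keys
MIN_SHARD_NUMEL = 256                  # e.g. size of the 8-bit quantization map

def should_shard(key, tensor):
    if key in SHARD_BLOCKLIST:
        return False
    if tensor.numel() <= MIN_SHARD_NUMEL:
        return False
    return True
```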
additional info: This is ZeRO-1!
@tjruwase, @jeffra
To add a bit more context: I also tried to solve this on the bnb side by casting the tensor to a list before the optimizer is saved, but I could not figure out which hook triggers the tensor->list conversion. I expected DeepSpeed to call the optimizer's `__getstate__` or state_dict(), but neither worked for me. What methods of the optimizer get called when one saves a sharded optimizer to disk?
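For concreteness, this is roughly what the attempted bnb-side workaround looks like; whether DeepSpeed actually invokes state_dict() on the wrapped optimizer when saving a sharded checkpoint is exactly the open question. The subclass name is made up for illustration.

```python
# Sketch of the attempted workaround: convert the qmap tensors to plain lists
# before the state dict is serialized, and back to tensors on load.
import torch
import bitsandbytes as bnb

class Adam8bitListQmap(bnb.optim.Adam8bit):
    def state_dict(self):
        sd = super().state_dict()
        # copy the per-param state dicts so the live optimizer state is untouched
        sd["state"] = {idx: dict(s) for idx, s in sd["state"].items()}
        for s in sd["state"].values():
            for key in ("qmap1", "qmap2"):
                if key in s and torch.is_tensor(s[key]):
                    s[key] = s[key].tolist()  # plain lists would not be sharded
        return sd

    def load_state_dict(self, state_dict):
        for s in state_dict["state"].values():
            for key in ("qmap1", "qmap2"):
                if key in s and isinstance(s[key], list):
                    s[key] = torch.tensor(s[key])
        super().load_state_dict(state_dict)
```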
> I expected DeepSpeed to call the optimizer's `__getstate__` or state_dict(), but neither worked for me. What methods of the optimizer get called when one saves a sharded optimizer to disk?
For ZeRO we should be calling the optimizer's state_dict() method; this pending PR fixes that issue: https://github.com/microsoft/DeepSpeed/pull/1525. We've seen problems in certain cases caused by not calling the optimizer's state_dict() like this. However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data parallel (DP) size than the one used to save it. The code currently in master saves the checkpoint in an "elastic" way that strips the padding from the optimizer states so it can be restored with new DP sizes.
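A rough illustration of the "elastic" idea (not DeepSpeed's actual checkpoint code): each rank's flattened optimizer-state partition may carry alignment padding, and stripping it before saving is what lets a different DP size re-partition the state on load.

```python
# Illustration only: why stripping padding makes the saved state DP-size agnostic.
import torch

def save_elastic(partition: torch.Tensor, num_real_elements: int) -> torch.Tensor:
    # drop alignment padding so the saved state does not encode the old DP size
    return partition[:num_real_elements].clone()

def load_with_new_dp(full_state: torch.Tensor, dp_size: int, rank: int) -> torch.Tensor:
    # re-split the unpadded state for a (possibly different) DP size
    chunk = (full_state.numel() + dp_size - 1) // dp_size
    return full_state[rank * chunk : (rank + 1) * chunk]
```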
> However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data parallel (DP) size than the one used to save it.
That would be a problem.
And this is precisely the problem with using BNB and DeepSpeed together at the moment. Once I start training with BNB, I'm stuck with the same DP degree until the end of training.
I have an idea. Instead of adding code complexity to make the checkpoint itself elastic, why not make the checkpoint conversion elastic instead?
That is, optimize/simplify the code to work with a fixed PP/DP/TP checkpoint and add a mechanism that reconverts the degrees in the checkpoint as "middleware". We already have the foundation tools in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint, and I have already asked @tjruwase for a new feature to re-mold the checkpoint to a desired PP/DP/TP config (albeit only for weights; here we need optimizer states as well).
So the main difference I propose is that instead of saving the checkpoint in a generic way, we save it exactly as laid out, then have a tool that prepares it for a new PP/DP/TP layout, and then launch from that newly converted checkpoint.
Besides, we already have an issue with reconfiguring TP and PP degrees anyway, although PP should be easier to fix, I think.
I was also thinking about another potential problem: the small quantization statistics are also split and communicated, which seems to make training slower compared to 32-bit Adam. One way to solve both this and the communication issue is a size-dependent "blocklist" for optimizer states. Stas told me that such a blocklist already exists for weights. Would that be a reasonable solution? With it, all the issues with BNB optimizers could be fixed in one go.
> Stas told me that such a blocklist already exists for weights.
It's not an explicit listing of names; it's by param size:
https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
> stage3_param_persistence_threshold (integer): Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages).
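For reference, this is how that weight-side threshold is set in a ZeRO-3 DeepSpeed config; the value shown is just an example, and the proposal above would be an analogous size threshold for optimizer-state tensors.

```python
# Example DeepSpeed config fragment (can equivalently live in ds_config.json).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        # parameters at or below this element count are kept whole on every
        # rank instead of being partitioned (example value, in elements)
        "stage3_param_persistence_threshold": 100000,
    },
    "fp16": {"enabled": True},
}
# passed e.g. as: deepspeed.initialize(model=model, config=ds_config, ...)
```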
Can one of the admins verify this patch?
just curious as to the status of this...
Hi @stas00 and @TimDettmers - any idea if this is still needed? If so, happy to fix the conflicts and review/merge. Otherwise closing out some older PRs here. Thanks for any context you have!
Logan, did DeepSpeed add a feature to allow users to configure a list of param names not to be sharded? If I'm not mistaken, based on Tim's OP, this is (was?) the main need for bitsandbytes to work with DeepSpeed.
Unless @TimDettmers has already figured out a different solution. It has been quite a while since this issue was encountered.