Results: 128 issues by Stas Bekman

https://github.com/microsoft/DeepSpeed/blob/43d58d99eba800a809f16efa30a90eb25f7a7003/deepspeed/ops/adam/fused_adam.py#L4-L5 needs to be synced with https://github.com/NVIDIA/apex/blob/master/apex/optimizers/fused_adam.py: since it was forked, several fixes were made and BF16 support was added (https://github.com/NVIDIA/apex/commits/master/apex/optimizers/fused_adam.py). The fork happened at https://github.com/NVIDIA/apex/commit/a109f85 on Sep 28, 2020...

enhancement

Given a model size and a number of gpus, how can we calculate what throughput the interconnect network needs in order to handle ZeRO-3 traffic? Is 100Gbps enough, or does...
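A back-of-the-envelope sketch (my numbers, not from the issue), assuming the ZeRO paper's estimate that ZeRO-3 moves roughly 3x the model size per step (two parameter all-gathers plus one gradient reduce-scatter):

```python
def min_interconnect_gbps(num_params: float, bytes_per_param: int, step_time_s: float) -> float:
    # ZeRO-3 per-step traffic ~= 3 x model size: 2 param all-gathers + 1 grad reduce-scatter
    volume_bits = 3 * num_params * bytes_per_param * 8
    return volume_bits / step_time_s / 1e9  # Gbps needed so comm fits inside one step

# e.g. a 10B-param fp16 model with a 1s step would need ~480 Gbps,
# so a 100 Gbps link would make communication the bottleneck
print(min_interconnect_gbps(10e9, 2, 1.0))
```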

As originally reported in https://github.com/huggingface/transformers/issues/12680, a user can easily train and save checkpoints, but may not have enough RAM to subsequently load that same checkpoint into memory. One discussed...
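One possible mitigation (a sketch of mine, not necessarily the option discussed in the truncated text), assuming PyTorch >= 2.1 where `torch.load` grew an `mmap` flag:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for the real model; "checkpoint.bin" is a hypothetical path

# mmap the checkpoint so tensor data is paged in from disk lazily instead of
# materializing the entire state_dict in RAM before loading
state_dict = torch.load("checkpoint.bin", map_location="cpu", mmap=True)
model.load_state_dict(state_dict)
```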

**Describe the bug** This code has an issue when it is run under a non-fp16 regime. https://github.com/microsoft/DeepSpeed/blob/da84e60d98d2e90f6f2094a219c98c8b41582eb9/deepspeed/runtime/zero/stage3.py#L1837-L1842 There are no scalers under bf16/fp32, so this warning is alarming to see (a possible guard is sketched below the labels) -...

bug
training
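A sketch of the guard the report seems to call for (`maybe_warn_overflow`, `prev_scale`, and `new_scale` are my names, not stage3.py's): only warn when fp16 dynamic loss scaling is actually in play.

```python
import logging
import torch

logger = logging.getLogger(__name__)

def maybe_warn_overflow(dtype: torch.dtype, prev_scale: float, new_scale: float) -> None:
    # there is no loss scaler under bf16/fp32, so the warning only makes sense for fp16
    if dtype is torch.float16:
        logger.warning("OVERFLOW! Skipping step. Attempted loss scale: %s, reducing to %s",
                       prev_scale, new_scale)
```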

As reported in https://github.com/huggingface/transformers/issues/22179, the trainer code doesn't handle sharded models correctly when reporting the "Number of trainable parameters" - I'm not sure if FSDP models have the same...
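A hedged sketch of a counting helper (`count_trainable_params` is my name): under ZeRO-3 each local tensor is a 0-size placeholder, and DeepSpeed records the true size in `ds_numel`.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    # prefer DeepSpeed's ds_numel on partitioned params; fall back to numel()
    # for ordinary (non-partitioned) parameters
    return sum(getattr(p, "ds_numel", p.numel())
               for p in model.parameters() if p.requires_grad)
```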

As discussed in https://github.com/huggingface/transformers/issues/22231, `generate` under deepspeed zero3 may hang when different gpus receive different input streams. It's documented [here](https://huggingface.co/docs/transformers/main/main_classes/deepspeed#custom-deepspeed-zero-inference) and in the API docs that `synced_gpus=True` is required but...
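A minimal usage sketch of the documented requirement (the model/tokenizer choice is illustrative, and the model is assumed to be wrapped by DeepSpeed elsewhere):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello", return_tensors="pt")

# under zero3, ranks that finish early must keep running forward passes so the
# other ranks can still all-gather the sharded weights; synced_gpus arranges that
out = model.generate(**inputs, synced_gpus=True, max_new_tokens=20)
```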

When `zero.Init` is used, how can a user override the `timeout` dist init arg? The dist is inited here: https://github.com/microsoft/DeepSpeed/blob/41a9bde14c808a75452baaa2609681316fc6912b/deepspeed/runtime/zero/partition_parameters.py#L654-L655 Of course, the user could do something like this...
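One plausible shape of that workaround (a sketch, assuming `deepspeed.init_distributed` accepts a `timeout`; `MyModel` is a hypothetical stand-in): pre-initialize the process group with the desired timeout so the init inside `zero.Init` becomes a no-op.

```python
from datetime import timedelta

import deepspeed
import torch.nn as nn

class MyModel(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

# init the process group ourselves with a custom timeout; the later init
# inside zero.Init should then find dist already initialized and skip it
deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=2))

with deepspeed.zero.Init():
    model = MyModel()
```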

So `fp16.initial_scale_power` leads to dynamic scaling, except it should probably happen only until the right range has been found, and never re-check/go back to scaling again once the right scale has...
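For reference, a simplified sketch of how dynamic loss scaling behaves (illustrative names, not DeepSpeed's code): the scale halves on overflow and doubles again after every `loss_scale_window` clean steps, i.e. it keeps re-probing indefinitely, which is the behavior being questioned.

```python
scale = 2.0 ** 16           # set via fp16.initial_scale_power
window = 1000               # set via fp16.loss_scale_window
good_steps = 0

def update_scale(overflowed: bool) -> None:
    global scale, good_steps
    if overflowed:
        scale /= 2          # back off when grads overflow
        good_steps = 0
    else:
        good_steps += 1
        if good_steps % window == 0:
            scale *= 2      # the periodic re-probe the issue questions
```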

Fixes: https://github.com/microsoft/DeepSpeed/pull/3033 The algorithm to figure out shared params added in https://github.com/microsoft/DeepSpeed/pull/3033 doesn't work, as all tensors are placeholders with size 0 and their `data_ptr()` is always 0, and therefore...
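A minimal repro sketch of the failure mode described, in plain PyTorch outside DeepSpeed:

```python
import torch

# 0-size placeholder tensors have no storage allocation, so they all report
# data_ptr() == 0 and data_ptr()-based dedup sees every param as "shared"
a, b = torch.empty(0), torch.empty(0)
print(a.data_ptr(), b.data_ptr())  # 0 0 -- indistinguishable by pointer
```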

Requesting to implement `set_verbosity` as in `transformers` and `datasets`, so that the user can control the level of verbosity from `accelerate`, e.g. https://github.com/huggingface/transformers/blob/8c5026628a29a89cf122bc1c95cff8101f78c7c0/src/transformers/utils/logging.py#L149-L165 (a sketch follows the labels below). From previous experience of doing that, the...

enhancement
feature request
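A minimal sketch of what the requested API could look like, modeled on the linked `transformers/utils/logging.py` helpers (the `accelerate` logger name is an assumption):

```python
import logging

_logger = logging.getLogger("accelerate")

def set_verbosity(verbosity: int) -> None:
    """Set the verbosity level for the library's root logger."""
    _logger.setLevel(verbosity)

def set_verbosity_debug() -> None:
    set_verbosity(logging.DEBUG)

def set_verbosity_info() -> None:
    set_verbosity(logging.INFO)

def set_verbosity_warning() -> None:
    set_verbosity(logging.WARNING)

def set_verbosity_error() -> None:
    set_verbosity(logging.ERROR)
```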