Stas Bekman issues

Results: 128 issues by Stas Bekman

[REQUEST] universal checkpoint for ZeRO - 1,2,3

**Is your feature request related to a problem? Please describe.** I think we now have all the components ready to do universal checkpoint in ZeRO - 1,2,3, like we had...

enhancement

[BUG] manual build isn't installing requirements

**Describe the bug** I made a fresh conda env and tried to manually build DeepSpeed, and it failed: ``` $ DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v...

bug

[BUG] a huge memory leak when using `register_full_backward_hook`

**Describe the bug** When trying to use `register_full_backward_hook` in Megatron-Deepspeed, I get a huge memory leak. I'm reporting it here, since when I turn off deepspeed, there is no leak....

bug

[BUG] `reduce_bucket_size` isn't validated against the model size

**Describe the bug** When a model is small and the `reduce_bucket_size` is larger, this happens: ``` File "/mnt/nvme0/code/huggingface/accelerate-master/src/accelerate/utils/deepspeed.py", line 167, in backward self.engine.backward(loss, **kwargs) File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn...

bug

training

[BUG] ZeRO++ is broken: `zero_quantized_weights` fails

**Describe the bug** Adding `"zero_quantized_weights": true,` leads to a crash: ``` 35:1]: warnings.warn( [35:1]:Traceback (most recent call last): [35:1]: File "/data/env/lib/repos/retro-llama/tr042-dawn-llama-2/core/dawn/dawn/training/main.py", line 243, in [35:1]: train_logs = trainer.train( [35:1]: File...

bug

training

[BUG] convergence issues with `zero_hpz_partition_size`

**Describe the bug** I wanted to try ZeRO++ and found that using `zero_hpz_partition_size` has convergence issues. The current 0.12.5 version doesn't converge at all. Since then I tried https://github.com/microsoft/DeepSpeed/tree/HeyangQin/mixz_hpz_fix and...

bug

training

[docs] explain how to use `torchrun` in a SLURM environment

PL kept on failing to bind to a port in a SLURM environment when I tried switching to `torchrun`. I need the latter so that I could use `--role \$(hostname...

docs

community

warnings: resuming before epoch end is absolutely normal for long trainings

### Description & Motivation Forking from https://github.com/Lightning-AI/lightning/issues/18723#issuecomment-1751307472 where we were discussing various warnings that don't necessarily apply to all. This issue discusses this warning: ``` [...]python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:151: UserWarning: You're resuming from...

feature

data handling

integrate `load_from_disk` into `load_dataset`

**Is your feature request related to a problem? Please describe.** Is it possible to make `load_dataset` more universal similar to `from_pretrained` in `transformers` so that it can handle the hub,...

enhancement

`datasets/downloads` cleanup tool

### Feature request Splitting off https://github.com/huggingface/huggingface_hub/issues/1997 - currently `huggingface-cli delete-cache` doesn't take care of cleaning `datasets` temp files, e.g. I discovered millions of files under the `datasets/downloads` cache, I had...

enhancement