Stas Bekman

Results: 128 issues by Stas Bekman

This is a rerun of the Adam torch vs. apex vs. HF vs. adafactor benchmarks ([RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1005219385), [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005220263)), now with BNB's 8-bit Adam optimizer added; the software has probably improved/changed over the past 14 months...

Benchmarks
Performance

This issue is to document the important `transformers` benchmarks in one place, so that they are easy to find. To add a new benchmark entry, post it in an Issue...

Benchmarks
WIP

Add a section on Activation Checkpointing. Even though we don't support the Deepspeed Activation Checkpointing API, document it nevertheless and clarify what's what to help the user achieve clarity and make...

Please see https://github.com/huggingface/transformers/issues/22082 for the analysis printout of the problem. But basically, we have a bug in the grad accum machinery: when `steps_in_epoch % gradient_accumulation_steps != 0` we always check for...
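A hypothetical sketch of the boundary condition (not the actual Trainer code, which is linked in the issue above): if the only trigger for an optimizer step is `step % gradient_accumulation_steps == 0`, the leftover micro-batches at the end of an epoch never produce a step when `steps_in_epoch % gradient_accumulation_steps != 0`.

```python
# Illustrative only: count optimizer steps per epoch under two policies.
# A naive modulo check drops the trailing partial accumulation; checking
# for the epoch boundary as well keeps those gradients.

def optimizer_steps_naive(steps_in_epoch, gas):
    """Step only when the micro-batch count divides evenly."""
    return sum(1 for step in range(1, steps_in_epoch + 1) if step % gas == 0)

def optimizer_steps_fixed(steps_in_epoch, gas):
    """Also step at the epoch boundary so leftover grads are not lost."""
    steps = 0
    for step in range(1, steps_in_epoch + 1):
        if step % gas == 0 or step == steps_in_epoch:
            steps += 1
    return steps

print(optimizer_steps_naive(10, 4))  # 2 -> the last 2 micro-batches are dropped
print(optimizer_steps_fixed(10, 4))  # 3 -> the partial accumulation still steps
```

With `steps_in_epoch=10` and `gradient_accumulation_steps=4`, the naive check steps only at micro-batches 4 and 8, silently discarding the gradients of batches 9 and 10.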

Part 2 of https://github.com/huggingface/transformers/pull/22043, but we can't merge it until `deepspeed==0.8.3` is released. This PR documents the new feature and bumps the minimum deepspeed version. **XXX: DO NOT MERGE UNTIL...

https://github.com/huggingface/transformers/pull/22098 fixed the issue with GAS>1 at the epoch boundary. The same bug will still happen at the resume boundary, since `total_batched_samples` is currently reset to 0, so we need to save...
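A hedged sketch of why the running counter has to survive a checkpoint resume: if `total_batched_samples` restarts at 0 while an accumulation window was only partially filled, the `% gradient_accumulation_steps` check fires at the wrong micro-batches after resume. The counter name mirrors the issue text; the loop is illustrative only.

```python
# Illustrative only: where optimizer steps land after resume, depending on
# whether the running sample counter was checkpointed or reset to 0.

def step_points(start_counter, n_batches, gas):
    """Return 0-based post-resume batch indices where we'd take a step."""
    points = []
    counter = start_counter
    for i in range(n_batches):
        counter += 1
        if counter % gas == 0:
            points.append(i)
    return points

# Suppose training stopped after 6 micro-batches with gas=4, i.e. with
# 2 gradients already accumulated toward the next step.
print(step_points(6, 8, 4))  # counter saved/restored: steps at batches 1, 5
print(step_points(0, 8, 4))  # counter reset to 0: steps drift to 3, 7
```

With the counter restored, the next step correctly lands 2 batches after resume; with it reset, every subsequent step is shifted and the interrupted window is silently stretched.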

Change: https://pytorch.org/docs/2.0/generated/torch.compile.html?highlight=torch+compile#torch.compile to: https://pytorch.org/docs/stable/generated/torch.compile.html?highlight=torch+compile#torch.compile once the latter doc appears after the pt-2.0 release, for the trainer code here: https://github.com/huggingface/transformers/pull/22140. Actually, it looks like all the good stuff is at https://pytorch.org/docs/master/dynamo/index.html -...

**Describe the bug** Hmm, I thought I fixed the leak here https://github.com/microsoft/DeepSpeed/pull/2665 but there is one more of a similar nature, where the model gets gathered, but not ungathered when...

bug
training

This PR fixes:
```
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be...
```
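A sketch of the migration pattern such a fix typically follows: prefer the new public API name and fall back to the private one on older PyTorch releases. `SimpleNamespace` stands in for `torch.distributed` so the example runs without a distributed setup; only the two function names come from the warnings above, everything else is hypothetical.

```python
# Illustrative only: resolve the public all-gather if the installed
# version provides it, otherwise fall back to the deprecated private name.
from types import SimpleNamespace

def pick_all_gather(dist_module):
    """Return dist_module.all_gather_into_tensor if present, else the
    deprecated dist_module._all_gather_base."""
    fn = getattr(dist_module, "all_gather_into_tensor", None)
    if fn is None:
        fn = dist_module._all_gather_base  # private, deprecated in pt-1.13
    return fn

new_dist = SimpleNamespace(all_gather_into_tensor=lambda: "new")
old_dist = SimpleNamespace(_all_gather_base=lambda: "old")

print(pick_all_gather(new_dist)())  # "new"
print(pick_all_gather(old_dist)())  # "old"
```

The same `getattr`-with-fallback shape works for `_reduce_scatter_base` vs. its public replacement.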

Let's use this issue to gather instructions on how to profile one's CPU↔NVMe setup. (@tjruwase and I have been editing this post.) You need to do this on every new...
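As a very rough, hedged stand-in for the proper profiling instructions being gathered above: time a sequential write and read of a temp file and convert to MB/s. Real NVMe profiling should use a dedicated I/O benchmark with direct I/O; this sketch only shows the arithmetic (bytes / seconds → MB/s), and the 8 MiB size is an arbitrary choice.

```python
# Illustrative only: crude sequential write/read throughput of the temp dir.
import os
import tempfile
import time

def rough_throughput_mb_s(size_bytes=8 * 1024 * 1024):
    data = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        t0 = time.perf_counter()
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the write to actually hit the device
        write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        read_back = f.read()
    read_s = time.perf_counter() - t0
    os.unlink(path)
    mb = size_bytes / 2**20
    return mb / write_s, mb / read_s, read_back == data

w, r, ok = rough_throughput_mb_s()
print(f"write ~{w:.0f} MB/s, read ~{r:.0f} MB/s, verified={ok}")
```

Note the read number is inflated by the page cache; a real benchmark has to bypass or drop caches to measure the device itself.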