Stas Bekman

Results: 128 issues by Stas Bekman

This is a rerun of the Adam torch vs. apex vs. HF vs. adafactor benchmarks ([RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1005219385), [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005220263)), now with BNB's 8-bit Adam optimizer added; the software has probably improved/changed over the past 14 months...

Benchmarks
Performance

This issue is to document the important `transformers` benchmarks in one place, so that they are easy to find. To add a new benchmark entry, post it in an Issue...

Benchmarks
WIP

Add a section on Activation Checkpointing. Even though we don't support the Deepspeed Activation Checkpointing API, document it nevertheless and clarify what's what to help the user achieve clarity and make...

Please see https://github.com/huggingface/transformers/issues/22082 for the analysis printout of the problem. But basically, we have a bug in the grad accum machinery: when `steps_in_epoch % gradient_accumulation_steps != 0` we always check for...
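A hypothetical sketch of the boundary condition (not the actual Trainer code, which is linked in the issue above): if the only trigger for an optimizer step is `step % gradient_accumulation_steps == 0`, the leftover micro-batches at the end of an epoch never produce a step when `steps_in_epoch % gradient_accumulation_steps != 0`.

```python
# Illustrative only: count optimizer steps per epoch under two policies.
# A naive modulo check drops the trailing partial accumulation; checking
# for the epoch boundary as well keeps those gradients.

def optimizer_steps_naive(steps_in_epoch, gas):
    """Step only when the micro-batch count divides evenly."""
    return sum(1 for step in range(1, steps_in_epoch + 1) if step % gas == 0)

def optimizer_steps_fixed(steps_in_epoch, gas):
    """Also step at the epoch boundary so leftover grads are not lost."""
    steps = 0
    for step in range(1, steps_in_epoch + 1):
        if step % gas == 0 or step == steps_in_epoch:
            steps += 1
    return steps

print(optimizer_steps_naive(10, 4))  # 2 -> the last 2 micro-batches are dropped
print(optimizer_steps_fixed(10, 4))  # 3 -> the partial accumulation still steps
```

With `steps_in_epoch=10` and `gradient_accumulation_steps=4`, the naive check steps only at micro-batches 4 and 8, silently discarding the gradients of batches 9 and 10.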

Part 2 of https://github.com/huggingface/transformers/pull/22043, but we can't merge it until `deepspeed==0.8.3` is released. This PR documents the new feature and bumps the minimum deepspeed version. **XXX: DO NOT MERGE UNTIL...

https://github.com/huggingface/transformers/pull/22098 fixed the issue with GAS>1 at the epoch boundary. The same bug will still happen at the resume boundary, since `total_batched_samples` is currently reset to 0, so we need to save...
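A hedged sketch of why the running counter has to survive a checkpoint resume: if `total_batched_samples` restarts at 0 while an accumulation window was only partially filled, the `% gradient_accumulation_steps` check fires at the wrong micro-batches after resume. The counter name mirrors the issue text; the loop is illustrative only.

```python
# Illustrative only: where optimizer steps land after resume, depending on
# whether the running sample counter was checkpointed or reset to 0.

def step_points(start_counter, n_batches, gas):
    """Return 0-based post-resume batch indices where we'd take a step."""
    points = []
    counter = start_counter
    for i in range(n_batches):
        counter += 1
        if counter % gas == 0:
            points.append(i)
    return points

# Suppose training stopped after 6 micro-batches with gas=4, i.e. with
# 2 gradients already accumulated toward the next step.
print(step_points(6, 8, 4))  # counter saved/restored: steps at batches 1, 5
print(step_points(0, 8, 4))  # counter reset to 0: steps drift to 3, 7
```

With the counter restored, the next step correctly lands 2 batches after resume; with it reset, every subsequent step is shifted and the interrupted window is silently stretched.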

Change: https://pytorch.org/docs/2.0/generated/torch.compile.html?highlight=torch+compile#torch.compile to: https://pytorch.org/docs/stable/generated/torch.compile.html?highlight=torch+compile#torch.compile once the latter doc appears after the pt-2.0 release, for the trainer code here: https://github.com/huggingface/transformers/pull/22140. Actually, it looks like all the good stuff is at https://pytorch.org/docs/master/dynamo/index.html -...

**Describe the bug** Hmm, I thought I fixed the leak here https://github.com/microsoft/DeepSpeed/pull/2665 but there is one more of a similar nature, where the model gets gathered, but not ungathered when...

bug
training

This PR fixes:
```
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:2849: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be...
```
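A sketch of the migration pattern such a fix typically follows: prefer the new public API name and fall back to the private one on older PyTorch releases. `SimpleNamespace` stands in for `torch.distributed` so the example runs without a distributed setup; only the two function names come from the warnings above, everything else is hypothetical.

```python
# Illustrative only: resolve the public all-gather if the installed
# version provides it, otherwise fall back to the deprecated private name.
from types import SimpleNamespace

def pick_all_gather(dist_module):
    """Return dist_module.all_gather_into_tensor if present, else the
    deprecated dist_module._all_gather_base."""
    fn = getattr(dist_module, "all_gather_into_tensor", None)
    if fn is None:
        fn = dist_module._all_gather_base  # private, deprecated in pt-1.13
    return fn

new_dist = SimpleNamespace(all_gather_into_tensor=lambda: "new")
old_dist = SimpleNamespace(_all_gather_base=lambda: "old")

print(pick_all_gather(new_dist)())  # "new"
print(pick_all_gather(old_dist)())  # "old"
```

The same `getattr`-with-fallback shape works for `_reduce_scatter_base` vs. its public replacement.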

Let's use this issue to gather instructions on how to profile one's CPU↔NVMe setup. (@tjruwase and I have been editing this post.) You need to do this on every new...
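As a very rough, hedged stand-in for the proper profiling instructions being gathered above: time a sequential write and read of a temp file and convert to MB/s. Real NVMe profiling should use a dedicated I/O benchmark with direct I/O; this sketch only shows the arithmetic (bytes / seconds → MB/s), and the 8 MiB size is an arbitrary choice.

```python
# Illustrative only: crude sequential write/read throughput of the temp dir.
import os
import tempfile
import time

def rough_throughput_mb_s(size_bytes=8 * 1024 * 1024):
    data = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        t0 = time.perf_counter()
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force the write to actually hit the device
        write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        read_back = f.read()
    read_s = time.perf_counter() - t0
    os.unlink(path)
    mb = size_bytes / 2**20
    return mb / write_s, mb / read_s, read_back == data

w, r, ok = rough_throughput_mb_s()
print(f"write ~{w:.0f} MB/s, read ~{r:.0f} MB/s, verified={ok}")
```

Note the read number is inflated by the page cache; a real benchmark has to bypass or drop caches to measure the device itself.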