BugReporterZ

26 comments by BugReporterZ

@gardner That appears to fix axolotl failing to install and run in my case, but there are still issues with training: memory usage seems unusually high compared to...

Reverting to a mid-December axolotl commit (`5f79b82`, though I haven't investigated exactly when the issues began), reinstalling packages, then uninstalling `flash-attn` and doing `pip install flash-attn==2.3.2` fixes the issue. Training Mistral-7B...
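
For reference, the workaround spelled out as shell commands; the editable-install step is an assumption about how the packages were installed, so adapt it to your setup:

```sh
cd axolotl
git checkout 5f79b82            # mid-December commit mentioned above
pip install -e .                # reinstall packages (assumed editable install)
pip uninstall -y flash-attn
pip install flash-attn==2.3.2   # pin the working version
```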

The increased VRAM usage could possibly be related to https://github.com/OpenAccess-AI-Collective/axolotl/issues/1127

I tracked the issue down to `flash-attn` from `pip`. Version 2.3.2 works; the newer version pinned in `requirements.txt` (2.3.3) causes problems. At the moment I'm on torch 2.0.1, though.

Thanks for replying! Great to learn that there are no inherent issues preventing FlashAttention from being combined with QLoRA. With the latest FlashAttention2 promising even further performance improvements, and given that...

Perhaps some of the code from [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) could be used. It's a trainer that supports QLoRA and various attention mechanisms, including FlashAttention. I haven't been able to make FlashAttention work...

## Explanation

Here is an explanation of what the modified code is supposed to do.

1. Calculate the set of candidate tokens exactly like in the original Typical_p algorithm;
2. ...
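
For context, a minimal sketch of what step 1 (the original Typical_p candidate selection) typically looks like; the function name `typical_candidates` and its exact signature are illustrative, not the actual aphrodite-engine code:

```python
import torch

def typical_candidates(logits: torch.Tensor, typical_p: float = 0.9) -> torch.Tensor:
    """Original Typical_p selection: return a boolean mask of candidate tokens."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Entropy of the full distribution and each token's surprisal.
    entropy = -(probs * log_probs).sum(dim=-1, keepdim=True)
    deviation = (-log_probs - entropy).abs()  # distance from the "typical" surprisal
    # Sort by deviation (most typical first) and keep the smallest set of
    # tokens whose cumulative probability reaches typical_p.
    sorted_dev, sorted_idx = torch.sort(deviation, dim=-1)
    sorted_probs = probs.gather(-1, sorted_idx)
    cum_probs = sorted_probs.cumsum(dim=-1)
    keep_sorted = (cum_probs - sorted_probs) < typical_p  # always keeps >= 1 token
    return torch.zeros_like(logits, dtype=torch.bool).scatter(-1, sorted_idx, keep_sorted)
```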

Hopefully this graphical explanation further clarifies how the modified algorithm is supposed to work. ![image](https://github.com/PygmalionAI/aphrodite-engine/assets/26941368/83b4dd29-c817-4e26-9891-8c8fd8fc7c86)

Further testing over the past few days has revealed that,

> [...] if the token having $-D$ deviation is an acceptable choice, then the one having $+D$ deviation should also be....
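
One way to encode that symmetry argument is to keep the deviation signed and treat the positive side (tokens rarer than typical) no more harshly than the negative side, e.g. via the Lambda scaling the next comment refers to. A sketch, with `lam` and `signed_dev` as hypothetical names, replacing only the `deviation` line in the baseline sketch above:

```python
# Signed deviation: negative = more probable than typical, positive = rarer.
signed_dev = -log_probs - entropy
# Scale only the positive side by lam (lam <= 1 loosens it), so a token at
# +D is judged at most as atypical as one at -D.
deviation = torch.where(signed_dev > 0, lam * signed_dev, -signed_dev)
```

The rest of the selection (sorting and the cumulative-probability cutoff) proceeds unchanged.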

A different strategy, requiring minimal modifications to the above, could be: instead of scaling the positive deviations by a Lambda factor, **shifting** all the deviations by a small Delta value....
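
Under one plausible reading of this variant, the center of the acceptance window is moved by Delta toward rarer tokens before the absolute value is taken; a sketch with a hypothetical `delta` parameter, again replacing only the `deviation` line:

```python
# Shift the whole deviation axis by delta, then take the absolute value:
# the "most typical" point now sits slightly toward rarer tokens.
deviation = (signed_dev - delta).abs()
```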