lorax
marlin
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Was this discussed/approved via a GitHub issue or the Discord / Slack channel? Please add a link to it if that's the case.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Looks like I need help debugging, @tgaddair. The kernel is incompatible with the flash attention kernels: an illegal memory access error occurs every time.
docker run --pull always -v ./data:/data --gpus all -d --shm-size 1g -p 8080:80 ghcr.io/predibase/lorax:marlin --model-id TheBloke/dolphin-2.6-mistral-7B-dpo-GPTQ --quantize marlin
@tgaddair, on the disco research server I read a comment about the incompatibility with fused attention. I don't have any idea whether they want to support it in the future or not.
I think that without flash attention this feature would not make much sense, because of the much higher memory requirements for longer sequences.
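To put a rough number on that memory concern, here is a minimal back-of-the-envelope sketch. It is not a measurement of LoRAX or Marlin. It assumes standard (non-flash) attention materializes a full `seq_len x seq_len` score matrix per head in fp16, and it assumes 32 attention heads as in Mistral-7B. The function name and defaults are hypothetical, chosen for illustration only.

```python
def attention_scores_bytes(
    seq_len: int,
    num_heads: int = 32,      # assumption: Mistral-7B-style head count
    batch_size: int = 1,      # assumption: a single sequence
    bytes_per_elem: int = 2,  # assumption: fp16 scores
) -> int:
    """Bytes needed to hold the full attention score matrix for one layer.

    Standard attention stores one seq_len x seq_len matrix per head,
    so memory grows quadratically with sequence length.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_elem


# Show how the per-layer score matrix grows with sequence length.
for seq_len in (2048, 8192, 32768):
    gib = attention_scores_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB per layer")
```

Under these assumptions the score matrix alone grows quadratically with sequence length, whereas flash attention avoids materializing it and keeps the extra memory roughly linear in sequence length.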
I will keep this PR as a draft until it's compatible, but won't work on it actively.
Thanks @flozi00, we can hold off until that's supported then.