flash-attn is missing from v0.1.12?
Hi Darren,
after watching your YouTube demo this afternoon (https://www.youtube.com/watch?v=JNcwAkbsObE),
I tried to run the dragon_rag_benchmark_tests_llmware.py script with model=llmware/dragon-deci-7b-v0 on my Ubuntu machine (with the latest __version__ = '0.1.12'),
and would like to report two issues:
1. The `flash-attn` package is missing from `setup.py` - I got an error.
2. `torch.cuda.OutOfMemoryError: CUDA out of memory` - my GPU has only 8188 MiB of RAM, but the model itself is 14.1 GB. I then tried model=llmware/dragon-deci-6b-v0; the response is slower, but there is no CUDA out-of-memory error.
Question: will llmware release quantized models to reduce the memory footprint?
It would also be great if llmware could consume Ollama.ai models (e.g., the Mistral-7B model is around 4 GB).
Thanks Wen
Hi Wen,
Thanks for the feedback and sorry that you have run into some issues. A few comments that may help:
- flash-attn should be installed separately. It is platform-dependent and may require some configuration of cuda-dev-tools as well.
- CUDA out-of-memory errors will unfortunately occur with only 8 GB of GPU RAM when trying to run dragon-deci-7b, which generally requires 20-24 GB of GPU RAM (e.g., an A10). There is a quick environment-check sketch below this list.
- we have published 3 dragon GGUF models (dragon-deci coming soon) - if you look in the Examples/Models folder, you will see how to get started. The GGUF files should run well on CPU, as they are already 4-bit quantized; a short loading sketch also follows below. Please let us know if you have any questions/issues - we look forward to hearing about your experience with the GGUF models!
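If it helps with the first two points, here is a minimal sketch (plain PyTorch plus the standard library, not an llmware API) for checking whether flash-attn is importable and how much GPU RAM is available before trying the 7B model; the 20-24 GB figure is the rough requirement mentioned above.

```python
# Quick environment check before loading dragon-deci-7b (illustrative sketch).
import importlib.util

import torch

# flash-attn is installed separately and is platform-dependent, so only probe for it here.
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"GPU RAM: {total_gb:.1f} GB")
    if total_gb < 20:
        # dragon-deci-7b generally needs roughly 20-24 GB of GPU RAM.
        print("Likely not enough GPU RAM for dragon-deci-7b - consider the 4-bit GGUF models on CPU.")
else:
    print("No CUDA GPU detected - use the GGUF models on CPU.")
```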
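Roughly, running one of the GGUF models on CPU looks like the sketch below - the catalog model name and the exact call signature are illustrative assumptions here, so please follow the actual scripts in Examples/Models for the current interface.

```python
# Sketch of running a dragon GGUF model on CPU - model name is illustrative;
# see the scripts in Examples/Models for the exact, up-to-date interface.
from llmware.models import ModelCatalog

model = ModelCatalog().load_model("llmware/dragon-yi-6b-gguf")  # assumed catalog name

context = ("The quarterly report shows revenue of $12 million, "
           "up 8% year over year.")

# dragon models are RAG-tuned: pass the retrieved passage as context along with the question.
response = model.inference("What was the revenue in the quarter?", add_context=context)
print(response)
```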
@gongwork - fyi, we now have support for "open ai chat" compatible models - please check out the example. Hope you were able to get the dragon GGUF models running locally. Please raise another issue if you are still having any problems.
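On the Ollama point specifically, any "open ai chat" compatible server (Ollama exposes one at /v1) can be called with the standard openai Python client along these lines - the base_url, api_key, and model name below are placeholders for a default local Ollama install, not llmware's API; the llmware example shows how this plugs into llmware itself.

```python
# Generic illustration: calling an "open ai chat" compatible local server
# (assumed here to be Ollama's OpenAI-compatible endpoint) with the openai client.
from openai import OpenAI

# Placeholder endpoint and key for a default local Ollama install.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="mistral",  # placeholder model name pulled locally via Ollama
    messages=[{"role": "user", "content": "Summarize the key risks in the Q3 report."}],
)

print(completion.choices[0].message.content)
```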