flash-attn is missing from v0.1.12?
Hi Darren,
after watching your YouTube demo this afternoon (https://www.youtube.com/watch?v=JNcwAkbsObE),
I tried to run the dragon_rag_benchmark_tests_llmware.py script with model=llmware/dragon-deci-7b-v0 on my Ubuntu machine (with the latest __version__ = '0.1.12'),
and would like to report two issues:
1. The `flash-attn` package is missing from `setup.py` - I got an error.
2. `torch.cuda.OutOfMemoryError: CUDA out of memory` - my GPU has only 8188 MiB of RAM, but the model itself is 14.1 GB. I then tried model=llmware/dragon-deci-6b-v0; the response is slower, but there is no CUDA out-of-memory error.
Question: will llmware release quantized models to reduce the memory footprint?
It would also be great if llmware could consume Ollama.ai models (e.g., the Mistral-7B model is around 4 GB).
Thanks Wen
Hi Wen,
Thanks for the feedback and sorry that you have run into some issues. A few comments that may help:
- flash-attn should be installed separately. It is platform-dependent and may require some configuration of cuda-dev-tools as well.
- CUDA out-of-memory errors will unfortunately occur with only 8 GB of GPU RAM when trying to run dragon-deci-7b, which generally requires 20-24 GB of GPU RAM (e.g., an A10). There is a quick environment-check sketch below this list.
- we have published 3 dragon GGUF models (dragon-deci coming soon) - if you look in the Examples/Models folder, you will see how to get started. The GGUF files should run well on CPU, as they are already 4-bit quantized; a short loading sketch also follows below. Please let us know if you have any questions/issues - we look forward to hearing about your experience with the GGUF models!
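If it helps with the first two points, here is a minimal sketch (plain PyTorch plus the standard library, not an llmware API) for checking whether flash-attn is importable and how much GPU RAM is available before trying the 7B model; the 20-24 GB figure is the rough requirement mentioned above.

```python
# Quick environment check before loading dragon-deci-7b (illustrative sketch).
import importlib.util

import torch

# flash-attn is installed separately and is platform-dependent, so only probe for it here.
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"GPU RAM: {total_gb:.1f} GB")
    if total_gb < 20:
        # dragon-deci-7b generally needs roughly 20-24 GB of GPU RAM.
        print("Likely not enough GPU RAM for dragon-deci-7b - consider the 4-bit GGUF models on CPU.")
else:
    print("No CUDA GPU detected - use the GGUF models on CPU.")
```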
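Roughly, running one of the GGUF models on CPU looks like the sketch below - the catalog model name and the exact call signature are illustrative assumptions here, so please follow the actual scripts in Examples/Models for the current interface.

```python
# Sketch of running a dragon GGUF model on CPU - model name is illustrative;
# see the scripts in Examples/Models for the exact, up-to-date interface.
from llmware.models import ModelCatalog

model = ModelCatalog().load_model("llmware/dragon-yi-6b-gguf")  # assumed catalog name

context = ("The quarterly report shows revenue of $12 million, "
           "up 8% year over year.")

# dragon models are RAG-tuned: pass the retrieved passage as context along with the question.
response = model.inference("What was the revenue in the quarter?", add_context=context)
print(response)
```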
@gongwork - fyi, we now have support for "open ai chat" compatible models - please check out the example. Hope you were able to get the dragon GGUF models running locally. Please raise another issue if you are still having any problems.
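On the Ollama point specifically, any "open ai chat" compatible server (Ollama exposes one at /v1) can be called with the standard openai Python client along these lines - the base_url, api_key, and model name below are placeholders for a default local Ollama install, not llmware's API; the llmware example shows how this plugs into llmware itself.

```python
# Generic illustration: calling an "open ai chat" compatible local server
# (assumed here to be Ollama's OpenAI-compatible endpoint) with the openai client.
from openai import OpenAI

# Placeholder endpoint and key for a default local Ollama install.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

completion = client.chat.completions.create(
    model="mistral",  # placeholder model name pulled locally via Ollama
    messages=[{"role": "user", "content": "Summarize the key risks in the Q3 report."}],
)

print(completion.choices[0].message.content)
```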