# LLaMa2 GPTQ
Chat AI that provides responses with reference documents via prompt engineering over a vector database, running locally with GPTQ 4-bit quantization. It also suggests related web pages through integration with my previous product, Texonom.

The goal is local, private, and personal AI without calls to an external API, attained by optimizing inference performance with GPTQ model quantization. This project was inspired by LangChain projects such as notion-qa and localGPT.
## Demos

### CLI Demo
https://github.com/seonglae/llama2gptq/assets/27716524/dba5cd39-ea5c-44d9-bf29-2e8f04039413
### Chat Demo
https://github.com/seonglae/llama2gptq/assets/27716524/258de629-0b61-4670-b76b-9f2357adf4c7
## Install

This project uses rye as its package manager. It is currently only available with CUDA.

```sh
rye sync
```
Or, using pip:

```sh
CUDA_VERSION=cu118
TORCH_VERSION=2.0.1
pip install torch==$TORCH_VERSION --index-url https://download.pytorch.org/whl/$CUDA_VERSION --force-reinstall
pip install .
```
## QA

### 1. Chat with Web UI

```sh
streamlit run chat.py
```

### 2. Chat with CLI

```sh
python main.py chat
```
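For context, answering with reference documents follows the usual retrieval-augmented pattern: embed the question, fetch similar chunks from the vector database, and feed them to the local GPTQ model as context. Below is a minimal sketch of that flow using LangChain's `RetrievalQA`; the model repo, embedding model, and `db` directory are illustrative assumptions rather than this project's exact code.

```python
# Sketch of a retrieval-QA flow over the Chroma DB built in "Ingest Documents".
# Model repo, embedding model, and the "db" path are assumptions for illustration.
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Load a 4-bit GPTQ model for fully local inference (no external API)
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")
llm = HuggingFacePipeline(pipeline=pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256))

# Retrieve reference documents from the persisted Chroma vector database
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
db = Chroma(persist_directory="db", embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(
    llm=llm, retriever=db.as_retriever(), return_source_documents=True)

result = qa({"query": "What is GPTQ quantization?"})
print(result["result"])             # generated answer
print(result["source_documents"])   # reference documents used as context
```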
## Ingest Documents

Currently, the code structure mainly targets CSV data exported from Notion.

### Custom source documents

```sh
# Put document files into the ./knowledge folder
python main.py process

# Or use the provided Texonom DB
git clone https://huggingface.co/datasets/texonom/md-chroma-instructor-xl db
```
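For reference, the processing step boils down to load, split, embed, and persist. Here is a minimal sketch assuming LangChain loaders, `instructor-xl` embeddings (matching the naming of the Texonom DB above), and a Chroma store in `./db`; the actual loaders and chunk parameters in `main.py` may differ.

```python
# Rough sketch of `python main.py process`: load documents, split into chunks,
# embed the chunks, and persist the vectors to a local Chroma DB.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

documents = DirectoryLoader("knowledge").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100).split_documents(documents)  # sizes are illustrative

embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
Chroma.from_documents(chunks, embeddings, persist_directory="db").persist()
```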
## Quantize Model

The default model is Orca 3B for now.

```sh
python main.py quantize --source_model facebook/opt-125m --output opt-125m-4bit-gptq --push
```
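Under the hood this wraps AutoGPTQ. The sketch below follows AutoGPTQ's documented quantization flow; the calibration sentence and quantization parameters are illustrative, and pushing to the Hub is left out.

```python
# GPTQ 4-bit quantization with AutoGPTQ, following its documented API.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

source_model = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(source_model)

# GPTQ needs tokenized calibration examples to estimate quantization error
examples = [tokenizer("GPTQ quantizes weights layer by layer using calibration data.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(source_model, quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```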
## Future Plan

- [ ] MPS support via dynamic model selection
- [ ] Stateful web app support like chat-langchain
## App Stack

### LLM Stack

- LangChain for prompt engineering
- ChromaDB for storing embeddings
- Transformers as the LLM engine
- AutoGPTQ for quantization & inference