Abed
I managed to run Vicuna 13b using LLM API and used it in Langchain: I've written an app to run llama-based models using docker here: https://github.com/1b5d/llm-api thanks to [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)...
Did you try [llm-api](https://github.com/1b5d/llm-api) for CPU inference? You can simply run a docker container and expose the model through a simple API. You can then use [langchain-llm-api](https://github.com/1b5d/langchain-llm-api) to add...
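For reference, here is a minimal sketch of how a client could talk to the exposed endpoint once the container is up. The port, the `/generate` path, and the payload/response keys are assumptions for illustration, not the actual llm-api schema; [langchain-llm-api](https://github.com/1b5d/langchain-llm-api) wraps this kind of HTTP call in a LangChain-compatible LLM class so it can be dropped into a chain.

```python
import requests

# The base URL, path, and payload/response keys below are placeholders --
# check the llm-api README for the actual request schema.
API_BASE = "http://localhost:8000"

def generate(prompt: str, max_tokens: int = 256) -> str:
    """Send a prompt to the running llm-api container and return the completion."""
    resp = requests.post(
        f"{API_BASE}/generate",
        json={"prompt": prompt, "params": {"max_tokens": max_tokens}},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json().get("text", "")  # assumed response shape: {"text": "..."}

if __name__ == "__main__":
    print(generate("What is the capital of France?"))
```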
> > I managed to run Vicuna 13b using LLM API and used it in Langchain:
> > I've written an app to run llama-based models using docker here:...
Follow-up on the comments above: I've recently updated [llm-api](https://github.com/1b5d/llm-api) so it can run Llama.cpp, GPTQ for Llama, or a generic Hugging Face pipeline. You can easily switch between CPU...
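To make the backend switch concrete, here is a rough sketch of what a config for each family could look like. The keys (`model_family`, `model_path`, `params`, ...) are placeholders chosen for illustration, not the real llm-api config schema; the point is only that switching backends means editing the config file rather than the code.

```python
import yaml  # pip install pyyaml

# Placeholder configs -- key names are illustrative, not the real llm-api schema.
llama_cpp_config = {
    "model_family": "llama.cpp",                   # ggml model, CPU inference
    "model_path": "/models/vicuna-13b.ggml.bin",
    "params": {"n_ctx": 2048, "n_threads": 8},
}

gptq_config = {
    "model_family": "gptq_llama",                  # GPTQ-quantized model, GPU inference
    "model_path": "/models/vicuna-13b-4bit-128g",
    "params": {"device": "cuda:0"},
}

# Write whichever backend the container should load on startup.
with open("config.yaml", "w") as f:
    yaml.safe_dump(llama_cpp_config, f)
```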
I've written an app to run llama-based models using docker here: https://github.com/1b5d/llm-api, thanks to [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) and [llama-cpp](https://github.com/ggerganov/llama.cpp). You can specify the model in the config file, and the app...
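As a sketch of the Docker part: start the container with the model directory and the config file mounted. The published port and the in-container paths are assumptions, and the tag could be swapped for one of the backend-specific tags listed further down; the README has the exact values.

```python
import os
import subprocess

cwd = os.getcwd()

# Port and in-container mount paths are assumptions -- adjust to the README.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-p", "8000:8000",
        "-v", f"{cwd}/models:/models",
        "-v", f"{cwd}/config.yaml:/llm-api/config.yaml",
        "1b5d/llm-api:latest",
    ],
    check=True,
)
```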
There are use cases for both local and remote model inference, I believe: I want to run my models on a remote server, while others might have enough hardware power...
Could you please share the configs you are using for this model?
Btw I just built different images for different BLAS backends:

- OpenBLAS: 1b5d/llm-api:latest-openblas
- cuBLAS: 1b5d/llm-api:latest-cublas
- CLBlast: 1b5d/llm-api:latest-clblast
- hipBLAS: 1b5d/llm-api:latest-hipblas

Could you please let me know if that...
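For the GPU-backed variants, the only difference on the run side (under the same port/mount assumptions as the sketch above) is picking the matching tag and giving the container access to the GPUs, e.g. for the cuBLAS image:

```python
import os
import subprocess

cwd = os.getcwd()

# Same placeholder port/mount assumptions as the earlier sketch; "--gpus all"
# requires the NVIDIA container toolkit on the host.
subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-p", "8000:8000",
        "-v", f"{cwd}/models:/models",
        "1b5d/llm-api:latest-cublas",
    ],
    check=True,
)
```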
Hey there! Thanks for the feedback. The current implementation can only run models in the [ggml](https://github.com/ggerganov/ggml) format, in order to do inference on CPUs using the llama.cpp lib, but...
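For anyone who wants to see what CPU inference on a ggml model looks like without the API layer, here is a minimal [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) sketch; the model path is a placeholder.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load a ggml-format model for CPU inference; the path is a placeholder.
llm = Llama(model_path="./models/vicuna-13b.ggml.bin", n_ctx=2048, n_threads=8)

output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
    stop=["Q:", "\n"],
)
print(output["choices"][0]["text"])
```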