NeMo-Guardrails
Performance issue using local LLMs
Hi, getting a response directly from the LLM takes around 2-5 seconds on average, but when using nemo-guardrails one has to wait 5-10 minutes or more for the response. I am using HuggingFacePipelineCompatible to pass the LLM model in the code. It also often fails with a CUDA out-of-memory error, even for smaller models, although I can see in Task Manager that the GPU and memory are not fully utilised. Any suggestions to improve the performance?
@nashugame, can you provide more details? With such a huge difference (from 2-5 seconds to 5-10 mins) it doesn't sound like a performance issue, but rather a configuration issue.
Please find the code snippet attached as nemo-guardrails.txt.
Please find the error we are getting: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 1; 31.74 GiB total capacity; 4.30 GiB already allocated; 11.12 MiB free; 4.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
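For reference, the max_split_size_mb hint in the traceback is applied through the PYTORCH_CUDA_ALLOC_CONF environment variable, which must be set before the process makes its first CUDA allocation. A minimal sketch (the 128 MB value is only a starting point, not a recommendation from this thread) is:

```python
# Sketch: apply the allocator hint from the traceback above. Set the variable
# before importing torch / loading the model so the caching allocator sees it.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # starting value; tune if fragmentation persists

import torch  # noqa: E402
```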
@trebedea / @prasoonvarshney: can you have a look at this?
Guys, any updates on this, @drazvan @trebedea @prasoonvarshney?
Hi @nashugame, the error torch.cuda.OutOfMemoryError: CUDA out of memory means that you are trying to load a model that is bigger than what can fit on your GPU.
I'm not sure what GPU you're using but in practice I've seen that 7b models often need close to 40GB of memory, so an Nvidia RTX A6000 GPU would work well. If a larger GPU is unavailable, I'd recommend using an API-based LLM service of your choice.
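If you go the API route, the main model can be pointed at a hosted service directly in the guardrails config. A minimal sketch (the openai engine and model name here are only examples; any provider supported by LangChain works similarly) is:

```python
# Sketch: using an API-hosted LLM instead of a local HuggingFace pipeline.
# Engine/model names are illustrative; set the provider's API key in your environment.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_content(
    yaml_content="""
models:
  - type: main
    engine: openai
    model: text-davinci-003
"""
)

rails = LLMRails(config)
print(rails.generate(messages=[{"role": "user", "content": "What is the capital of India?"}]))
```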
Hi @prasoonvarshney, I have my LLMs hosted as chat-completion APIs. Can you please point me to how I can feed them into nemo-guardrails?
Hi @nashugame,
I was able to run your sample code; I assume you are using HuggingFaceH4/zephyr-7b-beta.
For me, the model required about 16-18 GB of GPU RAM, and the latencies were as follows (but I changed max_length=500, as otherwise the model in some cases continued to generate answers, especially for the first question not using Guardrails; a sketch of the setup I ran is below).
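Roughly, the setup looks like this (a sketch only, reconstructed from the discussion; the provider name and the exact HuggingFacePipelineCompatible import path are assumptions and may differ between nemoguardrails versions):

```python
# Sketch: load HuggingFaceH4/zephyr-7b-beta locally and register it with the guardrails.
# Details (dtype, max_length, provider name) are assumptions based on the discussion above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.llm.helpers import get_llm_instance_wrapper
from nemoguardrails.llm.providers import HuggingFacePipelineCompatible, register_llm_provider

model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 7B model around 16-18 GB of GPU RAM
    device_map="auto",
)

hf_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=500,  # cap generation so the model does not keep producing extra Q&A turns
)

llm = HuggingFacePipelineCompatible(pipeline=hf_pipeline)
register_llm_provider(
    "hf_pipeline_zephyr",
    get_llm_instance_wrapper(llm_instance=llm, llm_type="hf_pipeline_zephyr"),
)

# config.yml should then select engine: hf_pipeline_zephyr for the main model.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)
print(rails.generate(messages=[{"role": "user", "content": "What is the capital of India?"}]))
```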
Using an A6000 GPU, the assistant response generation took 4-8 seconds with Guardrails and about 2-3 seconds without. This makes sense as the prompt is longer with Guardrails and in some cases can require more than a single LLM call.
A sample output is below (the number after each response is the generation time in seconds):
Answer: Delhi is the capital city of India.
// GENERATED A LONGER SEQUENCE OF Q&A IN THIS RUN. IN OTHER RUNS IT GENERATED A SINGLE LINE IN ABOUT 2 SECONDS.
I hope these questions help you test your knowledge about India! Let me know if you have any other requests for quizzes or topics.
11.771398305892944
Hi! I'm doing well, thank you for asking. How may I help you today?
How may I help you today? Before we begin, I'd like to ask, how are you doing today?
7.97413444519043
I'm a shopping assistant, I don't like to talk of sports.
But if you need any assistance with finding information about football teams, please let me know.
6.766375303268433
Can you expand a bit on how you host your LLMs in the form of chat-completion APIs? The simplest thing to do is to write a wrapper for them in LangChain.
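A rough sketch of such a wrapper (the endpoint URL, payload shape, and response parsing below are placeholders for your own service, not a known API) could look like this:

```python
# Sketch: wrap a hosted chat-completion HTTP API as a LangChain LLM and register it
# with NeMo Guardrails. URL, headers and response schema are placeholders.
from typing import Any, List, Optional

import requests
from langchain.llms.base import LLM

from nemoguardrails.llm.helpers import get_llm_instance_wrapper
from nemoguardrails.llm.providers import register_llm_provider


class ChatCompletionAPI(LLM):
    """Minimal LangChain wrapper around a hosted chat-completion endpoint."""

    api_url: str = "https://your-host/v1/chat/completions"  # placeholder
    api_key: str = "YOUR_API_KEY"  # placeholder

    @property
    def _llm_type(self) -> str:
        return "custom_chat_completion"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        # Guardrails passes a plain prompt string; forward it as a single user message.
        response = requests.post(
            self.api_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        response.raise_for_status()
        # Adjust this to match your API's actual response schema.
        return response.json()["choices"][0]["message"]["content"]


register_llm_provider(
    "custom_chat_completion",
    get_llm_instance_wrapper(llm_instance=ChatCompletionAPI(), llm_type="custom_chat_completion"),
)
```

After registering the provider, you can select it as the main model (engine: custom_chat_completion) in your guardrails config.yml.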