Extremely high GPU memory consumption for evaluation with a custom LLM (over 60 GB)
Hello, I created a testset and ran it through my RAG pipeline to get the retrieved documents and an answer for each question. I now have 50 samples of [question, ground_truth, documents, answer] that I want to compute context_recall for.
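For context, the dataset is a HuggingFace Dataset built roughly like this (column names as I understand them from the ragas docs; the rows here are only placeholders, and depending on the ragas version the reference column may need to be called ground_truths):
from datasets import Dataset

# placeholder rows; the real dataset holds the 50 samples from my pipeline
data = {
    "question": ["What does the product warranty cover?"],
    "answer": ["the answer generated by my RAG pipeline ..."],
    "contexts": [["retrieved document 1 ...", "retrieved document 2 ..."]],
    "ground_truth": ["the reference answer from the testset ..."],
}
dataset = Dataset.from_dict(data)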
Code for my custom LLM:
from torch import bfloat16
import transformers
from transformers import BitsAndBytesConfig, LlamaTokenizer
import os
from langchain.llms import HuggingFacePipeline

model_id = 'meta-llama/Llama-2-7b-chat-hf'  # 'HuggingFaceH4/zephyr-7b-alpha'
hf_auth = '...'
os.environ['OPENAI_API_KEY'] = "..."

# 4-bit NF4 quantization so the 7B model fits in a fraction of one GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(model_id, token=hf_auth)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_auth,
)

tokenizer = LlamaTokenizer.from_pretrained(model_id, token=hf_auth)

generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,    # langchain expects the full text
    # model parameters are passed here too
    temperature=0.01,         # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,
    repetition_penalty=1.1,   # without this the output begins repeating
    use_cache=True,
    # num_return_sequences=1,
    # eos_token_id=tokenizer.eos_token_id,
    # pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=generate_text)
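As a quick sanity check, the wrapper can be exercised with a single prompt before it is handed to ragas (minimal sketch, the prompt is just a placeholder):
# minimal sanity check of the wrapped pipeline (placeholder prompt)
print(llm("Explain retrieval-augmented generation in one sentence."))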
Then I run the evaluation:
from ragas import evaluate
from ragas.metrics import (
    context_recall,
)

result = evaluate(
    dataset,
    metrics=[
        context_recall,
    ],
    llm=llm,
)
result
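One workaround I can imagine (not sure it is the intended usage) is to evaluate in small slices and empty the CUDA cache in between, so that the peak memory of a single evaluate() call stays smaller; a rough sketch:
import torch

# sketch: evaluate 10 samples at a time and free cached blocks in between
chunk_size = 10
partial_results = []
for start in range(0, len(dataset), chunk_size):
    chunk = dataset.select(range(start, min(start + chunk_size, len(dataset))))
    partial_results.append(evaluate(chunk, metrics=[context_recall], llm=llm))
    torch.cuda.empty_cache()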
It runs for a while and then fails with the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 1 has a total capacity of 31.74 GiB of which 29.44 MiB is free. Including non-PyTorch memory, this process has 31.70 GiB memory in use. Of the allocated memory 27.44 GiB is allocated by PyTorch, and 3.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I tried to customize the PyTorch CUDA allocator configuration to make it more memory-efficient, but this did not change anything:
import os
# note: each assignment overwrites the previous one, so only the last value is active at a time
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:native"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
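If I read the allocator documentation correctly, multiple options have to go into a single comma-separated value, and the variable must be set before the first CUDA allocation in the process; a combined version would look roughly like this:
import os
# sketch: allocator options combined into one comma-separated value,
# set before any CUDA memory is allocated
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:64"
)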
Here you can see my memory allocation:
Device 0: Tesla V100-SXM2-32GB
Total memory: 31.74 GB
Allocated memory: 22.81 GB
Cached memory: 31.10 GB
Device 1: Tesla V100-SXM2-32GB
Total memory: 31.74 GB
Allocated memory: 27.59 GB
Cached memory: 31.11 GB
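For reference, the per-device summary above can be reproduced with a few torch.cuda calls (a small sketch; "cached" corresponds to what current PyTorch reports as reserved memory):
import torch
# print total / allocated / reserved ("cached") memory per visible GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"Device {i}: {props.name}")
    print(f"  Total memory: {props.total_memory / 1024**3:.2f} GB")
    print(f"  Allocated memory: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GB")
    print(f"  Cached memory: {torch.cuda.memory_reserved(i) / 1024**3:.2f} GB")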
Is there any way to do the evaluation with a custom LLM without consuming ungodly amounts of memory? IMO 50 questions is not that much, and I just expected it to work. Does someone know how to handle this?