Extremely high GPU memory consumption for evaluation with a custom LLM (over 60 GB)
Hello, I created a testset and ran it through my RAG pipeline to get the retrieved documents and an answer for each question. I now have 50 samples of [question, ground_truth, documents, answer] that I want to compute context_recall for.
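For context, the dataset is a HuggingFace Dataset built roughly like this (column names as I understand them from the ragas docs; the rows here are only placeholders, and depending on the ragas version the reference column may need to be called ground_truths):
from datasets import Dataset

# placeholder rows; the real dataset holds the 50 samples from my pipeline
data = {
    "question": ["What does the product warranty cover?"],
    "answer": ["the answer generated by my RAG pipeline ..."],
    "contexts": [["retrieved document 1 ...", "retrieved document 2 ..."]],
    "ground_truth": ["the reference answer from the testset ..."],
}
dataset = Dataset.from_dict(data)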
Code for my custom LLM:
from torch import bfloat16
import transformers
from transformers import BitsAndBytesConfig, LlamaTokenizer
import os
from langchain.llms import HuggingFacePipeline

model_id = 'meta-llama/Llama-2-7b-chat-hf'  # 'HuggingFaceH4/zephyr-7b-alpha'
hf_auth = '...'
os.environ['OPENAI_API_KEY'] = "..."

# 4-bit NF4 quantization so the 7B model fits in a fraction of one GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(model_id, token=hf_auth)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    token=hf_auth,
)

tokenizer = LlamaTokenizer.from_pretrained(model_id, token=hf_auth)

generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,    # langchain expects the full text
    # model parameters are passed here too
    temperature=0.01,         # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,
    repetition_penalty=1.1,   # without this the output begins repeating
    use_cache=True,
    # num_return_sequences=1,
    # eos_token_id=tokenizer.eos_token_id,
    # pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=generate_text)
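As a quick sanity check, the wrapper can be exercised with a single prompt before it is handed to ragas (minimal sketch, the prompt is just a placeholder):
# minimal sanity check of the wrapped pipeline (placeholder prompt)
print(llm("Explain retrieval-augmented generation in one sentence."))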
Then I run the evaluation:
from ragas import evaluate
from ragas.metrics import (
    context_recall,
)

result = evaluate(
    dataset,
    metrics=[
        context_recall,
    ],
    llm=llm,
)
result
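One workaround I can imagine (not sure it is the intended usage) is to evaluate in small slices and empty the CUDA cache in between, so that the peak memory of a single evaluate() call stays smaller; a rough sketch:
import torch

# sketch: evaluate 10 samples at a time and free cached blocks in between
chunk_size = 10
partial_results = []
for start in range(0, len(dataset), chunk_size):
    chunk = dataset.select(range(start, min(start + chunk_size, len(dataset))))
    partial_results.append(evaluate(chunk, metrics=[context_recall], llm=llm))
    torch.cuda.empty_cache()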
It runs for a while and then fails with the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 1 has a total capacity of 31.74 GiB of which 29.44 MiB is free. Including non-PyTorch memory, this process has 31.70 GiB memory in use. Of the allocated memory 27.44 GiB is allocated by PyTorch, and 3.65 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I tried to customize the PyTorch CUDA allocator configuration to make it more memory-efficient, but this did not change anything:
import os
# note: each assignment overwrites the previous one, so only the last value is active at a time
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:native"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
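If I read the allocator documentation correctly, multiple options have to go into a single comma-separated value, and the variable must be set before the first CUDA allocation in the process; a combined version would look roughly like this:
import os
# sketch: allocator options combined into one comma-separated value,
# set before any CUDA memory is allocated
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:64"
)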
Here you can see my memory allocation:
Device 0: Tesla V100-SXM2-32GB
Total memory: 31.74 GB
Allocated memory: 22.81 GB
Cached memory: 31.10 GB
Device 1: Tesla V100-SXM2-32GB
Total memory: 31.74 GB
Allocated memory: 27.59 GB
Cached memory: 31.11 GB
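For reference, the per-device summary above can be reproduced with a few torch.cuda calls (a small sketch; "cached" corresponds to what current PyTorch reports as reserved memory):
import torch
# print total / allocated / reserved ("cached") memory per visible GPU
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"Device {i}: {props.name}")
    print(f"  Total memory: {props.total_memory / 1024**3:.2f} GB")
    print(f"  Allocated memory: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GB")
    print(f"  Cached memory: {torch.cuda.memory_reserved(i) / 1024**3:.2f} GB")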
Is there any way to do the evaluation with a custom LLM without consuming ungodly amounts of memory? IMO 50 questions is not that much, and I just expected it to work. Does someone know how to handle this?