Fine-tuning LLM for Classification - OutOfMemory
Describe the bug
I followed the code examples from the documentation to train Llama2 for sentence classification. I have a T4, so I used (as recommended) a LoRA adapter and 4-bit quantization. I even used a sharded version of Llama2, but I keep getting the following OutOfMemory error.
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[48], line 1
----> 1 train_stats = model.train(dataset=X_train)
File ~/venv/lib/python3.10/site-packages/ludwig/api.py:619, in LudwigModel.train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
616 detected_learning_rate = get_auto_learning_rate(self.config_obj)
617 self.config_obj.trainer.learning_rate = detected_learning_rate
--> 619 with self.backend.create_trainer(
620 model=self.model,
621 config=self.config_obj.trainer,
622 resume=model_resume_path is not None,
623 skip_save_model=skip_save_model,
624 skip_save_progress=skip_save_progress,
625 skip_save_log=skip_save_log,
626 callbacks=train_callbacks,
627 random_seed=random_seed,
628 ) as trainer:
629 # auto tune batch size
630 self._tune_batch_size(trainer, training_set, random_seed=random_seed)
632 if (
633 self.config_obj.model_type == "LLM"
634 and trainer.config.type == "none"
635 and self.config_obj.adapter is not None
636 and self.config_obj.adapter.pretrained_adapter_weights is not None
637 ):
File ~/venv/lib/python3.10/site-packages/ludwig/backend/base.py:293, in LocalBackend.create_trainer(self, config, model, **kwargs)
290 else:
291 trainer_cls = get_from_registry(model.type(), get_trainers_registry())
--> 293 return trainer_cls(config=config, model=model, **kwargs)
File ~/venv/lib/python3.10/site-packages/ludwig/trainers/trainer.py:180, in Trainer.__init__(self, config, model, resume, skip_save_model, skip_save_progress, skip_save_log, callbacks, report_tqdm_to_ray, random_seed, distributed, device, **kwargs)
178 self.model = model
179 self.model.prepare_for_training()
--> 180 self.model = self.distributed.to_device(self.model)
181 self.model.metrics_to_device(self.device)
183 self.compiled_model = self.model
File ~/venv/lib/python3.10/site-packages/ludwig/distributed/base.py:53, in DistributedStrategy.to_device(self, model, device)
52 def to_device(self, model: "BaseModel", device: Optional[torch.device] = None) -> nn.Module:
---> 53 return model.to_device(device if device is not None else get_torch_device())
File ~/venv/lib/python3.10/site-packages/ludwig/models/base.py:63, in BaseModel.to_device(self, device)
62 def to_device(self, device):
---> 63 return self.to(device)
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1145, in Module.to(self, *args, **kwargs)
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
[... skipping similar frames: Module._apply at line 797 (9 times)]
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:820, in Module._apply(self, fn)
816 # Tensors stored in modules are graph leaves, and we don't want to
817 # track autograd history of `param_applied`, so we have to use
818 # `with torch.no_grad():`
819 with torch.no_grad():
--> 820 param_applied = fn(param)
821 should_use_set_data = compute_should_use_set_data(param, param_applied)
822 if should_use_set_data:
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1143, in Module.to.<locals>.convert(t)
1140 if convert_to_format is not None and t.dim() in (4, 5):
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 14.58 GiB total capacity; 14.36 GiB already allocated; 103.62 MiB free; 14.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
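For reference, the allocator hint mentioned at the end of the error can be set via an environment variable before CUDA is initialized (a minimal sketch; it only helps when the OOM comes from fragmentation rather than the model simply not fitting):
# Hedged sketch: sets the allocator hint quoted in the error message above.
# It only mitigates fragmentation-related OOMs; it cannot make a model that is
# simply too large fit on the GPU. Must be set before CUDA is initialized.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"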
My dataset consists of sentences and their labels:
NEGATIVE 4906
NEUTRAL 4906
POSITIVE 4906
Here is also the experiment description:
╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛
╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name │ api_experiment │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name │ run │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/results/api_experiment_run_10 │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version │ '0.8.6' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ command │ ('/home/dpasch01/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f ' │
│ │ '/home/dpasch01/.local/share/jupyter/runtime/kernel-b76763c7-3249-4aa6-a362-93c4b9f99d18.json') │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ random_seed │ 42 │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ data_format │ "<class 'pandas.core.frame.DataFrame'>" │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ torch_version │ '2.0.1+cu117' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ compute │ { 'arch_list': [ 'sm_37', │
│ │ 'sm_50', │
│ │ 'sm_60', │
│ │ 'sm_70', │
│ │ 'sm_75', │
│ │ 'sm_80', │
│ │ 'sm_86'], │
│ │ 'devices': { 0: { 'device_capability': (7, 5), │
│ │ 'device_properties': "_CudaDeviceProperties(name='Tesla " │
│ │ "T4', major=7, minor=5, " │
│ │ 'total_memory=14930MB, ' │
│ │ 'multi_processor_count=40)', │
│ │ 'gpu_type': 'Tesla T4'}}, │
│ │ 'gencode_flags': '-gencode compute=compute_37,code=sm_37 -gencode ' │
│ │ 'compute=compute_50,code=sm_50 -gencode ' │
│ │ 'compute=compute_60,code=sm_60 -gencode ' │
│ │ 'compute=compute_70,code=sm_70 -gencode ' │
│ │ 'compute=compute_75,code=sm_75 -gencode ' │
│ │ 'compute=compute_80,code=sm_80 -gencode ' │
│ │ 'compute=compute_86,code=sm_86', │
│ │ 'gpus_per_node': 1, │
│ │ 'num_nodes': 1} │
╘══════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════════╛
╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛
User-specified config (with upgrades):
{ 'adapter': {'type': 'lora'},
'input_features': [ { 'encoder': { 'adapter': {'type': 'lora'},
'pretrained_model_name_or_path': 'abhishek/llama-2-7B-hf-small-shards',
'properties': {'type': 'lora'},
'quantization': {'bits': 4},
'trainable': False,
'type': 'auto_transformer'},
'name': 'sentence',
'preprocessing': {'max_sequence_length': 256},
'type': 'text'}],
'ludwig_version': '0.8.6',
'output_features': [{'name': 'attitude', 'type': 'category'}],
'quantization': {'bits': 4},
'trainer': { 'batch_size': 1,
'enable_gradient_checkpointing': True,
'epochs': 3,
'eval_batch_size': 2,
'gradient_accumulation_steps': 16,
'learning_rate': 1e-05}}
Full config saved to:
/home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/results/api_experiment_run_10/api_experiment/model/model_hyperparameters.json
╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛
No cached dataset found at /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.training.hdf5. Preprocessing the dataset.
Using full dataframe
Building dataset (it may take a while)
Loaded HuggingFace implementation of abhishek/llama-2-7B-hf-small-shards tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'sentence': 158 (without start and stop symbols)
Setting max length using dataset: 160 (including start and stop symbols)
max sequence length is 160 for feature 'sentence'
Loaded HuggingFace implementation of abhishek/llama-2-7B-hf-small-shards tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Building dataset: DONE
Writing preprocessed training set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.training.hdf5
Writing preprocessed validation set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.validation.hdf5
Writing preprocessed test set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.test.hdf5
Writing train set metadata to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.meta.json
Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset │ Size (Rows) │ Size (In Memory) │
╞════════════╪═══════════════╪════════════════════╡
│ Training │ 7212 │ 852.32 Kb │
├────────────┼───────────────┼────────────────────┤
│ Validation │ 1030 │ 121.83 Kb │
├────────────┼───────────────┼────────────────────┤
│ Test │ 2060 │ 243.54 Kb │
╘════════════╧═══════════════╧════════════════════╛
╒═══════╕
│ MODEL │
╘═══════╛
Warnings and other logs:
To Reproduce
config = {
"adapter": {"type": "lora"},
"quantization": {"bits": 4},
"input_features": [
{
"name": "sentence",
"type": "text",
"preprocessing": {"max_sequence_length": 256},
"encoder": {
"type": "auto_transformer",
#########################################################################
# "pretrained_model_name_or_path": "TinyPixel/Llama-2-7B-bf16-sharded", #
#########################################################################
"pretrained_model_name_or_path": "abhishek/llama-2-7B-hf-small-shards",
"trainable": False,
"properties" : {"type": "lora"},
"adapter": {"type": "lora"},
"quantization": {"bits": 4}
}
}
],
"output_features": [
{
"name": "sentiment",
"type": "category",
}
],
"trainer": {
"epochs": 3,
"learning_rate": 0.00001,
"batch_size": 1,
"eval_batch_size": 2,
"gradient_accumulation_steps": 16,
"enable_gradient_checkpointing": True
}
}
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
Expected behavior
In my understanding, this should be able to load and train the sharded Llama2 on a T4 with 16GB of VRAM.
Environment (please complete the following information):
- OS: Ubuntu
- Version: 22
- Python version: 2.9
- Ludwig version: 0.8
Hello, @dpasch01 and thank you for filing this issue. Could you please confirm your Python version -- is it really 2.9? Also, could you please share your entire configuration, and we will take a look. Thank you very much.
Hello,
Sorry about that. The virtual environment is on Python 3.10.12. Also, what do you mean by "entire configuration"? I am running this on a server with 80 cores of Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, 64GB of RAM, and a 16GB Tesla T4 GPU.
I have a Jupyter notebook running, where the majority of the code prepares the dataset. The only part that produces the above error is the following:
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
Hi @dpasch01 -- thank you for that information. In addition, I was also hoping to see your Ludwig configuration -- the config that you are passing to LudwigModel. Thank you.
Hello @alexsherstinsky,
I am not passing a file. I followed the examples on the website, like in this colab, where the config is passed as a Python dictionary similar to the one I provided above. Here it is again:
config = {
"adapter": {"type": "lora"},
"quantization": {"bits": 4},
"input_features": [
{
"name": "sentence",
"type": "text",
"preprocessing": {"max_sequence_length": 256},
"encoder": {
"type": "auto_transformer",
#########################################################################
# "pretrained_model_name_or_path": "TinyPixel/Llama-2-7B-bf16-sharded", #
#########################################################################
"pretrained_model_name_or_path": "abhishek/llama-2-7B-hf-small-shards",
"trainable": False,
"properties" : {"type": "lora"},
"adapter": {"type": "lora"},
"quantization": {"bits": 4}
}
}
],
"output_features": [
{
"name": "sentiment",
"type": "category",
}
],
"trainer": {
"epochs": 3,
"learning_rate": 0.00001,
"batch_size": 1,
"eval_batch_size": 2,
"gradient_accumulation_steps": 16,
"enable_gradient_checkpointing": True
}
}
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
@dpasch01 Thank you -- no problem regarding not passing a file. Looking into it now, and will get back to you within a couple of business days. Thanks.
Hello again, @dpasch01 -- For LLM fine-tuning, I was looking for entries such as:
"model_type": "llm",
"base_model": "abhishek/llama-2-7B-hf-small-shards",
I am not seeing these entries in your configuration dictionary.
How about removing the encoder section entirely, adding what I entered above, and trying again?
Here is an example configuration, which you can customize for your own use case:
import yaml

qlora_fine_tuning_config: dict = yaml.safe_load(
"""
model_type: llm
base_model: abhishek/llama-2-7B-hf-small-shards
input_features:
- name: sentence
type: text
preprocessing:
max_sequence_length: 1024
output_features:
- name: sentiment
type: text
preprocessing:
max_sequence_length: 384
prompt:
template: >-
Summarize the issue/question found in the input text:
### Transcript: {sentence}
### Task Type:
generation:
temperature: 0.1
max_new_tokens: 512
adapter:
type: lora
quantization:
bits: 4
preprocessing:
split:
# type: random
# probabilities: [0.9, 0.05, 0.05]
type: fixed
trainer:
type: finetune
epochs: 3
batch_size: 1
eval_batch_size: 2
gradient_accumulation_steps: 16 # effective batch size = batch size * gradient_accumulation_steps
learning_rate: 2.0e-4
enable_gradient_checkpointing: true
learning_rate_scheduler:
decay: cosine
warmup_fraction: 0.03
reduce_on_plateau: 0
"""
)
This would reframe your problem as a text-to-text fine-tuning task.
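As a minimal sketch of wiring this config into the API (reusing the X_train dataframe from your snippet; purely illustrative):
import logging

from ludwig.api import LudwigModel

# Train with the QLoRA fine-tuning config defined above.
model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)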
Importantly, please make sure that you specify the preprocessing section correctly. Please read the documentation for which "split" column values are expected in the dataset; alternatively, uncomment the values above (for "random") and comment out "fixed". Finally, the prompt text is very important for your use case, so please edit it accordingly.
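For example, here is a hedged sketch of adding a fixed-split column to your dataframe (it assumes Ludwig's convention of a column named "split" with 0 = train, 1 = validation, 2 = test -- please double-check against the documentation for your version):
import numpy as np

# The column name "split" and the 0/1/2 encoding are assumptions to verify
# against the Ludwig docs; here rows are assigned at random just for illustration.
X_train["split"] = np.random.choice([0, 1, 2], size=len(X_train), p=[0.9, 0.05, 0.05])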
If these ideas sound good to you, please try them and let us know. Thank you.
Hello and thank you for your quick response.
I saw this solution for text-to-text fine-tuning. However, my goal is to leverage the power of Llama2 for a particular case of sentiment analysis, which is a multi-class classification task (0 for Neutral, 1 for Negative, and 2 for Positive).
Wouldn't a text-to-text approach overcomplicate training and inference, and significantly increase the resources needed?
I was more interested in your LLM classification approach from here:
input_features:
- name: review
type: text
encoder:
type: auto_transformer
pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
trainable: true
adapter: lora
output_features:
- name: sentiment
type: category
However, it still throws an OutOfMemory error.
@dpasch01 I would have to investigate this further -- could you please confirm that you are referring to the very last example on the classification page you linked to in your previous message, and to the notebook included as part of that document? If so, I will need a few days, as I would have to run that example (and it may require multiple GPUs). In the meantime, would you be willing to try the approach I described? It should run in a free Google Colab notebook, and we have many examples. On a final note -- we are now running a contest: https://predibase.com/blog/announcing-the-ludwig-10k-giveaway-competition -- please join if you can! Thank you!
Thank you very much! I will try to run the text-to-text approach with the snippet provided. I don't have access to multiple GPUs; I am more interested in why fine-tuning the whole text-to-text Llama2 fits in memory, while using just the encoder does not.
Will try to enter the competition! Thank you!
@dpasch01 Well, that is part of the complexity of the situation -- the documentation potentially calls for multiple GPUs, so I would have to work with a colleague to go through that example thoroughly and figure out what should work and exactly how. In the meantime, the LLM fine-tuning approach will certainly work. I will get back to you here on this case; in the meantime, see you in the Ludwig Slack for the competition! Thank you for using Ludwig!
@dpasch01 Following up on this. I do not think that the entire meta-llama/Llama-2-7b-hf would fit into the memory of a single commodity GPU (like a T4) along with the extra memory required for training and inference. The examples in our documentation use much smaller, 350M-parameter models (like facebook/opt-350m). Please let me know how you would like to continue; otherwise, please close this issue. Thank you.
Hey @alexsherstinsky. This is understandable; however, I am using the sharded version, abhishek/llama-2-7B-hf-small-shards. Shouldn't this work?
Hi @dpasch01! I believe the issue here is that we don't support quantisation for ECD model types, which is what you're trying to train. We only support quantisation for LLM model types, where you can only train text-to-text models.
So, while it is in your config, it's not actually being used. On our end, we can add some more checks to flag this to the user so they know (cc: @alexsherstinsky).
The net effect here is that your model is getting loaded in fp16, which by itself takes about 13GB of GPU memory (and that far exceeds a T4 on Colab, which I think only has 12GB). So that is 13GB for the base model, followed by the memory for the LoRA weights. Then you need headroom for your batch plus gradients, which scales dramatically for a model of this size, since each sample in the batch also needs on the order of 7B bytes of memory, assuming 1 byte per activation per parameter. All of this would easily push memory usage over the limit.
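For a rough sense of the numbers (a back-of-the-envelope sketch that ignores the CUDA context, activations, optimizer state, and allocator overhead):
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9
print(f"fp16 weights:  {params * 2 / 1024**3:.1f} GiB")   # ~13.0 GiB (2 bytes/param)
print(f"4-bit weights: {params * 0.5 / 1024**3:.1f} GiB")  # ~3.3 GiB (0.5 bytes/param)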
Hope this helps!
Thanks for your response @arnavgarg1.
A flag would be nice, yes. I will try with a lighter model, as @alexsherstinsky suggested, just to see what happens.