Fine-tuning LLM for Classification - OutOfMemory
Describe the bug
I followed the code examples from the documentation to train Llama2 for sentence classification. I have a T4, so I used (as recommended) a LoRA adapter and 4-bit quantization. I even used a sharded version of Llama2, but I keep getting the following OutOfMemory error.
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[48], line 1
----> 1 train_stats = model.train(dataset=X_train)
File ~/venv/lib/python3.10/site-packages/ludwig/api.py:619, in LudwigModel.train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
616 detected_learning_rate = get_auto_learning_rate(self.config_obj)
617 self.config_obj.trainer.learning_rate = detected_learning_rate
--> 619 with self.backend.create_trainer(
620 model=self.model,
621 config=self.config_obj.trainer,
622 resume=model_resume_path is not None,
623 skip_save_model=skip_save_model,
624 skip_save_progress=skip_save_progress,
625 skip_save_log=skip_save_log,
626 callbacks=train_callbacks,
627 random_seed=random_seed,
628 ) as trainer:
629 # auto tune batch size
630 self._tune_batch_size(trainer, training_set, random_seed=random_seed)
632 if (
633 self.config_obj.model_type == "LLM"
634 and trainer.config.type == "none"
635 and self.config_obj.adapter is not None
636 and self.config_obj.adapter.pretrained_adapter_weights is not None
637 ):
File ~/venv/lib/python3.10/site-packages/ludwig/backend/base.py:293, in LocalBackend.create_trainer(self, config, model, **kwargs)
290 else:
291 trainer_cls = get_from_registry(model.type(), get_trainers_registry())
--> 293 return trainer_cls(config=config, model=model, **kwargs)
File ~/venv/lib/python3.10/site-packages/ludwig/trainers/trainer.py:180, in Trainer.__init__(self, config, model, resume, skip_save_model, skip_save_progress, skip_save_log, callbacks, report_tqdm_to_ray, random_seed, distributed, device, **kwargs)
178 self.model = model
179 self.model.prepare_for_training()
--> 180 self.model = self.distributed.to_device(self.model)
181 self.model.metrics_to_device(self.device)
183 self.compiled_model = self.model
File ~/venv/lib/python3.10/site-packages/ludwig/distributed/base.py:53, in DistributedStrategy.to_device(self, model, device)
52 def to_device(self, model: "BaseModel", device: Optional[torch.device] = None) -> nn.Module:
---> 53 return model.to_device(device if device is not None else get_torch_device())
File ~/venv/lib/python3.10/site-packages/ludwig/models/base.py:63, in BaseModel.to_device(self, device)
62 def to_device(self, device):
---> 63 return self.to(device)
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1145, in Module.to(self, *args, **kwargs)
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
-> 1145 return self._apply(convert)
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
[... skipping similar frames: Module._apply at line 797 (9 times)]
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:797, in Module._apply(self, fn)
795 def _apply(self, fn):
796 for module in self.children():
--> 797 module._apply(fn)
799 def compute_should_use_set_data(tensor, tensor_applied):
800 if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
801 # If the new tensor has compatible tensor type as the existing tensor,
802 # the current behavior is to change the tensor in-place using `.data =`,
(...)
807 # global flag to let the user control whether they want the future
808 # behavior of overwriting the existing tensor or not.
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:820, in Module._apply(self, fn)
816 # Tensors stored in modules are graph leaves, and we don't want to
817 # track autograd history of `param_applied`, so we have to use
818 # `with torch.no_grad():`
819 with torch.no_grad():
--> 820 param_applied = fn(param)
821 should_use_set_data = compute_should_use_set_data(param, param_applied)
822 if should_use_set_data:
File ~/venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1143, in Module.to.<locals>.convert(t)
1140 if convert_to_format is not None and t.dim() in (4, 5):
1141 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None,
1142 non_blocking, memory_format=convert_to_format)
-> 1143 return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 14.58 GiB total capacity; 14.36 GiB already allocated; 103.62 MiB free; 14.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
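For reference, the allocator hint mentioned at the end of the error can be set via an environment variable before CUDA is initialized (a minimal sketch; it only helps when the OOM comes from fragmentation rather than the model simply not fitting):
# Hedged sketch: sets the allocator hint quoted in the error message above.
# It only mitigates fragmentation-related OOMs; it cannot make a model that is
# simply too large fit on the GPU. Must be set before CUDA is initialized.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"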
My dataset consists of sentences and their labels:
NEGATIVE 4906
NEUTRAL 4906
POSITIVE 4906
Here is also the experiment description:
╒════════════════════════╕
│ EXPERIMENT DESCRIPTION │
╘════════════════════════╛
╒══════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════════╕
│ Experiment name │ api_experiment │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Model name │ run │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Output directory │ /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/results/api_experiment_run_10 │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ludwig_version │ '0.8.6' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ command │ ('/home/dpasch01/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f ' │
│ │ '/home/dpasch01/.local/share/jupyter/runtime/kernel-b76763c7-3249-4aa6-a362-93c4b9f99d18.json') │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ random_seed │ 42 │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ data_format │ "<class 'pandas.core.frame.DataFrame'>" │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ torch_version │ '2.0.1+cu117' │
├──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
│ compute │ { 'arch_list': [ 'sm_37', │
│ │ 'sm_50', │
│ │ 'sm_60', │
│ │ 'sm_70', │
│ │ 'sm_75', │
│ │ 'sm_80', │
│ │ 'sm_86'], │
│ │ 'devices': { 0: { 'device_capability': (7, 5), │
│ │ 'device_properties': "_CudaDeviceProperties(name='Tesla " │
│ │ "T4', major=7, minor=5, " │
│ │ 'total_memory=14930MB, ' │
│ │ 'multi_processor_count=40)', │
│ │ 'gpu_type': 'Tesla T4'}}, │
│ │ 'gencode_flags': '-gencode compute=compute_37,code=sm_37 -gencode ' │
│ │ 'compute=compute_50,code=sm_50 -gencode ' │
│ │ 'compute=compute_60,code=sm_60 -gencode ' │
│ │ 'compute=compute_70,code=sm_70 -gencode ' │
│ │ 'compute=compute_75,code=sm_75 -gencode ' │
│ │ 'compute=compute_80,code=sm_80 -gencode ' │
│ │ 'compute=compute_86,code=sm_86', │
│ │ 'gpus_per_node': 1, │
│ │ 'num_nodes': 1} │
╘══════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════════╛
╒═══════════════╕
│ LUDWIG CONFIG │
╘═══════════════╛
User-specified config (with upgrades):
{ 'adapter': {'type': 'lora'},
'input_features': [ { 'encoder': { 'adapter': {'type': 'lora'},
'pretrained_model_name_or_path': 'abhishek/llama-2-7B-hf-small-shards',
'properties': {'type': 'lora'},
'quantization': {'bits': 4},
'trainable': False,
'type': 'auto_transformer'},
'name': 'sentence',
'preprocessing': {'max_sequence_length': 256},
'type': 'text'}],
'ludwig_version': '0.8.6',
'output_features': [{'name': 'attitude', 'type': 'category'}],
'quantization': {'bits': 4},
'trainer': { 'batch_size': 1,
'enable_gradient_checkpointing': True,
'epochs': 3,
'eval_batch_size': 2,
'gradient_accumulation_steps': 16,
'learning_rate': 1e-05}}
Full config saved to:
/home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/results/api_experiment_run_10/api_experiment/model/model_hyperparameters.json
╒═══════════════╕
│ PREPROCESSING │
╘═══════════════╛
No cached dataset found at /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.training.hdf5. Preprocessing the dataset.
Using full dataframe
Building dataset (it may take a while)
Loaded HuggingFace implementation of abhishek/llama-2-7B-hf-small-shards tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Max length of feature 'sentence': 158 (without start and stop symbols)
Setting max length using dataset: 160 (including start and stop symbols)
max sequence length is 160 for feature 'sentence'
Loaded HuggingFace implementation of abhishek/llama-2-7B-hf-small-shards tokenizer
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Building dataset: DONE
Writing preprocessed training set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.training.hdf5
Writing preprocessed validation set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.validation.hdf5
Writing preprocessed test set cache to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.test.hdf5
Writing train set metadata to /home/dpasch01/notebooks/Sentiment Attitude Classification/LLM/13802c08852711ee86afdb4d277d9518.meta.json
Dataset Statistics
╒════════════╤═══════════════╤════════════════════╕
│ Dataset │ Size (Rows) │ Size (In Memory) │
╞════════════╪═══════════════╪════════════════════╡
│ Training │ 7212 │ 852.32 Kb │
├────────────┼───────────────┼────────────────────┤
│ Validation │ 1030 │ 121.83 Kb │
├────────────┼───────────────┼────────────────────┤
│ Test │ 2060 │ 243.54 Kb │
╘════════════╧═══════════════╧════════════════════╛
╒═══════╕
│ MODEL │
╘═══════╛
Warnings and other logs:
To Reproduce
config = {
"adapter": {"type": "lora"},
"quantization": {"bits": 4},
"input_features": [
{
"name": "sentence",
"type": "text",
"preprocessing": {"max_sequence_length": 256},
"encoder": {
"type": "auto_transformer",
#########################################################################
# "pretrained_model_name_or_path": "TinyPixel/Llama-2-7B-bf16-sharded", #
#########################################################################
"pretrained_model_name_or_path": "abhishek/llama-2-7B-hf-small-shards",
"trainable": False,
"properties" : {"type": "lora"},
"adapter": {"type": "lora"},
"quantization": {"bits": 4}
}
}
],
"output_features": [
{
"name": "sentiment",
"type": "category",
}
],
"trainer": {
"epochs": 3,
"learning_rate": 0.00001,
"batch_size": 1,
"eval_batch_size": 2,
"gradient_accumulation_steps": 16,
"enable_gradient_checkpointing": True
}
}
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
Expected behavior
In my understanding, this should be able to load and train the sharded Llama2 on a T4 with 16GB of VRAM.
Environment (please complete the following information):
- OS: Ubuntu
- Version: 22
- Python version: 2.9
- Ludwig version: 0.8
Hello, @dpasch01 and thank you for filing this issue. Could you please confirm your Python version -- is it really 2.9? Also, could you please share your entire configuration, and we will take a look. Thank you very much.
Hello,
Sorry about that. The virtual environment is on Python 3.10.12. Also, what do you mean by "entire configuration"? I am running this on a server with 80 cores of Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, 64GB of RAM, and a 16GB Tesla T4 GPU.
I have a Jupyter notebook running, where the majority of the code prepares the dataset. The only part that produces the above error is the following:
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
Hi @dpasch01 -- thank you for that information. In addition, I was also hoping to see your Ludwig configuration -- the config that you are passing to LudwigModel. Thank you.
Hello @alexsherstinsky,
I am not passing a file. I followed the examples on the website, like in this colab, where the config is passed as a Python dictionary similar to the one I provided above. Here it is again:
config = {
"adapter": {"type": "lora"},
"quantization": {"bits": 4},
"input_features": [
{
"name": "sentence",
"type": "text",
"preprocessing": {"max_sequence_length": 256},
"encoder": {
"type": "auto_transformer",
#########################################################################
# "pretrained_model_name_or_path": "TinyPixel/Llama-2-7B-bf16-sharded", #
#########################################################################
"pretrained_model_name_or_path": "abhishek/llama-2-7B-hf-small-shards",
"trainable": False,
"properties" : {"type": "lora"},
"adapter": {"type": "lora"},
"quantization": {"bits": 4}
}
}
],
"output_features": [
{
"name": "sentiment",
"type": "category",
}
],
"trainer": {
"epochs": 3,
"learning_rate": 0.00001,
"batch_size": 1,
"eval_batch_size": 2,
"gradient_accumulation_steps": 16,
"enable_gradient_checkpointing": True
}
}
import logging
from ludwig.api import LudwigModel
model = LudwigModel(config=config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)
@dpasch01 Thank you -- no problem regarding not passing a file. Looking into it now, and will get back to you within a couple of business days. Thanks.
Hello again, @dpasch01 -- For LLM fine-tuning, I was looking for entries such as:
"model_type": "llm",
"base_model": "abhishek/llama-2-7B-hf-small-shards",
I am not seeing these entries in your configuration dictionary.
How about removing the encoder section entirely, adding what I entered above, and trying again?
Here is an example configuration, which you can customize for your own use case:
import yaml

qlora_fine_tuning_config: dict = yaml.safe_load(
"""
model_type: llm
base_model: abhishek/llama-2-7B-hf-small-shards
input_features:
- name: sentence
type: text
preprocessing:
max_sequence_length: 1024
output_features:
- name: sentiment
type: text
preprocessing:
max_sequence_length: 384
prompt:
template: >-
Summarize the issue/question found in the input text:
### Transcript: {sentence}
### Task Type:
generation:
temperature: 0.1
max_new_tokens: 512
adapter:
type: lora
quantization:
bits: 4
preprocessing:
split:
# type: random
# probabilities: [0.9, 0.05, 0.05]
type: fixed
trainer:
type: finetune
epochs: 3
batch_size: 1
eval_batch_size: 2
gradient_accumulation_steps: 16 # effective batch size = batch size * gradient_accumulation_steps
learning_rate: 2.0e-4
enable_gradient_checkpointing: true
learning_rate_scheduler:
decay: cosine
warmup_fraction: 0.03
reduce_on_plateau: 0
"""
)
This would reframe your problem as a text-to-text fine-tuning task.
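As a minimal sketch of wiring this config into the API (reusing the X_train dataframe from your snippet; purely illustrative):
import logging

from ludwig.api import LudwigModel

# Train with the QLoRA fine-tuning config defined above.
model = LudwigModel(config=qlora_fine_tuning_config, logging_level=logging.INFO)
train_stats = model.train(dataset=X_train)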
Importantly, please make sure that you specify the preprocessing section correctly. Please read the documentation for which "split" column values are expected in the dataset; alternatively, uncomment the values above (for "random") and comment out "fixed". Finally, the prompt text is very important for your use case, so please edit it accordingly.
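For example, here is a hedged sketch of adding a fixed-split column to your dataframe (it assumes Ludwig's convention of a column named "split" with 0 = train, 1 = validation, 2 = test -- please double-check against the documentation for your version):
import numpy as np

# The column name "split" and the 0/1/2 encoding are assumptions to verify
# against the Ludwig docs; here rows are assigned at random just for illustration.
X_train["split"] = np.random.choice([0, 1, 2], size=len(X_train), p=[0.9, 0.05, 0.05])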
If these ideas sound good to you, please try them and let us know. Thank you.
Hello and thank you for your quick response.
I saw this solution for text-to-text fine-tuning. However, my goal is to leverage the power of Llama2 for a particular case of sentiment analysis, which is a multi-class classification task (0 for Neutral, 1 for Negative, and 2 for Positive).
Wouldn't a text-to-text approach overcomplicate training and inference, and significantly increase the resources needed?
I was more interested in your LLM classification approach from here:
input_features:
- name: review
type: text
encoder:
type: auto_transformer
pretrained_model_name_or_path: meta-llama/Llama-2-7b-hf
trainable: true
adapter: lora
output_features:
- name: sentiment
type: category
However, it still throws an OutOfMemory error.
@dpasch01 I would have to investigate this further -- could you please confirm that you are referring to the very last example on the classification page you linked to in your previous message, and to the notebook included as part of that document? If so, I will need a few days, as I would have to run that example (and it may require multiple GPUs). In the meantime, would you be willing to try the approach I described? It should run in a free Google Colab notebook, and we have many examples. On a final note -- we are now running a contest: https://predibase.com/blog/announcing-the-ludwig-10k-giveaway-competition -- please join if you can! Thank you!
Thank you very much! I will try to run the text-to-text approach with the snippet provided. I don't have access to multiple GPUs; I am more interested in why fine-tuning the whole text-to-text Llama2 fits in memory, while using just the encoder does not.
Will try to enter the competition! Thank you!
@dpasch01 Well, that is part of the complexity of the situation -- the documentation potentially calls for multiple GPUs, so I would have to work with a colleague to go through that example thoroughly and figure out what should work and exactly how. In the meantime, the LLM fine-tuning approach will certainly work. I will get back to you here on this case; in the meantime, see you in the Ludwig Slack for the competition! Thank you for using Ludwig!
@dpasch01 Following up on this. I do not think that the entire meta-llama/Llama-2-7b-hf would fit into the memory of a single commodity GPU (like a T4) along with the extra memory required for training and inference. The examples in our documentation use much smaller, 350M-parameter models (like facebook/opt-350m). Please let me know how you would like to continue; otherwise, please close this issue. Thank you.
Hey @alexsherstinsky. This is understandable; however, I am using the sharded version, abhishek/llama-2-7B-hf-small-shards. Shouldn't this work?
Hi @dpasch01! I believe the issue here is that we don't support quantisation for ECD model types, which is what you're trying to train. We only support quantisation for LLM model types, where you can only train text-to-text models.
So, while it is in your config, it's not actually being used. On our end, we can add some more checks to flag this to the user so they know (cc: @alexsherstinsky).
The net effect here is that your model is getting loaded in fp16, which by itself takes about 13GB of GPU memory (and that far exceeds a T4 on Colab, which I think only has 12GB). So that is 13GB for the base model, followed by the memory for the LoRA weights. Then you need headroom for your batch plus gradients, which scales dramatically for a model of this size, since each sample in the batch also needs on the order of 7B bytes of memory, assuming 1 byte per activation per parameter. All of this would easily push memory usage over the limit.
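For a rough sense of the numbers (a back-of-the-envelope sketch that ignores the CUDA context, activations, optimizer state, and allocator overhead):
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9
print(f"fp16 weights:  {params * 2 / 1024**3:.1f} GiB")   # ~13.0 GiB (2 bytes/param)
print(f"4-bit weights: {params * 0.5 / 1024**3:.1f} GiB")  # ~3.3 GiB (0.5 bytes/param)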
Hope this helps!
Thanks for your response @arnavgarg1.
A flag would be nice, yes. I will try with a lighter model, as @alexsherstinsky suggested, just to see what happens.