
QLoRa Fine Tuning Mixtral 8x7b or 34b models OOM on 2x24GB Titan RTX

Nero10578 opened this issue 1 year ago • 6 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous bug reports and didn't find any similar reports.

Expected Behavior

I would expect training Mixtral 8x7b or 34b/33b models to fit within the VRAM of 2x24GB cards, since QLoRA 4-bit fine-tuning of Mixtral 8x7b should only require about 32GB and multi-GPU training can use DeepSpeed. At least that's what LLaMA-Factory claims, so I assume it works with their repo? https://github.com/hiyouga/LLaMA-Factory#hardware-requirement

Current behaviour

Flash Attention 2 is currently not supported on Turing GPUs like my Titan RTX cards, so I have to use xformers. When I run training, even with a minimal sequence length and micro batch size, the memory on both GPUs fills up at the same time and the run OOMs right at the start. Is this behaviour expected when not using flash attention? DeepSpeed ZeRO-2 and ZeRO-3 both still OOM, and of course running without DeepSpeed also OOMs. I am running on Windows 10 WSL2 Ubuntu.

(axolotl) owen@COMPUTE-PC:~/train-mixtral$ accelerate launch -m axolotl.cli.train qlora-mixtral-sunda.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[2023-12-23 20:08:48,367] [INFO] [datasets.<module>:58] [PID:2954421] PyTorch version 2.1.1+cu121 available.
[2023-12-23 20:08:48,368] [INFO] [datasets.<module>:58] [PID:2954422] PyTorch version 2.1.1+cu121 available.
[2023-12-23 20:08:49,402] [WARNING] [axolotl.validate_config:250] [PID:2954421] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-23 20:08:49,408] [INFO] [axolotl.normalize_config:150] [PID:2954421] [RANK:0] GPU memory usage baseline: 0.000GB (+0.648GB misc)
[2023-12-23 20:08:49,417] [WARNING] [axolotl.validate_config:250] [PID:2954422] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-23 20:08:49,419] [INFO] [axolotl.normalize_config:150] [PID:2954422] [RANK:1] GPU memory usage baseline: 0.000GB (+0.300GB misc)
[2023-12-23 20:08:49,423] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 20:08:49,435] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-23 20:08:49,774] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-23 20:08:49,774] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



[2023-12-23 20:08:49,776] [WARNING] [axolotl.scripts.check_user_token:358] [PID:2954421] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2023-12-23 20:08:49,785] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-23 20:08:49,971] [DEBUG] [axolotl.load_tokenizer:184] [PID:2954421] [RANK:0] EOS: 2 / </s>
[2023-12-23 20:08:49,971] [DEBUG] [axolotl.load_tokenizer:185] [PID:2954421] [RANK:0] BOS: 1 / <s>
[2023-12-23 20:08:49,971] [DEBUG] [axolotl.load_tokenizer:186] [PID:2954421] [RANK:0] PAD: 2 / </s>
[2023-12-23 20:08:49,971] [DEBUG] [axolotl.load_tokenizer:187] [PID:2954421] [RANK:0] UNK: 0 / <unk>
[2023-12-23 20:08:49,972] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:2954421] [RANK:0] Unable to find prepared dataset in /home/owen/tools/last_run_prepared/09f2d4e2c000dbf02c65a3c67cd65057
[2023-12-23 20:08:49,972] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:2954421] [RANK:0] Loading raw datasets...
[2023-12-23 20:08:49,972] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:2954421] [RANK:0] No seed provided, using default seed of 42
[2023-12-23 20:08:50,082] [WARNING] [axolotl.scripts.check_user_token:358] [PID:2954422] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2023-12-23 20:08:50,277] [DEBUG] [axolotl.load_tokenizer:184] [PID:2954422] [RANK:1] EOS: 2 / </s>
[2023-12-23 20:08:50,277] [DEBUG] [axolotl.load_tokenizer:185] [PID:2954422] [RANK:1] BOS: 1 / <s>
[2023-12-23 20:08:50,277] [DEBUG] [axolotl.load_tokenizer:186] [PID:2954422] [RANK:1] PAD: 2 / </s>
[2023-12-23 20:08:50,277] [DEBUG] [axolotl.load_tokenizer:187] [PID:2954422] [RANK:1] UNK: 0 / <unk>
Map (num_proc=40): 100%|████████████████████████████████████████████████| 55307/55307 [00:02<00:00, 24688.37 examples/s]
Map (num_proc=40): 100%|███████████████████████████████████████████████████| 2515/2515 [00:00<00:00, 6919.98 examples/s]
[2023-12-23 20:08:53,968] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:2954421] [RANK:0] merging datasets
[2023-12-23 20:08:53,991] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:2954421] [RANK:0] shuffle merged datasets
[2023-12-23 20:08:54,017] [INFO] [axolotl.load_tokenized_prepared_datasets:369] [PID:2954421] [RANK:0] Saving merged prepared dataset to disk... /home/owen/tools/last_run_prepared/09f2d4e2c000dbf02c65a3c67cd65057
Saving the dataset (1/1 shards): 100%|█████████████████████████████████| 70472/70472 [00:00<00:00, 114127.59 examples/s]
[2023-12-23 20:08:56,293] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:2954422] [RANK:1] Loading prepared dataset from disk at /home/owen/tools/last_run_prepared/09f2d4e2c000dbf02c65a3c67cd65057...
[2023-12-23 20:08:56,297] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:2954422] [RANK:1] Prepared dataset loaded from disk...
Filter (num_proc=40): 100%|█████████████████████████████████████████████| 66948/66948 [00:00<00:00, 69838.03 examples/s]
Filter (num_proc=40): 100%|████████████████████████████████████████████████| 3524/3524 [00:00<00:00, 7597.29 examples/s]
[2023-12-23 20:08:58,882] [DEBUG] [axolotl.log:60] [PID:2954421] [RANK:0] total_num_tokens: 7809518
Filter (num_proc=40):  52%|███████████████████████                     | 35000/66948 [00:00<00:00, 103200.12 examples/s][2023-12-23 20:08:59,678] [DEBUG] [axolotl.log:60] [PID:2954421] [RANK:0] `total_supervised_tokens: 7809518`
[2023-12-23 20:08:59,679] [DEBUG] [axolotl.log:60] [PID:2954421] [RANK:0] total_num_steps: 8369
[2023-12-23 20:08:59,689] [DEBUG] [axolotl.train.log:60] [PID:2954421] [RANK:0] loading tokenizer... /home/owen/models/mistralai_Mixtral-8x7B-v0.1
Filter (num_proc=40): 100%|█████████████████████████████████████████████| 66948/66948 [00:00<00:00, 90529.83 examples/s]
[2023-12-23 20:08:59,887] [DEBUG] [axolotl.load_tokenizer:184] [PID:2954421] [RANK:0] EOS: 2 / </s>
[2023-12-23 20:08:59,887] [DEBUG] [axolotl.load_tokenizer:185] [PID:2954421] [RANK:0] BOS: 1 / <s>
[2023-12-23 20:08:59,887] [DEBUG] [axolotl.load_tokenizer:186] [PID:2954421] [RANK:0] PAD: 2 / </s>
[2023-12-23 20:08:59,887] [DEBUG] [axolotl.load_tokenizer:187] [PID:2954421] [RANK:0] UNK: 0 / <unk>
[2023-12-23 20:08:59,887] [DEBUG] [axolotl.train.log:60] [PID:2954421] [RANK:0] loading model and peft_config...
Filter (num_proc=40): 100%|███████████████████████████████████████████████| 3524/3524 [00:00<00:00, 11250.89 examples/s]
[2023-12-23 20:09:01,426] [DEBUG] [axolotl.load_tokenizer:184] [PID:2954422] [RANK:1] EOS: 2 / </s>
[2023-12-23 20:09:01,426] [DEBUG] [axolotl.load_tokenizer:185] [PID:2954422] [RANK:1] BOS: 1 / <s>
[2023-12-23 20:09:01,426] [DEBUG] [axolotl.load_tokenizer:186] [PID:2954422] [RANK:1] PAD: 2 / </s>
[2023-12-23 20:09:01,426] [DEBUG] [axolotl.load_tokenizer:187] [PID:2954422] [RANK:1] UNK: 0 / <unk>
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 19/19 [01:07<00:00,  3.53s/it]
[2023-12-23 20:10:10,400] [ERROR] [axolotl.load_model:478] [PID:2954421] [RANK:0] CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Process 2954422 has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 469, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3773, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/big_modeling.py", line 396, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 507, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 253, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 311, in set_module_tensor_to_device
    new_value = old_value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Process 2954422 has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/owen/axolotl/src/axolotl/train.py", line 62, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 479, in load_model
    raise err
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 469, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3773, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/big_modeling.py", line 396, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 507, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 253, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 311, in set_module_tensor_to_device
    new_value = old_value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Process 2954422 has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 19/19 [01:07<00:00,  3.58s/it]
[2023-12-23 20:10:12,450] [ERROR] [axolotl.load_model:478] [PID:2954422] [RANK:1] CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 469, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3773, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/big_modeling.py", line 396, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 507, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 253, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 311, in set_module_tensor_to_device
    new_value = old_value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/owen/axolotl/src/axolotl/train.py", line 62, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 479, in load_model
    raise err
  File "/home/owen/axolotl/src/axolotl/utils/models.py", line 469, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3773, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/big_modeling.py", line 396, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 507, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 253, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 311, in set_module_tensor_to_device
    new_value = old_value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 24.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 121.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-12-23 20:10:13,458] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2954422 closing signal SIGTERM
[2023-12-23 20:10:13,923] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2954421) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-23_20:10:13
  host      : COMPUTE-PC.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2954421)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Steps to reproduce

Try to train any model larger than 13b on 2x24GB GPUs. This may only happen on Turing GPUs, which don't support Flash Attention 2.

Config yaml

base_model: /home/owen/models/mistralai_Mixtral-8x7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /home/owen/datasets/sunda-wiki-cleaned.jsonl
    type: completion
  - path: /home/owen/datasets/sunda-twitter.jsonl
    type: completion

dataset_prepared_path: /home/owen/tools/last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-out

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
#  - lm_head.*
#  - model.embed_tokens.*
#  - model.layers.2[0-9]+.block_sparse_moe.gate.*
#  - model.layers.2[0-9]+.block_sparse_moe.experts.*
#  - model.layers.3[0-9]+.block_sparse_moe.gate.*
#  - model.layers.3[0-9]+.block_sparse_moe.experts.*

model_config:
  output_router_logits: true

save_safetensors: true

adapter: qlora
lora_model_dir: 

sequence_len: 256
sample_packing: false
pad_to_sequence_len: true

lora_r: 8
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: false
lora_fan_in_fan_out:
lora_target_modules:
#  - gate
  - q_proj
#  - k_proj
  - v_proj
#  - o_proj
#  - w1
#  - w2
#  - w3

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: false
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention: false
sdp_attention: false

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_table_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed: /home/owen/tools/zero2.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main/628b754824008f2d7c1aad079925a1d8e8cf9f48

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

Nero10578 avatar Dec 24 '23 04:12 Nero10578

Deepspeed ZeRO-2 with DDP using dual GPUs does not shard the parameters across GPUs. DDP will attempt to load the full parameters on each GPU. You'll want to use Zero3, or naive model parallelism by launching a single process with python.
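
For example, roughly (a sketch reusing the config file name and the zero3 deepspeed path that appear elsewhere in this thread, not a verified recipe):

# shard parameters across GPUs with ZeRO-3 instead of ZeRO-2:
accelerate launch -m axolotl.cli.train --deepspeed deepspeed/zero3.json qlora-mixtral-sunda.yml
# or use naive model parallelism with a single process (no accelerate/DDP):
python -m axolotl.cli.train qlora-mixtral-sunda.yml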

winglian avatar Dec 24 '23 15:12 winglian

Deepspeed ZeRO-2 with DDP using dual GPUs does not shard the parameters across GPUs. DDP will attempt to load the full parameters on each GPU. You'll want to use Zero3, or naive model parallelism by launching a single process with python.

I tried Zero3 and got the same OOM while loading at the beginning. I also just got some RTX 3090s, so I can use Flash Attention 2 and sample packing, and it still OOMs on DeepSpeed ZeRO-3 with Mixtral and 70b models as well.

How do I use naive model parallelism in axolotl? Or does it have to be a manually created python training script?

Nero10578 avatar Dec 25 '23 13:12 Nero10578

Deepspeed ZeRO-2 with DDP using dual GPUs does not shard the parameters across GPUs. DDP will attempt to load the full parameters on each GPU. You'll want to use Zero3, or naive model parallelism by launching a single process with python.

Ok, so I realised I can just run without accelerate to use naive MP, but then I get this error about not all tensors being on the same device. All I did was launch with python instead of accelerate and set device_map: sequential and max_memory: {0: "20GIB", 1: "24GIB"} (a sketch of these changes is shown below, followed by the resulting traceback).
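
For reference, a minimal sketch of those changes, assuming they go into the same qlora-mixtral-sunda.yml from the original report (the key names follow the config posted later in this thread):

# launched with a single process instead of accelerate:
#   python -m axolotl.cli.train qlora-mixtral-sunda.yml
# config additions for naive model parallelism:
device_map: sequential
max_memory: {0: "20GIB", 1: "24GIB"}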

warnings.warn(
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/owen/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/owen/axolotl/src/axolotl/core/trainer_builder.py", line 291, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
    return model_forward(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/peft/peft_model.py", line 977, in forward
    return self.base_model(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 106, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1258, in forward
    loss += self.router_aux_loss_coef * aux_loss
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
  0%|                                                                                          | 0/8913 [00:03<?, ?it/s]

Nero10578 avatar Dec 25 '23 22:12 Nero10578

FTR, I am seeing the expected-tensors issue on my end as well. This claims to fix a similar-looking issue, but it doesn't seem to do the trick for me.

Update: this is no longer the case for me with the latest axolotl.

kallewoof avatar Dec 27 '23 05:12 kallewoof

I am having the same issue while using DeepSpeed ZeRO-3, running on 1x 3090 and 1x 4060 (40GB VRAM total).

Output:

docker run --privileged --gpus '"all"' --shm-size 10g --rm --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=volume,src=axolotl,target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface  -v ${HOME}:/workspace/axolotl/home winglian/axolotl:main-py3.10-cu118-2.0.1 accelerate launch -m axolotl.cli.train --deepspeed deepspeed/zero3.json home/Jupyter/work/Train/lora.yml


==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:106: UserWarning:

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading CUDA version: BNB_CUDA_VERSION=118
================================================================================


  warn((f'\n\n{"="*80}\n'
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:106: UserWarning:

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading CUDA version: BNB_CUDA_VERSION=118
================================================================================


  warn((f'\n\n{"="*80}\n'
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



[2023-12-28 09:53:03,318] [WARNING] [axolotl.validate_config:250] [PID:33] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-28 09:53:03,319] [WARNING] [axolotl.validate_config:250] [PID:34] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-28 09:53:03,529] [INFO] [axolotl.normalize_config:150] [PID:33] [RANK:0] GPU memory usage baseline: 0.000GB (+0.389GB misc)
[2023-12-28 09:53:03,530] [WARNING] [axolotl.scripts.check_user_token:358] [PID:33] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2023-12-28 09:53:03,694] [INFO] [axolotl.normalize_config:150] [PID:34] [RANK:1] GPU memory usage baseline: 0.000GB (+0.302GB misc)
[2023-12-28 09:53:03,694] [WARNING] [axolotl.scripts.check_user_token:358] [PID:34] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2023-12-28 09:53:03,994] [DEBUG] [axolotl.load_tokenizer:166] [PID:33] [RANK:0] EOS: 2 / </s>
[2023-12-28 09:53:03,994] [DEBUG] [axolotl.load_tokenizer:167] [PID:33] [RANK:0] BOS: 1 / <s>
[2023-12-28 09:53:03,994] [DEBUG] [axolotl.load_tokenizer:168] [PID:33] [RANK:0] PAD: 2 / </s>
[2023-12-28 09:53:03,994] [DEBUG] [axolotl.load_tokenizer:169] [PID:33] [RANK:0] UNK: 0 / <unk>
[2023-12-28 09:53:03,994] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:33] [RANK:0] Unable to find prepared dataset in last_run_prepared/8887c8b737ea66da5b9f500be70f0a42
[2023-12-28 09:53:03,994] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:33] [RANK:0] Loading raw datasets...
[2023-12-28 09:53:03,994] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:33] [RANK:0] No seed provided, using default seed of 42
[2023-12-28 09:53:04,057] [DEBUG] [axolotl.load_tokenizer:166] [PID:34] [RANK:1] EOS: 2 / </s>
[2023-12-28 09:53:04,057] [DEBUG] [axolotl.load_tokenizer:167] [PID:34] [RANK:1] BOS: 1 / <s>
[2023-12-28 09:53:04,057] [DEBUG] [axolotl.load_tokenizer:168] [PID:34] [RANK:1] PAD: 2 / </s>
[2023-12-28 09:53:04,058] [DEBUG] [axolotl.load_tokenizer:169] [PID:34] [RANK:1] UNK: 0 / <unk>
[2023-12-28 09:53:06,156] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:33] [RANK:0] merging datasets
[2023-12-28 09:53:06,162] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:33] [RANK:0] shuffle merged datasets
[2023-12-28 09:53:06,170] [INFO] [axolotl.load_tokenized_prepared_datasets:369] [PID:33] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/8887c8b737ea66da5b9f500be70f0a42
Saving the dataset (1/1 shards): 100%|██████████| 44025/44025 [00:05<00:00, 8680.60 examples/s]
[2023-12-28 09:53:12,343] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:34] [RANK:1] Unable to find prepared dataset in last_run_prepared/8887c8b737ea66da5b9f500be70f0a42
[2023-12-28 09:53:12,343] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:34] [RANK:1] Loading raw datasets...
[2023-12-28 09:53:12,343] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:34] [RANK:1] No seed provided, using default seed of 42
[2023-12-28 09:53:13,844] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:34] [RANK:1] merging datasets
[2023-12-28 09:53:13,850] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:34] [RANK:1] shuffle merged datasets
[2023-12-28 09:53:14,597] [DEBUG] [axolotl.log:60] [PID:33] [RANK:0] total_num_tokens: 29114292
[2023-12-28 09:53:16,888] [DEBUG] [axolotl.log:60] [PID:33] [RANK:0] `total_supervised_tokens: 28820613`
[2023-12-28 09:53:24,005] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:33] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 14557146
[2023-12-28 09:53:24,005] [DEBUG] [axolotl.log:60] [PID:33] [RANK:0] data_loader_len: 28821
[2023-12-28 09:53:24,221] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:34] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 14557146
[2023-12-28 09:53:24,258] [INFO] [axolotl.log:60] [PID:33] [RANK:0] sample_packing_eff_est across ranks: [0.810824990272522, 0.8108927011489868]
[2023-12-28 09:53:24,259] [DEBUG] [axolotl.log:60] [PID:33] [RANK:0] sample_packing_eff_est: 0.82
[2023-12-28 09:53:24,259] [DEBUG] [axolotl.log:60] [PID:33] [RANK:0] total_num_steps: 14410
[2023-12-28 09:53:24,265] [DEBUG] [axolotl.train.log:60] [PID:33] [RANK:0] loading tokenizer... cognitivecomputations/dolphin-2.5-mixtral-8x7b
[2023-12-28 09:53:24,626] [DEBUG] [axolotl.load_tokenizer:166] [PID:33] [RANK:0] EOS: 2 / </s>
[2023-12-28 09:53:24,626] [DEBUG] [axolotl.load_tokenizer:167] [PID:33] [RANK:0] BOS: 1 / <s>
[2023-12-28 09:53:24,626] [DEBUG] [axolotl.load_tokenizer:168] [PID:33] [RANK:0] PAD: 2 / </s>
[2023-12-28 09:53:24,626] [DEBUG] [axolotl.load_tokenizer:169] [PID:33] [RANK:0] UNK: 0 / <unk>
[2023-12-28 09:53:24,626] [DEBUG] [axolotl.train.log:60] [PID:33] [RANK:0] loading model and peft_config...
[2023-12-28 09:53:24,643] [DEBUG] [axolotl.load_tokenizer:166] [PID:34] [RANK:1] EOS: 2 / </s>
[2023-12-28 09:53:24,643] [DEBUG] [axolotl.load_tokenizer:167] [PID:34] [RANK:1] BOS: 1 / <s>
[2023-12-28 09:53:24,643] [DEBUG] [axolotl.load_tokenizer:168] [PID:34] [RANK:1] PAD: 2 / </s>
[2023-12-28 09:53:24,644] [DEBUG] [axolotl.load_tokenizer:169] [PID:34] [RANK:1] UNK: 0 / <unk>
[2023-12-28 09:53:24,746] [INFO] [axolotl.load_model:249] [PID:33] [RANK:0] patching with flash attention
[2023-12-28 09:53:24,747] [INFO] [axolotl.load_model:261] [PID:33] [RANK:0] patching with flash attention
[2023-12-28 09:53:24,764] [INFO] [axolotl.load_model:249] [PID:34] [RANK:1] patching with flash attention
[2023-12-28 09:53:24,764] [INFO] [axolotl.load_model:261] [PID:34] [RANK:1] patching with flash attention
Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards:  63%|██████▎   | 12/19 [11:28<06:41, 57.38s/it]
[2023-12-28 10:04:57,187] [ERROR] [axolotl.load_model:453] [PID:34] [RANK:1] CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 15.71 GiB total capacity; 15.39 GiB already allocated; 24.31 MiB free; 15.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 444, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4116, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 786, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 191, in to
    return self.cuda(device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 169, in cuda
    w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/functional.py", line 934, in quantize_4bit
    out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 15.71 GiB total capacity; 15.39 GiB already allocated; 24.31 MiB free; 15.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 62, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 454, in load_model
    raise err
  File "/workspace/axolotl/src/axolotl/utils/models.py", line 444, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3706, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4116, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/modeling_utils.py", line 786, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 98, in set_module_quantized_tensor_to_device
    new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).to(device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 191, in to
    return self.cuda(device)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 169, in cuda
    w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, compress_statistics=self.compress_statistics, quant_type=self.quant_type)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/functional.py", line 934, in quantize_4bit
    out = torch.zeros(((n+1)//2, 1), dtype=torch.uint8, device=A.device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 1; 15.71 GiB total capacity; 15.39 GiB already allocated; 24.31 MiB free; 15.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Loading checkpoint shards:  74%|███████▎  | 14/19 [12:28<04:39, 55.88s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 33 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 34) of binary: /root/miniconda3/envs/py3.10/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-28_10:05:57
  host      : 217b5bc7b838
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 34)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Lora.yml:

base_model: cognitivecomputations/dolphin-2.5-mixtral-8x7b
model_type: AutoModelForCausalLM
tokenizer_type: CodeLlamaTokenizer
trust_remote_code: true
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
#
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./home/Jupyter/work/Train/lora-out

sequence_len: 1000
sample_packing: true
pad_to_sequence_len: true
max_memory: {0: "18GIB", 1: "14GIB"}
device_map: sequential
quantization_config:
  load_in_8bit_fp32_cpu_offload: True

adapter: qlora
lora_model_dir:
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed: deepspeed/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

nvitop output (VRAM is nicely shared; this is before the OOM): [screenshot]

Taronyuu avatar Dec 28 '23 10:12 Taronyuu

Latest axolotl does not produce this error for me anymore.

kallewoof avatar Feb 07 '24 04:02 kallewoof

@kallewoof, thanks for checking. Will close this for now. If the issue comes back, please comment/reopen.

NanoCode012 avatar Mar 30 '24 18:03 NanoCode012

I'm still having the same issue. I'm using 4x24GB cards, and OOM occurs even with a DeepSpeed ZeRO-2 or ZeRO-3 config. The messages are almost the same as @Nero10578's. Has anything changed?

rhksdn2314 avatar Apr 02 '24 01:04 rhksdn2314

@rhksdn2314 I have fine-tuned Mixtral 8x7b on 2x3090 cards using axolotl without OOMing.

What rank/context length are you using?

kallewoof avatar Apr 02 '24 02:04 kallewoof