A lot of people are reporting OOM here, but I cannot reproduce it after re-pulling the code. I suspect the difference lies in the system environment; I am still investigating.
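One way to narrow down an environment difference is to have each machine dump its setup with PyTorch's built-in environment report and diff the outputs; this is standard PyTorch tooling, not something specific to this repo:

```python
# Dump torch / CUDA / driver / GPU / pip package info so two environments can be diffed.
# Equivalent to running `python -m torch.utils.collect_env` from the shell.
from torch.utils.collect_env import main

if __name__ == "__main__":
    main()
```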
Traceback (most recent call last):
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in
train()
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train.py", line 1193, in train
trainer.train()
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 267, in init
self._configure_distributed_model(model)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1048, in _configure_distributed_model
self.module.to(self.device)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
return super().to(*args, **kwargs)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 4 has a total capacty of 79.11 GiB of which 18.69 MiB is free. Process 158998 has 79.08 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 244.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-08 13:34:24,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1421135
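A note on the allocator hint at the end of that traceback: only ~245 MiB is "reserved by PyTorch but unallocated" here, so the card is essentially full and max_split_size_mb is unlikely to help by itself; but if you do want to try it, PYTORCH_CUDA_ALLOC_CONF has to be set before the first CUDA allocation. A minimal sketch (the 128 MiB value is illustrative, not a setting from this repo):

```python
# Configure the CUDA caching allocator before torch touches the GPU.
# max_split_size_mb mainly helps when "reserved but unallocated" memory is large
# (fragmentation); the value here is illustrative only.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var is set so the allocator picks it up

print(torch.cuda.is_available())
```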
How many GPUs are you using? It looks like the crash happens during DeepSpeed initialization; that is generally unrelated to the model itself, so batch size=1 won't change the outcome.
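To see how many GPUs the launcher can actually use and how much memory is free on each card before DeepSpeed moves the model (the self.module.to(self.device) step that fails in the traceback), here is a quick check with plain torch APIs:

```python
# Print per-GPU free/total memory; run this before launching training.
# A stale process still holding memory on a card shows up here as low free memory.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returned in bytes
    print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```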
@LinB203
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.9910638332366943 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.8254106044769287 seconds
Traceback (most recent call last):
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train_mem.py", line 21, in
train()
File "/export/App/training_platform/PinoModel/omni-llava/llava/train/train.py", line 1191, in train
trainer.train()
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/trainer.py", line 1672, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 267, in init
self._configure_distributed_model(model)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1048, in _configure_distributed_model
self.module.to(self.device)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
return super().to(*args, **kwargs)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/opt/conda/envs/omni/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 2 has a total capacty of 79.11 GiB of which 18.69 MiB is free. Process 3425768 has 79.08 GiB memory in use. Of the allocated memory 78.33 GiB is allocated by PyTorch, and 244.83 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-08 16:17:55,409] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648081
[2024-01-08 16:18:04,889] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648082
[2024-01-08 16:18:08,382] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648083
[2024-01-08 16:18:08,383] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648084
[2024-01-08 16:18:13,057] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3648085
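A side note on the "[WARNING] cpu_adam cuda is missing or is incompatible with installed torch" line in this log: it usually means the local CUDA toolkit does not match the CUDA version torch was built with, so the JIT-compiled op falls back to CPU-only code. A quick way to see the versions involved (plain torch/stdlib introspection, nothing repo-specific):

```python
# Show the versions DeepSpeed compares when JIT-building cpu_adam:
# the CUDA version torch was built with vs. the nvcc found on PATH.
import shutil
import subprocess
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())

nvcc = shutil.which("nvcc")
print("nvcc on PATH:", nvcc)
if nvcc:
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```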
Hi, we have reorganized the code and now support LoRA fine-tuning; see finetune_lora.sh. Unfortunately we still can't use ZeRO-3, and we suspect that DeepSpeed doesn't handle the load imbalance between GPUs very well.
Have you been able to fix ZeRO-3? I'm getting an error from get_peft_model().
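For reference, get_peft_model() is the peft entry point used for LoRA wrapping; a minimal standalone sketch looks like the following (the base model, rank, and target_modules are illustrative placeholders, not the settings from finetune_lora.sh), which can help check whether the error comes from peft itself or from the surrounding training code:

```python
# Minimal LoRA wrapping with peft; hyperparameters and target_modules are
# illustrative placeholders, not the values used by finetune_lora.sh.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn"],  # projection names differ per architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```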