MiniGPT-4
RuntimeError: CUDA out of memory.
Can anyone help me with this error?
RuntimeError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 22.20 GiB total capacity; 21.35 GiB already allocated; 64.12 MiB free; 21.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11203) of binary: ~/miniconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
  File "~/miniconda3/envs/minigpt4/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "~/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "~/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "~/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "~/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "~/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures: <NO_OTHER_FAILURES>
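The error message itself suggests one mitigation: when reserved memory is much larger than allocated memory, capping the allocator's split size can reduce fragmentation. A minimal sketch of setting this via `PYTORCH_CUDA_ALLOC_CONF` (the 128 MiB cap is an illustrative guess, not a value from this thread):

```python
import os

# Must be set before torch initializes CUDA for it to take effect.
# 128 MiB is an illustrative cap; tune for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # max_split_size_mb:128
```

The same setting can be exported in the shell before launching torchrun instead of setting it in Python.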
Update: GPU information:
Thu Apr 27 18:02:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 On | 00000000:00:07.0 Off | 0 |
| 0% 27C P8 9W / 150W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have solved the problem by modifying the train_configs/minigpt4_stage2_finetune.yaml
as below:
iters_per_epoch: 20
batch_size_train: 2
batch_size_eval: 4
num_workers: 2
warmup_steps: 20
I didn't dig deeper to find the most efficient combination; there may still be room for tuning.
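For intuition on why shrinking batch_size_train helps: the fixed cost (weights, gradients, optimizer state) occupies most of the card, and only the remainder is available for activations, which grow roughly linearly with batch size. A back-of-the-envelope helper (all MiB figures below are hypothetical, not measured on this setup):

```python
def max_batch_size(total_mib: int, fixed_mib: int, per_sample_mib: int) -> int:
    """Largest batch that fits once fixed costs (weights, grads, optimizer) are paid."""
    return max(0, (total_mib - fixed_mib) // per_sample_mib)

# Hypothetical split of a 22 GiB A10: ~16 GiB fixed, ~2.5 GiB of activations per sample.
print(max_batch_size(22731, 16384, 2560))  # 2
```

Under these assumed numbers only a batch of 2 fits, which is consistent with the config change above; the real per-sample cost depends on model, precision, and sequence length.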
Hello, may I ask which CUDA and torch versions you are using? With torch 2.0 and CUDA 11.7, the original finetune runs on 8 GPUs with bs=12, but with torch 1.13 and CUDA 11.6 I can only finetune with bs=2, as you said.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10 On | 00000000:00:07.0 Off | 0 |
| 0% 35C P0 40W / 150W| 4MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1024 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
Torch version:
torch 2.0.0 pypi_0 pypi
torchaudio 0.13.1 py39_cu117 pytorch
torchvision 0.14.1 py39_cu117 pytorch
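One observation (my reading of the listing above, not confirmed in the thread): the `py39_cu117` build tags suggest torchaudio 0.13.1 and torchvision 0.14.1 were built against torch 1.13, while torch 2.0.0 normally ships with torchvision 0.15 and torchaudio 2.0, so this environment may be mixing release lines. A small lookup sketch of the usual pairings:

```python
# Usual companion minor versions for recent torch releases:
# torch minor -> (torchvision minor, torchaudio minor)
TORCH_COMPANIONS = {
    "1.12": ("0.13", "0.12"),
    "1.13": ("0.14", "0.13"),
    "2.0": ("0.15", "2.0"),
}

def expected_companions(torch_minor: str):
    """Return the (torchvision, torchaudio) minors that normally ship with a torch minor."""
    return TORCH_COMPANIONS.get(torch_minor)

print(expected_companions("2.0"))  # ('0.15', '2.0')
```

By this table, torchvision 0.14.1 / torchaudio 0.13.1 belong to the torch 1.13 line, so reinstalling the matching 2.0-line builds may be worth trying before debugging further.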