I'm getting an out-of-memory error. Could the author advise how to resolve it?
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
Hi, I haven't run into a GPU out-of-memory bug before. Could you provide more details? For example, set breakpoints and debug each stage to see how GPU memory usage changes before and after loading the model. Could another process be occupying the GPU memory?
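A minimal sketch of that kind of check (generic PyTorch only, not code from this repo; build_model below is a hypothetical placeholder for however you actually construct the model):

import torch

def report_gpu_memory(tag: str, device: int = 0) -> None:
    # allocated = memory held by live tensors; reserved = memory cached by the allocator
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

report_gpu_memory("before model load")
# model = build_model().cuda()  # hypothetical placeholder for the actual model-loading code
report_gpu_memory("after model load")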
May I ask what tools you generally use for debugging? I'm on the latest version of PyTorch. I'll try switching to the author's 3090 config file.
I usually just run watch -n 0.1 nvidia-smi to watch GPU memory in real time, haha.
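If you would rather check from inside Python instead of a second terminal, a minimal sketch (assuming a PyTorch version that ships torch.cuda.mem_get_info):

import torch

# (free, total) bytes for the current CUDA device
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free {free_bytes / 1024 ** 3:.2f} GiB of {total_bytes / 1024 ** 3:.2f} GiB")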
Has this problem been solved?
I'm running into the same problem.
2023-05-15 09:43:36.712516: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Sanity Val: 20%|██████████████▏ | 1/5 [00:04<00:18, 4.60s/step]| Validation results@0: {'total_loss': 0.2005, 'mse': 0.0599, 'sync': 0.1405}
0step [00:05, ?step/s]
Traceback (most recent call last):
File "/home/llama/GeneFace/tasks/run.py", line 19, in
run_task()
File "/home/llama/GeneFace/tasks/run.py", line 14, in run_task
task_cls.start()
File "/home/llama/GeneFace/utils/commons/base_task.py", line 251, in start
trainer.fit(cls)
File "/home/llama/GeneFace/utils/commons/trainer.py", line 122, in fit
self.run_single_process(self.task)
File "/home/llama/GeneFace/utils/commons/trainer.py", line 186, in run_single_process
self.train()
File "/home/llama/GeneFace/utils/commons/trainer.py", line 284, in train
for batch_idx, batch in enumerate(train_pbar):
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/tqdm/std.py", line 1178, in iter
for obj in iterable:
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in next
data = self._next_data()
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
data = pin_memory(data)
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
return type(data)({k: pin_memory(sample) for k, sample in data.items()}) # type: ignore[call-arg]
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in
return type(data)({k: pin_memory(sample) for k, sample in data.items()}) # type: ignore[call-arg]
File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
return data.pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
GPU: 2080TI 22G
To add: while it was running, I checked and the GPU still had roughly 20 GB of free memory.
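For reference, the traceback above fails inside the DataLoader's pin-memory thread (data.pin_memory() allocates page-locked host buffers), so one generic PyTorch workaround to test, not something prescribed by GeneFace, is building the DataLoader with pin_memory disabled and fewer workers:

from torch.utils.data import DataLoader

# dataset and batch_size are placeholders for whatever the task actually uses
train_loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,
    pin_memory=False,  # skip the pin-memory thread that raised the CUDA OOM above
)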
Find all the PIDs still occupying GPU memory, then kill them with kill -9 PID.
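A small sketch of that, assuming nvidia-smi is on PATH (the query flags are standard nvidia-smi options; the PID below is a placeholder, double-check it before killing anything):

import os
import signal
import subprocess

# list the PIDs of processes currently holding GPU memory
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv,noheader"],
    text=True,
)
print(out)

# equivalent of `kill -9 PID` for a stale process found above
# os.kill(12345, signal.SIGKILL)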