
My machine has an RTX 3090. I ran: CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config=egs/datasets/videos/May/lm3d_postnet_sync.yaml --exp_name=May/postnet

zxc524580210 opened this issue 2 years ago

I get an out-of-memory error. How can this be resolved?
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

zxc524580210 avatar Apr 20 '23 15:04 zxc524580210

Hi, I haven't run into an out-of-GPU-memory bug before. Could you provide more details? For example, set breakpoints and debug each stage, checking how GPU memory usage changes before and after the model is loaded. Could another process be occupying the GPU memory?
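A minimal sketch of that kind of check (generic PyTorch, not GeneFace code; `build_model()` is a hypothetical stand-in for whatever constructs the network):

```python
import torch

def log_gpu_mem(tag: str, device: int = 0):
    # Free/total memory as reported by the CUDA driver, plus what PyTorch itself holds.
    free, total = torch.cuda.mem_get_info(device)
    allocated = torch.cuda.memory_allocated(device)
    print(f"[{tag}] free={free / 2**30:.2f} GiB, "
          f"total={total / 2**30:.2f} GiB, "
          f"allocated={allocated / 2**30:.2f} GiB")

log_gpu_mem("before model load")
model = build_model().cuda()  # hypothetical constructor; replace with the actual one
log_gpu_mem("after model load")
```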

yerfor avatar Apr 21 '23 04:04 yerfor

What tools do you usually use to debug this? I'm on the latest version of PyTorch. I'll try switching to the author's 3090 config file.

zxc524580210 avatar Apr 21 '23 06:04 zxc524580210

My local environment is WSL + Ubuntu.

zxc524580210 avatar Apr 21 '23 06:04 zxc524580210

I usually just run `watch -n 0.1 nvidia-smi` to watch GPU memory in real time, haha.

yerfor avatar Apr 21 '23 06:04 yerfor

Nice, I'll give it a try. Thanks.

zxc524580210 avatar Apr 21 '23 08:04 zxc524580210

Has this been resolved? I'm hitting the same problem.

2023-05-15 09:43:36.712516: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Sanity Val: 20%|██████████████▏ | 1/5 [00:04<00:18, 4.60s/step]
| Validation results@0: {'total_loss': 0.2005, 'mse': 0.0599, 'sync': 0.1405}
0step [00:05, ?step/s]
Traceback (most recent call last):
  File "/home/llama/GeneFace/tasks/run.py", line 19, in <module>
    run_task()
  File "/home/llama/GeneFace/tasks/run.py", line 14, in run_task
    task_cls.start()
  File "/home/llama/GeneFace/utils/commons/base_task.py", line 251, in start
    trainer.fit(cls)
  File "/home/llama/GeneFace/utils/commons/trainer.py", line 122, in fit
    self.run_single_process(self.task)
  File "/home/llama/GeneFace/utils/commons/trainer.py", line 186, in run_single_process
    self.train()
  File "/home/llama/GeneFace/utils/commons/trainer.py", line 284, in train
    for batch_idx, batch in enumerate(train_pbar):
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
    return type(data)({k: pin_memory(sample) for k, sample in data.items()})  # type: ignore[call-arg]
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <dictcomp>
    return type(data)({k: pin_memory(sample) for k, sample in data.items()})  # type: ignore[call-arg]
  File "/home/llama/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
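For context, the failure above is raised from the DataLoader's pin-memory thread, which copies each batch into page-locked host memory via `data.pin_memory()`. A generic PyTorch sketch of where that option enters (GeneFace's actual dataloader setup may differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(64, 3))
loader = DataLoader(
    ds,
    batch_size=8,
    num_workers=2,
    pin_memory=True,  # the failing data.pin_memory() call in the traceback comes from this option
)
# If this step hits "CUDA error: out of memory", common things to try are
# pin_memory=False, a smaller batch_size, or freeing memory held by other processes.
```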

GPU: 2080 Ti (22 GB)

Joyshub avatar May 15 '23 02:05 Joyshub

To add: while it was running, I checked and there was still about 20 GB of free GPU memory.

Joyshub avatar May 15 '23 02:05 Joyshub

Find all the PIDs occupying the GPU, then kill them with kill -9 PID.
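For example (assuming nvidia-smi is available), the compute PIDs on the GPU can be listed with `nvidia-smi --query-compute-apps=pid --format=csv,noheader` and then passed to `kill -9`; only do this for processes you own and are sure can be safely terminated.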

EricDJL avatar Jun 09 '23 03:06 EricDJL