stylegan2-pytorch
Train.py hanging when running on a single GPU
I am having issues using your reimplementation to train a model on my data. When I run the code on my desktop, I get the error:

```
CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 10.76 GiB total capacity; 7.05 GiB already allocated; 55.69 MiB free; 166.59 MiB cached)
0%|          | 0/800000 [00:00<?, ?it/s]
```
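With roughly 10.7 GiB of total GPU memory, the default configuration may simply be too large for 512-px training. One hedged suggestion is to lower the batch size via the `--batch` flag of `train.py`; the concrete values below are guesses to tune for your card, not a verified configuration:

```shell
# Sketch: try a smaller batch so training fits on an 11 GB GPU.
# --batch and --size are existing train.py flags; the values are assumptions.
python train.py --size 512 --batch 2 ./Maps_512/
```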
I also have access to a GPU cluster, and I tried to run the script there using `CUDA_VISIBLE_DEVICES=7 python train.py --batch 4 ./Maps_512/`. Here, I get no output after launching the command, and from `nvidia-smi` it looks like the GPU is never used. Do you have any suggestions on why that is?
On the cluster, the command `CUDA_VISIBLE_DEVICES=7 python -m trace --trace train.py --batch 4 ./Maps_512/` never errors out; the trace output ends with the following block repeating indefinitely:

```
--- modulename: genericpath, funcname: exists
genericpath.py(18):     try:
genericpath.py(19):         os.stat(path)
genericpath.py(22):     return True
file_baton.py(49):             time.sleep(self.wait_seconds)
file_baton.py(48):         while os.path.exists(self.lock_file_path):
```
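The repeating `file_baton.py` lines come from PyTorch's `torch.utils.cpp_extension` machinery: the script JIT-compiles the fused CUDA ops on first run, and `FileBaton` polls a `lock` file in the extension build directory. If a previous build was interrupted, a stale lock file makes every subsequent run wait forever, which matches the hang described above. A common remedy is clearing the build cache; the default path below is an assumption, so check `TORCH_EXTENSIONS_DIR` if you have set it:

```shell
# The hang pattern (file_baton.py polling os.path.exists(lock_file_path))
# suggests a stale lock left behind by an interrupted extension build.
EXT_DIR="${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"
# Show any leftover lock files first (ignore errors if the dir is absent).
find "$EXT_DIR" -name lock -print 2>/dev/null || true
# Remove the cache entirely; PyTorch rebuilds the ops on the next run.
rm -rf "$EXT_DIR"
```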
Hi, have you solved the problem?
For me, downgrading PyTorch from 1.7.0 to 1.3.1 fixed this problem.
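For anyone trying the downgrade mentioned above, pinning the version is a plain pip install; the matching torchvision and CUDA builds depend on your environment and are not specified here:

```shell
# Pin PyTorch to 1.3.1 (pick the wheel matching your CUDA version).
pip install torch==1.3.1
```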