stylegan2-pytorch

train.py hangs when running on a single GPU

Open albusdemens opened this issue 5 years ago • 3 comments

I am having trouble using your reimplementation to train a model on my data. When I run the code on my desktop, I get the following error:

CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 10.76 GiB total capacity; 7.05 GiB already allocated; 55.69 MiB free; 166.59 MiB cached)
0%| | 0/800000 [00:00<?, ?it/s]

I also have access to a GPU cluster, where I tried to run the script with CUDA_VISIBLE_DEVICES=7 python train.py --batch 4 ./Maps_512/. There, I get no output after launching the command, and nvidia-smi suggests the GPU is never used. Do you have any suggestions as to why that is?

albusdemens avatar Jan 28 '20 17:01 albusdemens

On the cluster, the command CUDA_VISIBLE_DEVICES=7 python -m trace --trace train.py --batch 4 ./Maps_512/ produces output that ends with the following block, repeated over and over:

    --- modulename: genericpath, funcname: exists
    genericpath.py(18): try:
    genericpath.py(19): os.stat(path)
    genericpath.py(22): return True
    file_baton.py(49): time.sleep(self.wait_seconds)
    file_baton.py(48): while os.path.exists(self.lock_file_path):

albusdemens avatar Jan 28 '20 17:01 albusdemens
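
The trace shows the process alternating between `os.path.exists(self.lock_file_path)` and `time.sleep(self.wait_seconds)` in file_baton.py, which means it is polling for a lock file to disappear. Below is a minimal sketch of that polling pattern. It assumes, without confirmation from this thread, that the hang comes from a lock file left behind by an earlier interrupted run. `wait_for_lock_release`, `demo`, the `timeout` parameter and the file name `lock` are hypothetical names used for illustration only and are not part of PyTorch.

```python
import os
import tempfile
import time


def wait_for_lock_release(lock_file_path, wait_seconds=0.1, timeout=None):
    """Poll until lock_file_path disappears, like the loop in the trace.

    Returns True if the file went away, False if `timeout` seconds passed
    first. With timeout=None this waits forever, as the traced loop does.
    """
    start = time.monotonic()
    while os.path.exists(lock_file_path):
        if timeout is not None and time.monotonic() - start >= timeout:
            return False
        time.sleep(wait_seconds)
    return True


def demo():
    """Show that a leftover lock file blocks the waiter until it is removed."""
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "lock")  # hypothetical lock file name
        # Simulate a lock left behind by a process that died before releasing it.
        open(lock, "w").close()
        stale = wait_for_lock_release(lock, wait_seconds=0.01, timeout=0.05)
        # Removing the stale lock lets the waiter return immediately.
        os.remove(lock)
        cleared = wait_for_lock_release(lock, wait_seconds=0.01, timeout=0.05)
        return stale, cleared


if __name__ == "__main__":
    print(demo())  # (False, True)
```

If a leftover lock file is indeed the cause, removing it (after confirming that no other process is still using it) would let the wait loop exit. Where that file lives depends on the setup, so the path is not filled in here.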

Hi, have you solved the problem?

ykang-cool avatar Jul 01 '21 05:07 ykang-cool

For me, downgrading PyTorch from 1.7.0 to 1.3.1 fixed this problem.

yeates avatar Sep 27 '21 07:09 yeates