mcvd-pytorch
Error in training on MNIST
Hi!
When I was training on MNIST with the command:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni
I received the following error: smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
subprocess.run(
File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
I checked the NCSNRunner class and load_meters(); it seems to be trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is meters.pkl here, and how can I solve the error?
Thanks!
It contains the metrics over time. I'm not sure why it would do that. 🤔
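To illustrate why the "does not exist! Returning." line is likely just a notice on a first run, here is a minimal sketch of a loader that tolerates a missing metrics file. It is a hypothetical stand-in written for this thread, not the repository's actual NCSNRunner.load_meters; only the file name and the message text are taken from the post above, and the sample metrics dictionary is made up.

```python
import os
import pickle
import tempfile


def load_meters(log_path):
    """Return metrics saved in log_path/meters.pkl, or None if the file is absent.

    Hypothetical stand-in, NOT the repository's NCSNRunner.load_meters.
    """
    meters_pkl = os.path.join(log_path, "meters.pkl")
    if not os.path.exists(meters_pkl):
        # A fresh experiment has no saved metrics yet, so this is only a notice.
        print(f"{meters_pkl} does not exist! Returning.")
        return None
    with open(meters_pkl, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as log_dir:
    first_run = load_meters(log_dir)  # nothing has been saved yet
    # Made-up metrics, only to show the resume path.
    with open(os.path.join(log_dir, "meters.pkl"), "wb") as f:
        pickle.dump({"loss": [0.9, 0.5]}, f)
    resumed = load_meters(log_dir)  # metrics written by the earlier run
```

Under that assumption, the traceback that follows (the failed `ninja -v` call inside torch/utils/cpp_extension.py) is the part that actually stops training.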
Check that the folders exist: /cluster/51/dichang/datasets/mcvd and /cluster/51/dichang/datasets/mcvd/smmnist_cat/log. Make sure that the ninja package is installed properly.
I agree, I don't think it has to do with the metrics. Check the data folder exists, and check your ninja installation. Maybe install ninja at the end.
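The two checks suggested above can be sketched with the standard library alone. The helper name `check_environment` and the report keys are made up for illustration; the path is the one from the original command. Note that this only shows whether a ninja executable is found and starts; a non-zero exit status from `ninja -v` may still mean that the extension build itself failed.

```python
import os
import shutil
import subprocess


def check_environment(paths):
    """Report which folders exist and whether a ninja executable can be run."""
    report = {p: os.path.isdir(p) for p in paths}
    ninja = shutil.which("ninja")  # None if ninja is not on PATH
    report["ninja_on_path"] = ninja is not None
    report["ninja_runs"] = False
    if ninja is not None:
        try:
            result = subprocess.run(
                [ninja, "--version"], capture_output=True, text=True, timeout=30
            )
            report["ninja_runs"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass  # found on PATH but could not be executed
    return report


# Path taken from the command in the original post; adjust to your setup.
report = check_environment(["/cluster/51/dichang/datasets/mcvd"])
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```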
Could you please tell me the pytorch and ninja version you're using for training? Thanks.
On my side it doesn't work with torch==1.11.0 and ninja==1.10.2.3.
But when I use torch on CPU, it works.
Same issue!
I'm using ninja==1.10.2.3, torch==1.10.0 on my local machine with CPU, and torch==1.11.0 with GPUs. In both cases, training works.
Could you please tell us the CUDA version and the type of GPUs you are using?
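A short script along these lines can collect that information in one go. It is a sketch that assumes PyTorch may or may not be installed; the function name `describe_setup` is made up for illustration.

```python
import importlib
import platform


def describe_setup():
    """Collect Python, torch, CUDA, and GPU details for a bug report."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "gpus": [],
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # torch is not installed in this environment
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            info["gpus"].append(f"{name} (sm_{major}{minor})")
    return info


setup = describe_setup()
for key, value in setup.items():
    print(f"{key}: {value}")
```

Pasting the printed lines into the issue gives maintainers the torch build, the CUDA version it was compiled against, and the compute capability of each visible GPU.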
I'm using CUDA==11.3 and torch==1.11.0, and the GPU is an NVIDIA RTX 3090 Ti. I ran into the same issue while training. What should I do? Thank you.
It seems like other people have had similar issues and they propose some solutions, see: https://github.com/mapillary/inplace_abn/issues/104 and https://github.com/mapillary/inplace_abn/issues/106#issuecomment-475460496.
I really don't know what to do with ninja or even what it does. 😞 I hope that some of these proposed solutions can work for you. If you find a solution to this problem, let us know and we can mention it in the README.