mcvd-pytorch

Error in training on MNIST

Open Boese0601 opened this issue 2 years ago • 8 comments

Hi!

When I was training on MNIST with the command:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni

I received the following error:

smmnist_cat/logs/meters.pkl does not exist! Returning.
ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
  File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

I checked the NCSNRunner class and load_meters(); it seems to be trying to load from "meters_pkl = os.path.join(self.args.log_path, 'meters.pkl')". What is meters.pkl here, and how can I solve the error?

Thanks!

Boese0601 avatar Jun 16 '22 19:06 Boese0601

It contains the metrics over time. I'm not sure why it would do that. 🤔

Check that the folders exist: /cluster/51/dichang/datasets/mcvd and /cluster/51/dichang/datasets/mcvd/smmnist_cat/log. Make sure that the ninja package is installed properly.

AlexiaJM avatar Jun 17 '22 17:06 AlexiaJM
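To illustrate why the "meters.pkl does not exist! Returning." line is probably harmless on a first run, here is a minimal sketch of a loader that returns early when the file is missing. It is a hypothetical stand-in written for this thread, not the repository's actual load_meters(); only the path construction and the printed message are taken from the post above, and the pickle format is an assumption.

```python
import os
import pickle


def load_meters(log_path):
    """Hypothetical stand-in for the repository's load_meters().

    Returns the unpickled metrics, or None when meters.pkl is not there yet
    (for example on the first run of a new experiment).
    """
    meters_pkl = os.path.join(log_path, "meters.pkl")
    if not os.path.exists(meters_pkl):
        # Informational only: nothing to resume from, so start fresh.
        print(f"{meters_pkl} does not exist! Returning.")
        return None
    with open(meters_pkl, "rb") as f:
        return pickle.load(f)
```

If the real function behaves like this, the message is just a notice that there are no saved metrics to resume from, and the traceback that follows comes from a separate step (the ninja build).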

I agree, I don't think it has to do with the metrics. Check that the data folder exists, and check your ninja installation. Maybe install ninja last.

voletiv avatar Jun 17 '22 17:06 voletiv
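The checks suggested in the two replies above can be sketched as a short script. It uses only the Python standard library; the function name check_setup is made up for this example, and the folder paths in the usage section are the ones quoted in this thread, so adjust them to your own setup.

```python
import os
import shutil
import subprocess


def check_setup(folders):
    """Return (folder_status, ninja_path).

    folder_status maps each path to True/False depending on whether it is an
    existing directory; ninja_path is the ninja executable found on PATH, or
    None if there is none.
    """
    folder_status = {path: os.path.isdir(path) for path in folders}
    ninja_path = shutil.which("ninja")
    return folder_status, ninja_path


if __name__ == "__main__":
    folders = [
        "/cluster/51/dichang/datasets/mcvd",
        "/cluster/51/dichang/datasets/mcvd/smmnist_cat/log",
    ]
    status, ninja = check_setup(folders)
    for path, exists in status.items():
        print(("OK     " if exists else "MISSING"), path)
    if ninja is None:
        print("ninja was not found on PATH")
    else:
        # Ask the executable for its version to confirm it actually runs.
        result = subprocess.run([ninja, "--version"], capture_output=True, text=True)
        print("ninja", result.stdout.strip(), "at", ninja)
```

Run it inside the same environment you use for training, since a ninja installed in a different environment will not be visible to PyTorch.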

Could you please tell me the PyTorch and ninja versions you're using for training? Thanks.

On my side, it doesn't work with torch==1.11.0 and ninja==1.10.2.3.

But when I use torch on CPU, it works.

Boese0601 avatar Jun 18 '22 11:06 Boese0601

Same issue!

dhruv-nathawani avatar Jul 14 '22 23:07 dhruv-nathawani

I'm using ninja==1.10.2.3, torch==1.10.0 on my local machine with CPU, and torch==1.11.0 with GPUs. In both cases, training works.

voletiv avatar Jul 14 '22 23:07 voletiv

Could you please tell us the CUDA version and the type of GPUs you are using?

dhruv-nathawani avatar Jul 15 '22 00:07 dhruv-nathawani

> Could you please tell us the CUDA version and the type of GPUs you are using?

I'm using CUDA==11.3 and torch==1.11.0, and the GPU is an NVIDIA RTX 3090 Ti. I encountered the same issue while training. What should I do? Thank you.

1094724913 avatar Aug 04 '23 13:08 1094724913
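Since several comments ask for version details, a small script like the following may make it easier to report them consistently. It is only a sketch: the function name describe_environment is made up for this example, and it falls back gracefully when torch is not installed.

```python
import platform


def describe_environment():
    """Collect Python, PyTorch, CUDA, and GPU details into a dict."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Note that "cuda_build" is the CUDA version PyTorch was built with, which can differ from the system toolkit reported by nvcc, so it may be worth including both when posting.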

It seems like other people have had similar issues and they propose some solutions, see: https://github.com/mapillary/inplace_abn/issues/104 and https://github.com/mapillary/inplace_abn/issues/106#issuecomment-475460496.

I really don't know what to do with ninja or even what it does. 😞 I hope that some of these proposed solutions can work for you. If you find a solution to this problem, let us know and we can mention it in the README.

AlexiaJM avatar Aug 04 '23 14:08 AlexiaJM