
Error in DataParallel when trying to train with multiple GPUs on torch 1.11

Open amirbarda opened this issue 2 years ago • 1 comment

I am getting this error when trying to train with multiple GPUs:

    File "./lib/python3.7/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    NotImplementedError: Could not run 'aten::view' with arguments from the 'SparseCUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::view' is only available for these backends: [CPU, CUDA, Meta, QuantizedCPU, QuantizedCUDA, MkldnnCPU, BackendSelect, Python, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradLazy, AutogradXPU, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, AutocastCPU, Autocast, Batched, VmapMode, Functionalize].

I am trying to run the default tensorflow.yml experiment. It works fine on a single GPU, and evaluation works on multiple GPUs.

I looked at the blocks in the module and could not find anything involving sparsity. I am using PyTorch 1.11 with CUDA 11.3. Any idea why this happens? Thanks.

amirbarda avatar Apr 08 '22 12:04 amirbarda
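For context, the traceback points at DataParallel's replication step: `replicate()` broadcasts the wrapped module's parameters and buffers to every GPU, and on this PyTorch version that broadcast path ends up calling `aten::view`, which is not implemented for sparse CUDA tensors. GCN-style models like Pixel2Mesh often keep sparse adjacency/support matrices in the model state, which would explain the error even if no layer visibly "uses sparsity". Below is a minimal sketch of that failure mode; the layer and names are hypothetical, not taken from the Pixel2Mesh code, and I have not verified it against torch 1.11 specifically:

```python
import torch
import torch.nn as nn

class SparseGraphLayer(nn.Module):
    """Hypothetical stand-in for a graph-conv layer that keeps a
    sparse adjacency matrix in the module state."""
    def __init__(self, adj):
        super().__init__()
        # A sparse tensor stored as a (non-trainable) parameter gets
        # broadcast to every replica by DataParallel, which is where
        # the aten::view / SparseCUDA failure would appear.
        self.adj = nn.Parameter(adj, requires_grad=False)
        self.linear = nn.Linear(3, 3)

    def forward(self, x):
        return torch.sparse.mm(self.adj, self.linear(x))

indices = torch.tensor([[0, 1, 2], [1, 2, 0]])
values = torch.ones(3)
adj = torch.sparse_coo_tensor(indices, values, (3, 3))

model = nn.DataParallel(SparseGraphLayer(adj).cuda(), device_ids=[0, 1])
x = torch.randn(3, 3).cuda()
out = model(x)  # expected to raise NotImplementedError during replication
```

If this is indeed the cause, scanning `model.state_dict()` for entries with `tensor.layout == torch.sparse_coo` should identify the offending parameter or buffer in the real model.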

Sorry, I think this repo is a bit out of maintenance. If you are not interested in digging into the cause, you can try the PyTorch version mentioned in the README.

ultmaster avatar Apr 17 '22 01:04 ultmaster
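For anyone landing here later: besides downgrading to the PyTorch version in the README, a possible workaround is to keep the sparse tensor out of the module's parameters and buffers, so DataParallel never tries to broadcast it, and to move it to the active device inside `forward`. A sketch under the same hypothetical layer as above, not code from this repo:

```python
import torch
import torch.nn as nn

class SparseGraphLayerWorkaround(nn.Module):
    def __init__(self, adj):
        super().__init__()
        # Plain Python attribute: replicate() only broadcasts
        # parameters and buffers, so this tensor is left alone
        # and shared by all replicas.
        self.adj = adj
        self.linear = nn.Linear(3, 3)

    def forward(self, x):
        # The shared attribute lives on one device; copy it to the
        # device this replica runs on (a no-op on that device).
        adj = self.adj.to(x.device)
        return torch.sparse.mm(adj, self.linear(x))
```

The trade-off is that the shared tensor is copied across GPUs on every forward call; if the matrix fits in memory, converting it once with `.to_dense()` before registering it is a simpler alternative.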