wenet
Adapt for pytorch version >= 1.9
Can you describe the problem you hit when using the old code with PyTorch 1.9?
I tried a lot of things, but they all failed.
Then I changed my code to follow this example:
https://github.com/pytorch/examples/blob/master/imagenet/main.py
and it works!
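The comment does not say which parts of the linked example were adopted. As a rough sketch only, the pattern that example follows is one spawned process per GPU, each pinned to its own device before the model is wrapped. The names `global_rank`, `worker`, and `launch`, the placeholder `Linear` model, and the address and port defaults are assumptions made for illustration and are not wenet code.

```python
import os


def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Rank of one worker across all nodes, with one process per GPU."""
    return node_rank * gpus_per_node + local_gpu


def worker(local_gpu: int, gpus_per_node: int, node_rank: int, num_nodes: int) -> None:
    # Imported here so the rank helper above can be used without PyTorch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Pin this process to its own GPU before the DDP wrapper is created.
    torch.cuda.set_device(local_gpu)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT
        world_size=num_nodes * gpus_per_node,
        rank=global_rank(node_rank, gpus_per_node, local_gpu),
    )
    model = torch.nn.Linear(8, 8).cuda(local_gpu)  # placeholder model (assumption)
    model = DDP(model, device_ids=[local_gpu], find_unused_parameters=True)
    dist.destroy_process_group()


def launch() -> None:
    """Single-node launch: spawn one worker per visible GPU."""
    import torch
    import torch.multiprocessing as mp

    # Hypothetical rendezvous defaults for a single machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    n = torch.cuda.device_count()
    # mp.spawn passes the process index as the first argument of worker.
    mp.spawn(worker, args=(n, 0, 1), nprocs=n)
```

Calling `launch()` on a machine with CUDA GPUs would start the workers. Whether this pattern is what resolved the error reported in this thread is not confirmed here.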
What's the error message?
Traceback (most recent call last):
File "wenet/bin/train.py", line 199, in
. .
Our machine supports NCCL.
Any progress?
Has anyone tried torch version >= 1.9?
The 1080 Ti and TITAN RTX GPUs have no problem with PyTorch 1.9.0, but the 2080 Ti has the same problem.
Traceback (most recent call last):
  File "wenet/bin/train.py", line 199, in <module>
    model, find_unused_parameters=True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1623448265233/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Our machine supports NCCL.
What is the graphics card you are using?
3090
I get the same error on an NVIDIA RTX 3080 in a Docker environment.
Any update now?
@robin1001
So far, I have found one approach that works for me with PyTorch 1.8 or PyTorch 1.9.
I use the command
docker run -it --rm --gpus '"device=1,2"' --shm-size=1g --ulimit memlock=-1 --ipc=host -v /mnt/4T:/mnt/4T -v /home/maduo/docker_asr_wenet_nvidia_gpu/wenet:/code/w2021/wenet asr_wenet_nvidia_gpu:latest
to run wenet, but in run.sh I must set `export CUDA_VISIBLE_DEVICES="0,1"`. If I increase the number of GPUs, the above error occurs again.
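A possible reading of why `"0,1"` is the right value here: a container started with `--gpus '"device=1,2"'` only sees two GPUs, and they are numbered 0 and 1 inside it, so any higher index points at a device the container cannot see. Whether that is the cause of the error when more GPUs are requested is an assumption, not something confirmed in this thread. The helper below is a hypothetical pre-flight check (the name `parse_visible_devices` is made up for illustration) and handles only integer indices, not GPU UUIDs.

```python
from typing import List


def parse_visible_devices(spec: str, available: int) -> List[int]:
    """Parse a CUDA_VISIBLE_DEVICES-style index list and check it against
    the number of GPUs exposed to the current environment."""
    indices = [int(part) for part in spec.split(",") if part.strip()]
    # Indices are relative to what the container sees, not to the host.
    bad = [i for i in indices if i < 0 or i >= available]
    if bad:
        raise ValueError(
            f"GPU indices {bad} are not visible; only {available} GPU(s) exposed"
        )
    return indices
```

With two GPUs exposed, `parse_visible_devices("0,1", 2)` passes, while `parse_visible_devices("1,2", 2)` raises `ValueError` because index 2 does not exist inside the container. In practice the `available` count could come from `torch.cuda.device_count()`.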
Thank you! I hit the same error. This solution works for me with the environment below:
PyTorch: 1.9.0+cu111
GPU: 3080Ti * 2
CUDA: 11.1
OS: Ubuntu 18.04
Driver: 460.91.03
Any progress?
The fix seems to behave strangely when loading checkpoints. I attached my details in https://github.com/wenet-e2e/wenet/discussions/586#discussioncomment-1345310