Class2D ERROR with cudaMemcpyAsync

Open jmcdonal opened this issue 2 years ago • 2 comments

Hi, I'm an IT professional helping a user with an error she is seeing in a RELION 2D classification job.
The job completes 10 iterations, then fails on the 11th with a cudaMemcpyAsync error. Although the job fails, some orphaned processes keep running: the four MPI processes and 10 GPU threads are still running after the errors.
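For anyone hitting the same symptom, leftover processes like the ones described above can be listed before a rerun. The following is a minimal sketch, not part of RELION: it filters the output of `ps -eo pid,ppid,args` for a command name, and the function names and the sample `ps` output are hypothetical and for illustration only. A parent PID of 1 usually means the process was orphaned and re-parented to init.

```python
import subprocess

def find_leftover_processes(ps_output, pattern="relion_refine"):
    """Return (pid, ppid, command) for each process whose command contains `pattern`.

    `ps_output` is the text printed by `ps -eo pid,ppid,args`.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:   # skip the header row
        parts = line.split(None, 2)           # pid, ppid, rest of the line
        if len(parts) < 3:
            continue
        pid, ppid, command = parts
        if pattern in command:
            matches.append((int(pid), int(ppid), command))
    return matches

def list_leftover_processes(pattern="relion_refine"):
    """Run ps on the local node and filter its output."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_leftover_processes(ps_output, pattern)

# Hypothetical ps output, made up for illustration (not taken from the failing node).
sample = """  PID  PPID COMMAND
 1201     1 /usr/sbin/sshd -D
95930 95900 relion_refine_mpi --o Class2D/job012/run --i Extract/job009/particles.star
96002     1 relion_refine_mpi --o Class2D/job012/run --i Extract/job009/particles.star
"""
for pid, ppid, command in find_leftover_processes(sample):
    print(pid, ppid, command.split()[0])
```

On the affected machine, `list_leftover_processes()` would be called instead of parsing the sample text.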

Environment:

  • OS: CentOS Linux release 7.9.2009 (Core)
  • MPI runtime: openmpi-4.1.0rc5
  • RELION version: 3.1.3-commit-fa923d
  • Memory: 396 GB
  • GPU: 4 X Quadro RTX 6000 / Driver Version: 515.65.01 / CUDA Version: 11.7

Dataset:

  • Box size: [250 px]
  • Pixel size: [0.73 Å/px]
  • Number of particles: [1141934]
  • Description: [A monomeric protein of about 145 kDa in total]

Job options:

  • Type of job: [2D Classification]
  • Number of MPI processes: [4]
  • Number of threads: [32]
  • Full command (see note.txt in the job directory):
    ++++ Executing new job on Tue Sep 13 15:56:43 2022
    ++++ with the following command(s):
    `which relion_refine_mpi` --o Class2D/job012/run --i Extract/job009/particles.star --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 2 --particle_diameter 200 --K 50 --flatten_solvent --zero_mask --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 32 --gpu "0,1,2,3" --pipeline_control Class2D/job012/
    ++++
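As a sanity check on the options above, the sampling figures that appear in the log excerpt below can be reproduced with a little arithmetic. This is only a sketch: the orientation and translation counts are copied from the log rather than derived, and the `count_translations` helper is an assumption about how the coarse translational grid is laid out (grid points spaced `--offset_step` apart inside a circle of radius `--offset_range`), not RELION's actual code.

```python
import math

# Values copied from the command line and dataset description above.
K = 50               # --K (number of classes)
offset_range = 5     # --offset_range, in pixels
offset_step = 2      # --offset_step, in pixels
pixel_size = 0.73    # Å/px

# Values copied from the log excerpt below, keyed by oversampling level.
nr_orientations = {0: 32, 1: 256}
nr_translations = {0: 21, 1: 84}

def count_translations(search_range, step):
    """Count square-grid points (spacing `step`) inside a circle of radius `search_range`.

    Assumption: this mirrors the layout of the coarse translational search grid.
    """
    n = math.ceil(search_range / step)
    return sum(
        1
        for ix in range(-n, n + 1)
        for iy in range(-n, n + 1)
        if (ix * step) ** 2 + (iy * step) ** 2 <= search_range ** 2
    )

print(count_translations(offset_range, offset_step))  # 21, the coarse NrTranslations
print(360 / nr_orientations[0])                        # 11.25 degrees, the coarse OrientationalSampling
print(offset_step * pixel_size)                        # coarse TranslationalSampling in Å

# NrHiddenVariableSamplingPoints = classes x orientations x translations
for level in (0, 1):
    print(level, K * nr_orientations[level] * nr_translations[level])
# prints "0 33600" and "1 1075200", matching the log
```

The numbers are internally consistent, so the sampling setup itself does not look unusual for this job.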


I can provide the full job log file if you like. Here is the snippet from when it fails:

-----------------
Expectation iteration 10 of 25
2.67/7.26 hrs ......................~~(,_,>                                   [oo]
4.62/7.26 hrs .....................................~~(,_,">
6.20/7.26 hrs ..................................................~~(,_,">
7.26/7.26 hrs ............................................................~~(,_,">
Maximization ...
4/   4 sec ............................................................~~(,_,">
Estimating accuracies in the orientational assignment ... 

3/  17 sec ..........~~(,_,">
4/  15 sec ...............~(,_,">
5/  13 sec .....................~~(,_,">
6/  15 sec .......................~~(,_,">
7/  16 sec ..........................~~(,_,">
8/  17 sec ............................~~(,_,">
9/  17 sec ..............................~~(,_,">
10/  18 sec ...............................~~(,_,">
11/  19 sec .................................~~(,_,">
11/  18 sec ...................................~~(,_,">
12/  19 sec .....................................~~(,_,">
13/  20 sec ......................................~~(,_,">
13/  19 sec ........................................~~(,_,">
14/  19 sec ...........................................~~(,_,">
14/  18 sec ............................................~~(,_,">
15/  19 sec ..............................................~~(,_,">
16/  20 sec ...............................................~~(,_,">
17/  20 sec ..................................................~~(,_,">
17/  19 sec ...................................................~~(,_,">
18/  20 sec ....................................................~~(,_,">
18/  19 sec .......................................................~~(,_,">
19/  20 sec ........................................................~~(,_,">
20/  20 sec ..........................................................~~(,_,>
20/  20 sec ...........................................................~~(,_,">
21/  21 sec ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 0.8 degrees; offsets= 0.4745 Angstroms
CurrentResolution= 1.78922 Angstroms, which requires orientationSampling of at least 1.02273 degrees for a particle of diameter 200 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 1.46 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 0.73 NrTranslations= 84
=============================
Expectation iteration 11 of 25
3.02/8.20 hrs ......................~~(,_,>                                   [oo]
5.21/8.19 hrs .....................................~~(,_,">
6.99/8.19 hrs ..................................................~~(,_,">
8.21/8.21 hrs ............................................................~~(,_,">
[1663286718.576383] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576427] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576434] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576440] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576446] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576451] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576457] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576463] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576468] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576473] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576479] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576484] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576489] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576494] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576500] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576505] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576510] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576515] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576520] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576526] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576531] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576537] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576582] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576587] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576592] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576597] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576602] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576607] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576614] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576621] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576627] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576657] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576662] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576667] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576675] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576681] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576708] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576715] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.576818] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.577025] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.577249] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.577460] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.577676] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.577893] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.578222] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.578334] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.578543] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.578759] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.579027] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.579250] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.579482] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.579705] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.579927] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.580181] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
[1663286718.580390] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument
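The excerpt above is 55 copies of the same UCX error, all from one process (PID 95930 on jflab01) within a few milliseconds. A long log of this kind can be condensed with a short script. The sketch below is an assumption about how one might do it, not a RELION or UCX tool: the regular expression is fitted only to the line format shown above, and `summarize_ucx_errors` is a hypothetical helper. The leading number in each line is treated as a Unix timestamp in seconds.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches lines such as:
# [1663286718.576383] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR <message>
UCX_ERROR = re.compile(
    r"^\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+ERROR\s+(?P<msg>.+)$"
)

def summarize_ucx_errors(lines):
    """Count UCX ERROR lines per (host, pid, source, message) and return the time span."""
    counts = Counter()
    stamps = []
    for line in lines:
        match = UCX_ERROR.match(line.strip())
        if match is None:
            continue  # not a UCX error line
        key = (match["host"], int(match["pid"]), match["src"], match["msg"])
        counts[key] += 1
        stamps.append(float(match["ts"]))
    span = (min(stamps), max(stamps)) if stamps else None
    return counts, span

# First and last error lines of the excerpt above.
excerpt = [
    "[1663286718.576383] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument",
    "[1663286718.580390] [jflab01:95930:0]   cuda_copy_ep.c:77   UCX  ERROR cudaMemcpyAsync(dst, src, length, direction, iface->stream[id])() failed: invalid argument",
]

counts, span = summarize_ucx_errors(excerpt)
for (host, pid, src, msg), n in counts.items():
    print(f"{n} x {host} pid {pid} at {src}: {msg}")
first, last = span
print("first error (UTC):", datetime.fromtimestamp(int(first), tz=timezone.utc).isoformat())
print(f"burst duration: {(last - first) * 1000:.1f} ms")
```

Passing an open file handle for the full job log instead of `excerpt` would show whether any other rank or host reported the error.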



jmcdonal • Sep 21 '22 14:09

Is this reproducible on other machines, or after rebooting the machine? Did you run a GPU stress test to check that the GPU memory and GPU power supply are fine?

Did you try RELION 4.0 beta?

biochem-fan • Sep 21 '22 22:09

We have been able to reproduce the error on the same system after a clean reboot. Nothing in the system logs indicates that a GPU has had an issue. We have not specifically run a stress test yet, but we can try that. We also have another machine we can try running the job on, and the user has the option to try version 4 as well.
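Short of a full stress test, a quick snapshot of temperature, power draw and memory use per GPU can be taken while the job runs. The sketch below is not a stress test and not part of RELION. It parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`; the helper names and the sample values are hypothetical and made up for illustration.

```python
import subprocess

# Query fields passed to nvidia-smi; the output columns follow the same order.
FIELDS = ["index", "name", "temperature.gpu", "power.draw", "memory.used", "memory.total"]

def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not have the expected columns
        rows.append(dict(zip(FIELDS, values)))
    return rows

def query_gpus():
    """Run nvidia-smi on the local node and parse its output."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(output)

# Made-up sample output for illustration; these are not measurements from jflab01.
sample = """0, Quadro RTX 6000, 41, 62.15, 312, 24220
1, Quadro RTX 6000, 43, 64.80, 298, 24220
"""
for gpu in parse_gpu_csv(sample):
    print(gpu["index"], gpu["name"], gpu["temperature.gpu"], "C", gpu["power.draw"], "W")
```

Calling `query_gpus()` periodically during iteration 11 would show whether a card drops out or behaves differently shortly before the failure.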

jmcdonal • Sep 22 '22 21:09