MATLAB TIGRE failing in Quadro P5000

Open AnderBiguri opened this issue 3 years ago • 12 comments

Unfortunately this is second-hand information, so I can't know exactly what's wrong, but someone with a Quadro P5000 reports "unspecified launch failure" errors, particularly on 2-GPU machines in MATLAB.

This is interesting, as the error seems to be related to segfaults in the GPU code, but as far as I know no one else has the same issue, so I'm not sure what the cause may be. Sometimes this is caused by block/thread sizes, etc.

If anyone else has a P5000, can you test TIGRE there and report back to me, so I can have more info about this?

AnderBiguri avatar Nov 17 '20 23:11 AnderBiguri

I have had a similar problem on a two-GPU machine (also Quadro P5000) before. After I changed to a newer version of CUDA and recompiled, the problem went away. @AnderBiguri

yliu88au avatar Dec 04 '20 00:12 yliu88au

@AnderBiguri Sorry, after further testing, OS_AwASD_POCS (MATLAB) still crashes on the 2-P5000 machine. The MLEM algorithm runs fine on the same computer (which means Ax and Atb are good).

yliu88au avatar Dec 04 '20 05:12 yliu88au

Further investigation shows that the algorithms crash at the line f=minimizeTV(f0,dtvg,ng); or f=minimizeAwTV(f0,dtvg,ng,delta); on a computer with 2 P5000 GPUs (MATLAB). Hope this information helps.

yliu88au avatar Dec 04 '20 06:12 yliu88au

@yliu88au yes that is actually very useful information.

It's interesting because I have used those algorithms on 2-GPU machines without problems, but apparently either a particular configuration or the P5000s are making this crash.

AnderBiguri avatar Dec 04 '20 09:12 AnderBiguri

With all new PRs incorporated (PR #221 and PR #228), both the MATLAB and Python versions crash for 2D cases only, while the 3D case runs OK on this dual P5000 GPU computer. The crashes occur with algorithms involving either POCS_TV.cu or POCS_TV2.cu. The same examples (2D and 3D cases of both versions) run OK on single-GPU computers. @AnderBiguri

yliu88au avatar Dec 21 '20 01:12 yliu88au

@AnderBiguri Further tests show that, for a problem with geo.nVoxel=[x,y,z], if z=1,2,3, the POCS algorithms crash in both the MATLAB and Python versions on the dual-GPU (P5000) computer. However, if z>=4, everything runs OK. I am wondering if this is true for other dual-GPU computers?

yliu88au avatar Dec 21 '20 10:12 yliu88au

@yliu88au it is likely true for most. I suspect a bug in the multi-GPU code. That information helps a lot: it's likely that the special case of small Z is not handled correctly when splitting the image among the GPUs.

AnderBiguri avatar Dec 21 '20 10:12 AnderBiguri

@AnderBiguri Yeah, the multi-GPU code section in, say, Siddon_projection.cu looks quite different from the one in POCS_TV(2).cu.

yliu88au avatar Dec 21 '20 11:12 yliu88au

@yliu88au actually that is expected. At this lower level, the particular problem that we are trying to solve requires radically different code to make it parallelizable. I think the issue is somewhere here (and in the subsequent lines):

https://github.com/CERN/TIGRE/blob/41ffa73421512b49515e749a7d809d8d8c0af948/MATLAB/Source/POCS_TV.cu#L309

Particularly the section that handles the case where there are multiple GPUs but the entire problem could be solved by 1 GPU, so parallelization is not mandatory, it just benefits the problem. As each GPU only gets part of the image (split along the Z axis), I think there is an issue when z<4, as the TV requires 2 slices to compute itself (pixel-wise differences). Each GPU should get a minimum of 2 slices (i.e. z>=4 with 2 GPUs) to function properly. This is a bug, of course: the CUDA code should be able to handle this particular situation with a special case.
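To make the slice-count argument concrete, here is a minimal host-side C++ sketch. It is not TIGRE's actual splitting code: the names slices_per_gpu and split_is_safe are hypothetical, and the balanced split (slice counts differing by at most one) is an assumption made only for illustration. Under that assumption it shows that with 2 GPUs, any z below 4 leaves at least one GPU with fewer than the 2 slices the TV difference needs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper, not TIGRE code: split nz slices over n_gpus devices
// as evenly as possible (assumed balanced split along Z).
std::vector<int> slices_per_gpu(int nz, int n_gpus) {
    assert(nz >= 0 && n_gpus > 0);
    const int base = nz / n_gpus;   // slices every device gets
    const int extra = nz % n_gpus;  // the first `extra` devices get one more
    std::vector<int> counts(n_gpus, base);
    for (int dev = 0; dev < extra; ++dev) {
        counts[dev] += 1;
    }
    return counts;
}

// True when every device receives at least min_slices slices
// (2 by default, the minimum mentioned above for the TV differences).
bool split_is_safe(int nz, int n_gpus, int min_slices = 2) {
    const std::vector<int> counts = slices_per_gpu(nz, n_gpus);
    return std::all_of(counts.begin(), counts.end(),
                       [min_slices](int c) { return c >= min_slices; });
}
```

For example, split_is_safe(3, 2) is false because the split is 2 and 1 slices, while split_is_safe(4, 2) is true because both devices get 2 slices.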

AnderBiguri avatar Dec 21 '20 12:12 AnderBiguri

@AnderBiguri I see. I have found a quick (maybe not optimal) solution to this problem: just use a single GPU when the problem is thin (i.e., z<4), by adding one line above line 309:

    // if it is a thin problem (no need to split), just use one GPU
    if (image_size[2]<4){deviceCount=1;}

I have tested several examples with z=1,2,3 for both POCS_TV.cu and POCS_TV2.cu, and it works for both MATLAB and Python on the dual P5000 GPU computer.
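The hard-coded threshold of 4 matches a 2-GPU machine. As a rough sketch only, the same idea could be written for any number of GPUs. The function effective_device_count below is hypothetical and not part of TIGRE, and it assumes a balanced split along Z plus the minimum of 2 slices per GPU mentioned earlier in the thread:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical generalization of the one-line workaround, not TIGRE code.
// Returns how many GPUs to use so that, under a balanced split along Z,
// each one gets at least min_slices_per_gpu slices. Never returns less than 1.
int effective_device_count(int nz, int device_count, int min_slices_per_gpu = 2) {
    assert(nz >= 1 && device_count >= 1 && min_slices_per_gpu >= 1);
    const int max_usable = nz / min_slices_per_gpu;  // integer division
    return std::max(1, std::min(device_count, max_usable));
}
```

With device_count=2 this gives 1 for z=1,2,3 and 2 for z>=4, which is the same behavior as the line above.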

yliu88au avatar Dec 22 '20 05:12 yliu88au

I will still leave this open, as the original problem occurred with images larger than z=4, so PR #230 is only a partial fix, as far as I know.

AnderBiguri avatar Dec 22 '20 15:12 AnderBiguri

Agreed, since #230 is only a simple temporary fix for the cases where z<4. There should be better solutions. @AnderBiguri

yliu88au avatar Dec 22 '20 23:12 yliu88au