
Poor performance on pre-Maxwell Nvidia GPUs (local memory atomics)

Open computerlyrik opened this issue 8 years ago • 9 comments

The commit: https://github.com/mbevand/silentarmy/commit/b879b795141b95d0878e93bd9ae5cab120149891

Before this commit, hashrates were ok.

I use 3 threads on a 4 GB GTX 760. Before: ~15 H/s. After: ~7-9 H/s.

I am running Arch Linux with Nvidia driver 375.10 and CUDA 8.0 installed.

computerlyrik avatar Nov 12 '16 12:11 computerlyrik

Confirmed. K2200 with v4 + extremal's patches: 22-25 S/s; latest v5: 9-14 S/s.

ddobreff avatar Nov 12 '16 13:11 ddobreff

Atomic ops are in play here, as they are not optimized for older architectures.

Kubuxu avatar Nov 12 '16 15:11 Kubuxu

Thanks for the report. I did not test on older Nvidia gear before committing. I think it's probably the usage of local memory, not atomics, that hampers performance.

Can you guys try this: edit param.h, find this line:

#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 5)

And change "5" to a value between 1 and 5. Recompile and test. See if it improves performance in any way.

mbevand avatar Nov 12 '16 17:11 mbevand

Until Maxwell there were no hardware atomics on shared memory (the equivalent of local memory in OpenCL); they were emulated in software, so they are dead slow on architectures older than Maxwell. Proof: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

tupieurods avatar Nov 12 '16 18:11 tupieurods

On a GTX 750 (1 GB, CUDA 7.5) the best result (around 18 sol/s) was with the following values:

#define NR_ROWS_LOG 19
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 2)

blackjec69 avatar Nov 12 '16 18:11 blackjec69

Seems to be the best spec... Interesting that tromp's CUDA gives a more stable sol count overall, while Silentarmy jumps up and down. ID 0: GeForce GT 740M

#define NR_ROWS_LOG 20
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 3)

montvid avatar Nov 12 '16 19:11 montvid

@tupieurods I think we are using global atomics.

// I was wrong.

Kubuxu avatar Nov 12 '16 19:11 Kubuxu

@tupieurods You are right. I didn't know shared atomics were not implemented in hardware pre-Maxwell. That's definitely the cause of the slowdown then, because this commit makes heavy use of shared atomics. I see no solution other than maintaining a second, separate version of input.cl specifically for pre-Maxwell Nvidia GPUs.

In the meantime, the workaround is for pre-Maxwell users to revert to SA v4, or more specifically to the last revision not using local atomics. After a git clone:

$ cd silentarmy && git checkout 243ed569bac5e17305825645023296ccf09c6eeb

mbevand avatar Nov 12 '16 20:11 mbevand

Nope, that command gives us a slow version. I get 25-26 Sol/s with a custom version I forked 2 or 3 patches before that. Not sure why; I will try to publish it on GitHub this weekend.

Singman33 avatar Nov 17 '16 19:11 Singman33