silentarmy
Poor performance on pre-Maxwell Nvidia GPUs (local memory atomics)
The commit: https://github.com/mbevand/silentarmy/commit/b879b795141b95d0878e93bd9ae5cab120149891
Before this commit, hashrates were ok.
I use 3 Threads on 4GB GTX 760. Before: ~15H/s After: ~7-9H/s
I am running Arch Linux with nvidia 375.10 and CUDA 8.0 installed.
Confirmed. K2200 with v4 + extremal's patches: 22-25 S/s; latest v5: 9-14 S/s.
Atomic ops are in play here, as they are not optimized for older architectures.
Thanks for the report. I did not test on older Nvidia gear before committing. I think it's probably the use of local memory, not atomics, that hampers performance.
Can you guys try this: edit param.h, find this line:
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 5)
And change "5" to a value between 1 and 5. Recompile and test. See if it improves performance in any way.
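To make it quicker to try each multiplier, the define could be turned into a build-time knob. This is only a sketch: COLL_MULT is a hypothetical macro name that does not exist in param.h, and the NR_SLOTS value below is a placeholder, since the real value is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder only: the real NR_SLOTS lives in param.h and is not shown here. */
#ifndef NR_SLOTS
#define NR_SLOTS 100
#endif

/* Hypothetical knob: build with -DCOLL_MULT=2 (or 1..5) instead of editing the file. */
#ifndef COLL_MULT
#define COLL_MULT 5
#endif

#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * COLL_MULT)

/* Per-thread collision data size for a given multiplier.
 * The assert enforces the 1..5 range suggested above. */
size_t coll_data_size_per_th(size_t nr_slots, unsigned mult)
{
    assert(mult >= 1 && mult <= 5);
    return nr_slots * mult;
}
```

With something like this in place, each test run only needs a different -DCOLL_MULT value on the compiler command line, assuming the build passes that flag through to wherever param.h is included.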
Until Maxwell there were no shared memory atomics in hardware (shared memory is the equivalent of local memory in OpenCL). They were software emulated, so they are dead slow on architectures older than Maxwell. Proof: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
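To illustrate why an emulated atomic costs more than a native one, here is a CPU-side analogy in C11. It is not the instruction sequence the GPU actually uses; it only shows the general shape of a retry loop compared with a single read-modify-write operation. The function names are made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Emulated atomic add: retry a compare-and-swap until no other thread
 * has modified the counter in between. Returns the previous value. */
uint32_t emulated_atomic_add(_Atomic uint32_t *addr, uint32_t val)
{
    assert(addr != NULL);
    uint32_t old = atomic_load(addr);
    /* On failure, atomic_compare_exchange_weak stores the current value
     * into `old`, so `old + val` is recomputed on the next iteration. */
    while (!atomic_compare_exchange_weak(addr, &old, old + val)) {
        /* another thread won the race; try again */
    }
    return old;
}

/* Native path for comparison: one read-modify-write operation. */
uint32_t native_atomic_add(_Atomic uint32_t *addr, uint32_t val)
{
    assert(addr != NULL);
    return atomic_fetch_add(addr, val);
}
```

When many threads hit the same address, the emulated version can spin through several retries per update, which is the kind of contention cost that makes heavy local-atomic use expensive without hardware support.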
On a GTX 750 (1 GB, CUDA 7.5) I get the best result (around 18 sol/s) with the following values:
#define NR_ROWS_LOG 19
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 2)
Seems to be the best spec... Interesting that tromp's CUDA solver gives a more stable sol count overall. Silentarmy is jumping up and down.
ID 0: GeForce GT 740M
#define NR_ROWS_LOG 20
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 3)
@tupieurods I think we are using global atomics.
// I was wrong.
@tupieurods You are right. I didn't know shared atomics were not implemented in hardware pre-Maxwell. That's definitely the cause of the slowdown then, because this commit makes heavy use of shared atomics. I see no solution other than maintaining a second, separate version of input.cl specifically for pre-Maxwell Nvidia GPUs.
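A rough sketch of how the host side might pick between two kernel sources, assuming the CUDA compute capability major version is available (on Nvidia's OpenCL this could come from the cl_nv_device_attribute_query extension, but that lookup is not shown here). The file name input-premaxwell.cl and both function names are hypothetical; nothing like this exists in the repository.

```c
#include <assert.h>

/* Maxwell GPUs report CUDA compute capability 5.x; Kepler is 3.x, Fermi 2.x. */
#define MAXWELL_CC_MAJOR 5

/* Nonzero if the device is Maxwell or newer, i.e. has hardware
 * shared (OpenCL local) memory atomics. */
int has_fast_local_atomics(int cc_major)
{
    assert(cc_major > 0);
    return cc_major >= MAXWELL_CC_MAJOR;
}

/* Pick the kernel source to build. "input-premaxwell.cl" is a made-up
 * name for a variant that avoids local atomics. */
const char *select_kernel_source(int cc_major)
{
    return has_fast_local_atomics(cc_major) ? "input.cl"
                                            : "input-premaxwell.cl";
}
```

For a GTX 760 (Kepler, compute capability 3.0) this would select the fallback file, while a Maxwell or newer card would keep using input.cl.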
In the meantime, the workaround for pre-Maxwell users is to revert to SA v4, or more specifically to the last revision not using local atomics. After a git clone:
$ cd silentarmy && git checkout 243ed569bac5e17305825645023296ccf09c6eeb
Nope, that command gives us a slow version. I get 25-26 Sol/s with a custom version that I forked 2 or 3 patches earlier. Not sure why. I will try to publish it on GitHub this weekend.