cpuminer-neoscrypt
Added AVX1 support for salsa and chacha rounds
Code is in C for better maintainability. ASM derived from these files might increase speed slightly. The current speed increase over the SSE routines is about 10%.
Thanks, I plan to add the AVX/XOP assembly code in the future and may use your inline assembly as a reference. SSE2 4-way is also going to be improved.
I was wondering where your speed increase from the 4-way comes from. I guess I still have to rewrite the KDF compress in inline assembly. This function looks like a good candidate for 4-way, or maybe 8-way, optimization, depending on the XMM register requirements.
The original CpuMiner had a scrypt 3-way and a SHA256 4-way, which gave the best results when run as a 12-way on AVX1. Scrypt 3-way kept 3 'matrices' in XMM registers, leaving 4 XMM registers free for calculations etc. It seems that XMM-to-XMM operations run about 3 times faster than XMM-to-memory operations.
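The register budget behind the N-way trade-off can be illustrated with a lane-parallel layout, where each XMM register holds the same word from four independent states. The sketch below is not code from this pull request: it uses the standard ChaCha double round and SSE2 intrinsics, and the names `qr_4way`, `double_round_4way` and `fourway_matches_scalar` are made up for illustration. Note that the 16 state words alone fill all 16 XMM registers available on x86-64, so the compiler has to spill some of them to memory.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define VROTL(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* Scalar reference: standard ChaCha quarter round on one state. */
static void qr_scalar(uint32_t *x, int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 7);
}

static void double_round_scalar(uint32_t x[16])
{
    qr_scalar(x, 0, 4, 8, 12);  qr_scalar(x, 1, 5, 9, 13);   /* columns */
    qr_scalar(x, 2, 6, 10, 14); qr_scalar(x, 3, 7, 11, 15);
    qr_scalar(x, 0, 5, 10, 15); qr_scalar(x, 1, 6, 11, 12);  /* diagonals */
    qr_scalar(x, 2, 7, 8, 13);  qr_scalar(x, 3, 4, 9, 14);
}

/* Same quarter round, but every operation handles four states at once. */
static void qr_4way(__m128i *v, int a, int b, int c, int d)
{
    v[a] = _mm_add_epi32(v[a], v[b]); v[d] = _mm_xor_si128(v[d], v[a]); v[d] = VROTL(v[d], 16);
    v[c] = _mm_add_epi32(v[c], v[d]); v[b] = _mm_xor_si128(v[b], v[c]); v[b] = VROTL(v[b], 12);
    v[a] = _mm_add_epi32(v[a], v[b]); v[d] = _mm_xor_si128(v[d], v[a]); v[d] = VROTL(v[d], 8);
    v[c] = _mm_add_epi32(v[c], v[d]); v[b] = _mm_xor_si128(v[b], v[c]); v[b] = VROTL(v[b], 7);
}

/* One double round over four independent states s[0..3]. */
static void double_round_4way(uint32_t s[4][16])
{
    __m128i v[16];
    uint32_t lanes[4];
    int i, k;

    /* Transpose in: lane k of v[i] is word i of state k. */
    for (i = 0; i < 16; i++)
        v[i] = _mm_set_epi32((int)s[3][i], (int)s[2][i], (int)s[1][i], (int)s[0][i]);

    qr_4way(v, 0, 4, 8, 12);  qr_4way(v, 1, 5, 9, 13);
    qr_4way(v, 2, 6, 10, 14); qr_4way(v, 3, 7, 11, 15);
    qr_4way(v, 0, 5, 10, 15); qr_4way(v, 1, 6, 11, 12);
    qr_4way(v, 2, 7, 8, 13);  qr_4way(v, 3, 4, 9, 14);

    /* Transpose out. */
    for (i = 0; i < 16; i++) {
        _mm_storeu_si128((__m128i *)lanes, v[i]);
        for (k = 0; k < 4; k++)
            s[k][i] = lanes[k];
    }
}

/* Returns 1 when the 4-way result equals four scalar runs, else 0. */
static int fourway_matches_scalar(void)
{
    uint32_t s[4][16], ref[4][16];
    int i, k;

    for (k = 0; k < 4; k++)
        for (i = 0; i < 16; i++)
            s[k][i] = 0x9E3779B9u * (uint32_t)(16 * k + i + 1);
    memcpy(ref, s, sizeof(s));

    for (k = 0; k < 4; k++)
        double_round_scalar(ref[k]);
    double_round_4way(s);

    return memcmp(s, ref, sizeof(s)) == 0;
}
```

Calling `assert(fourway_matches_scalar())` from a test program checks the vector path against the scalar one. No shuffles are needed in this layout, but the transposes and the register spills are the price, which is one reason a lower way count can come out ahead.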
Due to the mixing behavior of NeoScrypt (4 times a 4x4 matrix), it looks like 1-way would need the fewest memory moves for Salsa and ChaCha.
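As a rough illustration of the 1-way idea, one 4x4 state fits in four XMM registers (one row each), and the diagonal step becomes three in-register shuffles with no memory traffic inside the round loop. This is only a sketch using the standard ChaCha double round with SSE2 intrinsics, not the routines from this pull request; `chacha_rounds_rows`, `chacha_rounds_ref` and `rows_match_scalar` are hypothetical names, and the real code may load or order the words differently.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define VROTL(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* Quarter round on four columns at once; a, b, c, d are the matrix rows. */
#define ROW_QR(a, b, c, d)                                                   \
    do {                                                                     \
        a = _mm_add_epi32(a, b); d = _mm_xor_si128(d, a); d = VROTL(d, 16);  \
        c = _mm_add_epi32(c, d); b = _mm_xor_si128(b, c); b = VROTL(b, 12);  \
        a = _mm_add_epi32(a, b); d = _mm_xor_si128(d, a); d = VROTL(d, 8);   \
        c = _mm_add_epi32(c, d); b = _mm_xor_si128(b, c); b = VROTL(b, 7);   \
    } while (0)

/* Scalar reference: standard ChaCha quarter round. */
static void quarter_round_ref(uint32_t *x, int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 7);
}

static void chacha_rounds_ref(uint32_t x[16], int rounds)
{
    int i;
    for (i = 0; i < rounds; i += 2) {
        quarter_round_ref(x, 0, 4, 8, 12);  quarter_round_ref(x, 1, 5, 9, 13);
        quarter_round_ref(x, 2, 6, 10, 14); quarter_round_ref(x, 3, 7, 11, 15);
        quarter_round_ref(x, 0, 5, 10, 15); quarter_round_ref(x, 1, 6, 11, 12);
        quarter_round_ref(x, 2, 7, 8, 13);  quarter_round_ref(x, 3, 4, 9, 14);
    }
}

/* 1-way rounds: the whole state stays in four XMM registers. */
static void chacha_rounds_rows(uint32_t x[16], int rounds)
{
    __m128i a, b, c, d;
    int i;

    assert(rounds > 0 && rounds % 2 == 0); /* processed as double rounds */

    a = _mm_loadu_si128((const __m128i *)&x[0]);
    b = _mm_loadu_si128((const __m128i *)&x[4]);
    c = _mm_loadu_si128((const __m128i *)&x[8]);
    d = _mm_loadu_si128((const __m128i *)&x[12]);

    for (i = 0; i < rounds; i += 2) {
        ROW_QR(a, b, c, d); /* column round */
        /* Rotate rows so the diagonals line up as columns. */
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(0, 3, 2, 1));
        c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
        d = _mm_shuffle_epi32(d, _MM_SHUFFLE(2, 1, 0, 3));
        ROW_QR(a, b, c, d); /* diagonal round */
        /* Rotate the rows back. */
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 1, 0, 3));
        c = _mm_shuffle_epi32(c, _MM_SHUFFLE(1, 0, 3, 2));
        d = _mm_shuffle_epi32(d, _MM_SHUFFLE(0, 3, 2, 1));
    }

    _mm_storeu_si128((__m128i *)&x[0], a);
    _mm_storeu_si128((__m128i *)&x[4], b);
    _mm_storeu_si128((__m128i *)&x[8], c);
    _mm_storeu_si128((__m128i *)&x[12], d);
}

/* Returns 1 when the row-vector result equals the scalar one, else 0. */
static int rows_match_scalar(void)
{
    uint32_t x[16], ref[16];
    int i;

    for (i = 0; i < 16; i++)
        x[i] = 0x9E3779B9u * (uint32_t)(i + 1);
    memcpy(ref, x, sizeof(x));

    chacha_rounds_ref(ref, 20);
    chacha_rounds_rows(x, 20);
    return memcmp(x, ref, sizeof(x)) == 0;
}
```

The only loads and stores sit outside the loop, and the working set is four registers plus temporaries, which matches the observation that 1-way keeps memory moves to a minimum.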
Unfortunately my development environment doesn't have AVX2, but the inline assembly code could easily be rewritten to support the 256-bit YMM registers.
John Doering wrote on 2/17/2015 at 5:14 PM:
Thanks, I plan to add the AVX/XOP assembly code in the future and may use your inline assembly as a reference. SSE2 4-way is also going to be improved.