Hamish Ivey-Law

Results 111 issues of Hamish Ivey-Law

See https://nvlabs.github.io/xmp/

Even when the RHS is known at compile time, `div` and `mod` are slower with a signed RHS, apparently because of the need to manage sign extension (double-check this is actually the reason)...
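One well-understood cost of signed division, which may or may not be the whole story here, is that C++ signed division truncates toward zero, so even division by a compile-time power of two needs a fix-up for negative inputs. The sketch below models this for a divisor of 8; the function names `udiv8` and `sdiv8` are hypothetical and only illustrate the extra work, they are not taken from the code base.

```cpp
#include <cstdint>

// Unsigned division by 8 is a single logical right shift.
constexpr std::uint32_t udiv8(std::uint32_t x) {
    return x >> 3;
}

// Signed division by 8 must round toward zero, so negative inputs
// need a bias of 7 added before the shift.
// Assumption: right shift of a negative value is arithmetic
// (guaranteed from C++20, true on mainstream compilers before that).
constexpr std::int32_t sdiv8(std::int32_t x) {
    std::int32_t bias = (x >> 31) & 7;  // 7 if x is negative, otherwise 0
    return (x + bias) >> 3;
}
```

With these definitions `sdiv8(x)` agrees with `x / 8` for every `std::int32_t`, at the price of three extra operations compared with `udiv8`.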

I have lost literally weeks of productive time chasing bugs around this code base. Whenever this happens, I should document what the bug was and how it was resolved.

- Use cuda-memcheck of course.
- Use Google's [libasan](https://github.com/google/sanitizers/wiki/AddressSanitizer) for address sanitisation.
  - `-fsanitize=address`
- More ideas from Brandy's CppCon 2017 talk "C++ bugs"

Useful in general obviously, but also for performance regressions. Some relevant links:

- https://danluu.com/perf-tracing/
- https://github.com/RRZE-HPC/likwid
- https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
- https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle?rq=1
- https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record
- https://danluu.com/assembly-intrinsics/
- https://danluu.com/new-cpu-features/

Most things at https://github.com/calccrypto/ are relevant; especially the [uint256_t library](https://github.com/calccrypto/uint256_t).

From https://github.com/data61/cuda-fixnum/issues/27:

> Potentially useful instructions include
>
> - min and max without branching
> - sum of absolute differences: `sad.u32`
> - [funnel shift](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
>
> Note that...
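To make the semantics of those instructions concrete, here is a portable C++ model of what each one computes. This is only a reference sketch: the function names are hypothetical, the funnel shift models the left-shift "wrap" variant (shift amount taken modulo 32), and on a GPU each function would map to a single instruction rather than the arithmetic spelled out here.

```cpp
#include <cstdint>

// Branch-free minimum: build an all-ones mask when a < b and use it
// to select between the two inputs.
constexpr std::uint32_t min_u32(std::uint32_t a, std::uint32_t b) {
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    return b ^ ((a ^ b) & mask);
}

// Branch-free maximum, using the same selection trick.
constexpr std::uint32_t max_u32(std::uint32_t a, std::uint32_t b) {
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    return a ^ ((a ^ b) & mask);
}

// Sum of absolute differences: c + |a - b|. The ternary is written for
// clarity; compilers typically lower it to a select, not a branch.
constexpr std::uint32_t sad_u32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
    return c + (a > b ? a - b : b - a);
}

// Left funnel shift, wrap mode: concatenate hi:lo into 64 bits, shift
// left by (n mod 32), and keep the upper 32 bits.
constexpr std::uint32_t funnel_shift_left(std::uint32_t lo, std::uint32_t hi, std::uint32_t n) {
    std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((wide << (n & 31u)) >> 32);
}
```

The funnel shift is the natural building block for shifting a multi-word number: each output word takes its high bits from one input word and its low bits from the neighbouring one.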

_I.e._ the FFT over finite fields.

- Consider the work done in cuFHE [here](https://github.com/vernamlab/cuFHE/tree/master/cufhe/include/ntt_gpu).
- Relevant article: http://www.csd.uwo.ca/~moreno/Publications/Moreno-Maza.Pan.HPCS-2010.pdf
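For orientation, a minimal CPU sketch of a radix-2 number-theoretic transform follows. It is not the cuFHE implementation or the algorithm from the article; the modulus 998244353 with primitive root 3 is an assumed, commonly used NTT-friendly choice, and `pow_mod`, `ntt` and `multiply` are hypothetical names. Inputs are assumed to be non-empty and already reduced modulo the prime.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed parameters: p = 119 * 2^23 + 1 is prime and 3 generates its
// multiplicative group, so transforms of length up to 2^23 exist.
constexpr std::uint64_t kMod = 998244353;
constexpr std::uint64_t kRoot = 3;

// Modular exponentiation by repeated squaring.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp) {
    std::uint64_t result = 1;
    base %= kMod;
    while (exp > 0) {
        if (exp & 1) result = result * base % kMod;
        base = base * base % kMod;
        exp >>= 1;
    }
    return result;
}

// In-place iterative NTT; a.size() must be a power of two.
void ntt(std::vector<std::uint64_t>& a, bool inverse) {
    const std::size_t n = a.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly passes over blocks of length 2, 4, ..., n.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        std::uint64_t w_len = pow_mod(kRoot, (kMod - 1) / len);
        if (inverse) w_len = pow_mod(w_len, kMod - 2);  // inverse root
        for (std::size_t i = 0; i < n; i += len) {
            std::uint64_t w = 1;
            for (std::size_t k = 0; k < len / 2; ++k) {
                std::uint64_t u = a[i + k];
                std::uint64_t v = a[i + k + len / 2] * w % kMod;
                a[i + k] = (u + v) % kMod;
                a[i + k + len / 2] = (u + kMod - v) % kMod;
                w = w * w_len % kMod;
            }
        }
    }
    if (inverse) {
        // Scale by n^{-1} mod p to undo the forward transform.
        std::uint64_t n_inv = pow_mod(n % kMod, kMod - 2);
        for (auto& x : a) x = x * n_inv % kMod;
    }
}

// Polynomial product modulo kMod via forward NTT, pointwise
// multiplication, and inverse NTT.
std::vector<std::uint64_t> multiply(std::vector<std::uint64_t> f,
                                    std::vector<std::uint64_t> g) {
    const std::size_t need = f.size() + g.size() - 1;
    std::size_t n = 1;
    while (n < need) n <<= 1;
    f.resize(n);
    g.resize(n);
    ntt(f, false);
    ntt(g, false);
    for (std::size_t i = 0; i < n; ++i) f[i] = f[i] * g[i] % kMod;
    ntt(f, true);
    f.resize(need);
    return f;
}
```

For example, `multiply` applied to the coefficient vectors of 1 + 2x and 3 + 4x gives the coefficients of 3 + 10x + 8x². A GPU version would distribute the butterflies in each pass across threads, which is where the cuFHE code linked above becomes relevant.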

See p12 at https://web.maths.unsw.edu.au/~davidharvey/talks/fastntt-2-talk.pdf

See p17 of https://gmplib.org/~tege/speed.pdf