Hamish Ivey-Law issues

Results: 111 issues of Hamish Ivey-Law

Benchmark against nVidia's XMP library

See https://nvlabs.github.io/xmp/

Use only unsigned values for RHS of compile-time div and mod

Even when the RHS is known at compile time, the need to manage sign extension issues (double-check this is actually the reason) makes div and mod slower with signed RHS...
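As one plausible illustration of the cost difference (the issue itself flags the cause as unverified), the sketch below compares division by the compile-time constant 8 for unsigned and signed operands. The function names `udiv8` and `sdiv8` are hypothetical and not part of the library. The unsigned case is a single shift; the signed case needs an extra bias step because C++ division rounds toward zero while an arithmetic right shift rounds toward negative infinity.

```cpp
#include <cstdint>

// Unsigned x / 8: a single logical right shift gives the exact quotient.
constexpr std::uint32_t udiv8(std::uint32_t x) {
    return x >> 3;
}

// Signed x / 8: C++ division truncates toward zero, but an arithmetic
// right shift rounds toward negative infinity. Adding a bias of 7 to
// negative inputs before shifting corrects the rounding.
// Assumption: >> on a negative int32_t is an arithmetic shift
// (guaranteed since C++20, and true on mainstream compilers before that).
constexpr std::int32_t sdiv8(std::int32_t x) {
    const std::int32_t bias = (x >> 31) & 7;  // 7 if x < 0, else 0
    return (x + bias) >> 3;
}
```

The signed version needs three extra operations (shift, mask, add) just to fix the rounding. A similar correction applies to signed `mod`, and for non-power-of-two divisors the multiply-and-shift sequences compilers emit also tend to carry extra sign fix-up steps.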

Start wiki page for bug post-mortem analyses

I have lost literally weeks of productive time chasing bugs around this code base. I should document what happened and how they were resolved when this occurs.

Automate memcheck/address sanitisation checks

- Use cuda-memcheck of course.
- Use Google's [libasan](https://github.com/google/sanitizers/wiki/AddressSanitizer) for address sanitisation.
- `-fsanitize=address`
- More ideas from Brandy's CppCon 2017 talk "C++ bugs"
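For reference, here is a minimal sketch of the class of host-side bug that `-fsanitize=address` is meant to catch. The helper `sum_first` is hypothetical and not taken from the code base. It reads through a raw pointer without a bounds check, so passing an `n` larger than the vector's size reads past the end of the heap buffer.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the first n elements through a raw pointer.
// There is deliberately no bounds check. If n > v.size(), the loop reads
// past the heap allocation, which AddressSanitizer would report as a
// heap-buffer-overflow at the offending load.
int sum_first(const std::vector<int>& v, std::size_t n) {
    const int* p = v.data();
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

Built with GCC or Clang using `-fsanitize=address -g`, a call such as `sum_first(v, v.size() + 1)` should abort with a report pointing at the load inside the loop. Without the sanitiser, the same call may silently return garbage. An automated check could run the existing test suite under such a build.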

Implement systematic profiling

Useful in general obviously, but also for performance regressions. Some relevant links:

- https://danluu.com/perf-tracing/
- https://github.com/RRZE-HPC/likwid
- https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
- https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle?rq=1
- https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record
- https://danluu.com/assembly-intrinsics/
- https://danluu.com/new-cpu-features/

Consider what can be learned from the CalcCrypto library

Most things at https://github.com/calccrypto/ are relevant; especially the [uint256_t library](https://github.com/calccrypto/uint256_t).

Investigate use of other PTX instructions in arithmetic implementations

From https://github.com/data61/cuda-fixnum/issues/27:

> Potentially useful instructions include
>
> - min and max without branching
> - sum of absolute differences: `sad.u32`
> - [funnel shift](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
>
> Note that...
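To make the semantics of the listed instructions concrete, the following is a host-side reference model in plain C++. It is a sketch of what each operation computes, not the library's implementation and not device code. The function names are hypothetical. In CUDA the corresponding operations are normally reached through the `min`/`max` functions and the `__usad` and `__funnelshift_l` intrinsics.

```cpp
#include <cstdint>

// Branch-free minimum: build an all-ones mask when a < b, then use it
// to select a; otherwise the mask is zero and b is returned.
constexpr std::uint32_t min_branchless(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    return b ^ ((a ^ b) & mask);
}

// Model of sad.u32: |a - b| + c, with the usual 32-bit wraparound.
constexpr std::uint32_t sad_u32(std::uint32_t a, std::uint32_t b,
                                std::uint32_t c) {
    return (a > b ? a - b : b - a) + c;
}

// Model of a left funnel shift in wrap mode: concatenate hi:lo into a
// 64-bit value, shift left by (shift mod 32), and keep the upper 32 bits.
constexpr std::uint32_t funnel_shift_left(std::uint32_t lo, std::uint32_t hi,
                                          std::uint32_t shift) {
    const std::uint64_t cat = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((cat << (shift & 31u)) >> 32);
}
```

The funnel shift is the natural fit for multi-word shifts in fixed-width arithmetic: shifting a multi-limb number left by `s` bits sets each output limb to `funnel_shift_left(lower_limb, this_limb, s)`, replacing the usual pair of shifts and an OR.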
