less_slow.cpp
Playing around with "Less Slow" coding practices in C++ 20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
The loop in *f32_pairwise_accumulation* runs `f32s_in_cache_line_half_k` * 2 times, while the other one runs only `f32s_in_cache_line_half_k` times.
The current minimalistic RPC implementation using `io_uring` avoids certain more advanced features:
- `IORING_REGISTER_BUFFERS` - since 5.1
- `IORING_RECV_MULTISHOT` or `io_uring_prep_recvmsg_multishot` - since 6.0
- `IORING_OP_SEND_ZC` or `io_uring_prep_sendmsg_zc` - since...
We currently have a pretty extensive list of graph storage and processing approaches on the CPU, but lack GPU analogs. This can be a great opportunity to showcase both the...
I'm not a massive fan of ASIO, Boost.ASIO and the NetworkingTS that builds on top of them. I'm also not a great user. My current implementation on the `asio-uring-web-server` branch...
CUDA natively supports Fused-Multiply-Accumulate operations for every float type, including `f16` and `bf16`. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulators and `umul24` instructions for 24-bit...
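For readers unfamiliar with DP4A, a host-side scalar reference may help pin down the semantics: two 32-bit words are treated as four signed 8-bit lanes each, the lanes are multiplied pairwise, and the four products are added to a 32-bit accumulator. The sketch below is plain C++ rather than CUDA, and the names `dp4a_reference` and `dot_i8_packed` are hypothetical helpers, not functions from the repository. It only models the signed variant and makes no claims about performance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a signed DP4A step:
// each 32-bit word holds four int8 lanes, lane 0 in the lowest byte.
inline std::int32_t dp4a_reference(std::uint32_t a, std::uint32_t b, std::int32_t c) noexcept {
    for (int lane = 0; lane < 4; ++lane) {
        // Extract one byte and reinterpret it as a signed 8-bit value
        auto a_lane = static_cast<std::int8_t>((a >> (lane * 8)) & 0xFFu);
        auto b_lane = static_cast<std::int8_t>((b >> (lane * 8)) & 0xFFu);
        // Widen to 32 bits before multiplying, so the product cannot overflow
        c += static_cast<std::int32_t>(a_lane) * static_cast<std::int32_t>(b_lane);
    }
    return c;
}

// Hypothetical helper: dot product of two int8 vectors packed four per word.
inline std::int32_t dot_i8_packed(std::uint32_t const *a, std::uint32_t const *b, std::size_t words) noexcept {
    assert(words == 0 || (a != nullptr && b != nullptr));
    std::int32_t sum = 0;
    for (std::size_t i = 0; i != words; ++i) sum = dp4a_reference(a[i], b[i], sum);
    return sum;
}
```

A reference like this could serve as a baseline for checking a GPU kernel's output on small inputs before measuring throughput.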
It would be great to have working examples for "encrypted enclaves" and other "secure computing" technologies, like Intel SGX, AMD SEV, and ARM Realm. Sadly, I couldn't get them to...
Most C++ developers have heard of `std::mutex` and `std::shared_mutex`, and some may have implemented their oversimplified versions using `std::atomic`. However, there have been few attempts to design a scalable shared...
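To make "oversimplified" concrete, here is a minimal sketch of a spinning reader-writer lock built on a single `std::atomic` counter. The class `toy_shared_mutex` and the helper `count_with_writers` are hypothetical names used for illustration only and are not taken from the repository. The design is deliberately naive: all threads contend on one cache line, waiting is done by yielding, and a steady stream of readers can starve writers, which is the kind of scalability problem a better design would have to address.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical toy lock: 0 means free, -1 means one writer, N > 0 means N readers.
class toy_shared_mutex {
    std::atomic<std::int32_t> state_ {0};
    static constexpr std::int32_t writer_k = -1;

  public:
    bool try_lock() noexcept {
        std::int32_t expected = 0; // A writer can only enter when nobody holds the lock
        return state_.compare_exchange_strong(expected, writer_k, std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }
    void lock() noexcept {
        while (!try_lock()) std::this_thread::yield();
    }
    void unlock() noexcept { state_.store(0, std::memory_order_release); }

    bool try_lock_shared() noexcept {
        std::int32_t current = state_.load(std::memory_order_relaxed);
        // Retry while no writer is present; a failed CAS refreshes `current`
        while (current != writer_k)
            if (state_.compare_exchange_weak(current, current + 1, std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return true;
        return false;
    }
    void lock_shared() noexcept {
        while (!try_lock_shared()) std::this_thread::yield();
    }
    void unlock_shared() noexcept {
        std::int32_t previous = state_.fetch_sub(1, std::memory_order_release);
        assert(previous > 0 && "unlock_shared called without a matching lock_shared");
        (void)previous;
    }
};

// Hypothetical helper: several threads increment one counter under the exclusive lock.
inline std::size_t count_with_writers(std::size_t threads, std::size_t increments_per_thread) {
    toy_shared_mutex mutex;
    std::size_t counter = 0;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t != threads; ++t)
        workers.emplace_back([&] {
            for (std::size_t i = 0; i != increments_per_thread; ++i) {
                std::lock_guard<toy_shared_mutex> guard(mutex);
                ++counter;
            }
        });
    for (auto &worker : workers) worker.join();
    return counter;
}
```

Because the class exposes `lock`, `unlock`, `lock_shared`, and `unlock_shared`, it should also work with `std::unique_lock` and `std::shared_lock`, which makes it easy to swap in for `std::shared_mutex` in a benchmark.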
LibUnifex is likely the best preview of what programming many-core CPUs in C++ may look like in a few years. However, my first attempts to apply it to toy problems...
1. Run `sorting_with_executors` benchmark using the `std::execution::par_unseq` policy.
2. Memory consumption quickly exceeds my machine's availability of 60GB after the second test variant finishes: `sorting_with_executors/par_unseq/4194304/`

Observations:
1. Memory does not...