less_slow.cpp

Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking, and user-space IO

Results 13 less_slow.cpp issues

The loop in *f32_pairwise_accumulation* runs `f32s_in_cache_line_half_k` * 2 times, while the other one runs only `f32s_in_cache_line_half_k` times. ![image](https://github.com/user-attachments/assets/ad7a0374-48a2-409d-aaf8-5549c664bc48)

The current minimalistic RPC implementation using `io_uring` avoids certain more advanced features:

- `IORING_REGISTER_BUFFERS` - since 5.1
- `IORING_RECV_MULTISHOT` or `io_uring_prep_recvmsg_multishot` - since 6.0
- `IORING_OP_SEND_ZC` or `io_uring_prep_sendmsg_zc` - since...

enhancement
help wanted
good first issue

We currently have a pretty extensive list of graph storage and processing approaches on the CPU, but lack GPU analogs. This can be a great opportunity to showcase both the...

enhancement
help wanted
good first issue

I'm not a massive fan of ASIO, Boost.ASIO and the NetworkingTS that builds on top of them. I'm also not a great user. My current implementation on the `asio-uring-web-server` branch...

bug
help wanted
good first issue

CUDA natively supports Fused-Multiply-Accumulate operations for every float type, including `f16` and `bf16`. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulators and `umul24` instructions for 24-bit...
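To make the DP4A semantics concrete, here is a host-side C++ sketch of what such an instruction computes: a dot product of four signed 8-bit lanes packed into each 32-bit word, added to a 32-bit accumulator. The names `dp4a_emulated` and `pack_i8x4` are hypothetical helpers invented for this illustration and are not part of CUDA or of this repository; on the device, the equivalent operation is typically reached through the `__dp4a` intrinsic.

```cpp
#include <cstdint>

// Hypothetical helper: packs four signed 8-bit values into one 32-bit word,
// with `x0` in the lowest byte.
inline std::uint32_t pack_i8x4(std::int8_t x0, std::int8_t x1, std::int8_t x2, std::int8_t x3) {
    return (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x0))) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x1)) << 8) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x2)) << 16) |
           (static_cast<std::uint32_t>(static_cast<std::uint8_t>(x3)) << 24);
}

// Hypothetical scalar emulation of a signed DP4A:
// multiplies matching 8-bit lanes and adds all four products to a 32-bit accumulator.
inline std::int32_t dp4a_emulated(std::uint32_t a, std::uint32_t b, std::int32_t accumulator) {
    for (int lane = 0; lane < 4; ++lane) {
        // Extract one byte and reinterpret it as a signed 8-bit integer.
        auto a_lane = static_cast<std::int8_t>((a >> (lane * 8)) & 0xFFu);
        auto b_lane = static_cast<std::int8_t>((b >> (lane * 8)) & 0xFFu);
        accumulator += static_cast<std::int32_t>(a_lane) * static_cast<std::int32_t>(b_lane);
    }
    return accumulator;
}
```

For example, `dp4a_emulated(pack_i8x4(1, 2, 3, 4), pack_i8x4(5, -6, 7, 8), 10)` evaluates `10 + 1*5 + 2*(-6) + 3*7 + 4*8`. The hardware instruction does the same work in a single operation, which is why it is attractive for quantized workloads.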

It would be great to have working examples for "encrypted enclaves" and other "secure computing" technologies, like Intel SGX, AMD SEV, and ARM Realm. Sadly, I couldn't get them to...

enhancement
help wanted
good first issue

Most C++ developers have heard of `std::mutex` and `std::shared_mutex`, and some may have implemented their oversimplified versions using `std::atomic`. However, there have been few attempts to design a scalable shared...

enhancement
help wanted
good first issue
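For context on the shared-mutex issue above, an oversimplified `std::atomic`-based version might look like the sketch below. The class name `naive_shared_mutex` and its layout are assumptions made for illustration, not code from this repository. It keeps a reader count and a writer flag in one atomic word, so every reader and writer contends on the same cache line and writers can starve, which is roughly the scalability problem a better design would have to address.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical, deliberately naive reader-writer spin lock.
// The top bit marks an active writer; the lower bits count active readers.
class naive_shared_mutex {
    static constexpr std::uint32_t writer_bit_k = 1u << 31;
    std::atomic<std::uint32_t> state_ {0};

  public:
    // Readers may enter as long as no writer holds the lock.
    bool try_lock_shared() noexcept {
        std::uint32_t observed = state_.load(std::memory_order_relaxed);
        while (!(observed & writer_bit_k))
            if (state_.compare_exchange_weak(observed, observed + 1, //
                                             std::memory_order_acquire, std::memory_order_relaxed))
                return true;
        return false;
    }
    void lock_shared() noexcept {
        while (!try_lock_shared()) std::this_thread::yield();
    }
    void unlock_shared() noexcept { state_.fetch_sub(1, std::memory_order_release); }

    // A writer may enter only when there are no readers and no other writer.
    bool try_lock() noexcept {
        std::uint32_t expected = 0;
        return state_.compare_exchange_strong(expected, writer_bit_k, //
                                              std::memory_order_acquire, std::memory_order_relaxed);
    }
    void lock() noexcept {
        while (!try_lock()) std::this_thread::yield();
    }
    void unlock() noexcept { state_.store(0, std::memory_order_release); }
};
```

Because it exposes `lock`, `try_lock`, `unlock` and their `_shared` counterparts, the class should work with `std::unique_lock` and `std::shared_lock`, but it is a teaching sketch rather than a replacement for `std::shared_mutex`.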

LibUnifex is likely the best preview of what programming many-core CPUs in C++ may look like in a few years. However, my first attempts to apply it to toy problems...

bug
help wanted
good first issue

1. Run the `sorting_with_executors` benchmark using the `std::execution::par_unseq` policy.
2. Memory consumption quickly exceeds the 60 GB available on my machine after the second test variant finishes: `sorting_with_executors/par_unseq/4194304/`

Observations:

1. Memory does not...

bug