21 comments by tzcnt

Is there any reason you implemented `i32x8::move_mask` but not `i8x32::move_mask`? As you said, `i8x32` is the type for which the instruction exists.
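A move mask gathers the most significant (sign) bit of every lane into an integer bitmask. The scalar sketch below only illustrates those semantics for both lane layouts. The function names are hypothetical. The intrinsics named in the comments assume an x86 AVX/AVX2 target: `_mm256_movemask_epi8` covers 32 8-bit lanes, and for 8 32-bit lanes one common approach is `_mm256_movemask_ps` on a reinterpreted vector.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar reference: bit i of the result is the sign bit of lane i.
// On x86 with AVX2, this is what _mm256_movemask_epi8 (vpmovmskb) computes
// for a vector of 32 signed 8-bit lanes.
std::uint32_t move_mask_i8x32(const std::array<std::int8_t, 32>& v) {
  std::uint32_t mask = 0;
  for (int i = 0; i < 32; ++i) {
    if (v[i] < 0) mask |= std::uint32_t{1} << i;  // sign bit set => lane is negative
  }
  return mask;
}

// For 8 signed 32-bit lanes, a common AVX approach is to reinterpret the
// vector as 8 floats and use _mm256_movemask_ps (vmovmskps), which also
// reads each lane's sign bit. Only the low 8 bits of the result are used.
std::uint32_t move_mask_i32x8(const std::array<std::int32_t, 8>& v) {
  std::uint32_t mask = 0;
  for (int i = 0; i < 8; ++i) {
    if (v[i] < 0) mask |= std::uint32_t{1} << i;
  }
  return mask;
}
```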

Thanks for your interest. v0.1 will be released as soon as I finish the documentation and possibly add a few more async primitives, such as heterogeneous multi-await, which would enable...

I tried to do this a few different ways. However, there are some constraints:

- The only way to communicate information about a coroutine to its `operator new` class method...

I think allocation performance accounts for a large part of the difference between TMC and LF. IIRC it can be 15% of the workload in synthetic benchmarks for me (even...

libfork also benefits from being statically linked into the application, whereas tcmalloc is typically linked dynamically. I was able to shave 2-5% off the benchmark runtime by statically linking tcmalloc.

I ran into an issue in my testing which could also occur with a hypothetical template-allocator task. Consider the following pseudocode example:

```cpp
template <typename Allocator>
tmc::task child_task(Allocator& a) {
  // operator new...
```

The way most libraries handle this is by injecting a parameter into the coroutine argument list (which is visible to `task::operator new`). I've so far been against this, in the...

A scoped bump allocator tied to a single `spawn_many()` group would be useful. An even more powerful option would be a full "stackful" allocator. For example, the `fib(40)` benchmark only...
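A minimal sketch of a scoped bump allocator, assuming one fixed-capacity buffer per group of tasks: allocation just advances an offset, individual frees are no-ops, and `reset()` reclaims everything when the group finishes. The `mark()`/`release_to()` pair gives a stack-like (LIFO) variant closer to the "stackful" idea. `BumpArena` and its methods are hypothetical and not part of any library mentioned here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical arena: pointer-bump allocation, bulk release at scope end.
class BumpArena {
 public:
  explicit BumpArena(std::size_t capacity)
      : buffer_(std::make_unique<std::byte[]>(capacity)), capacity_(capacity) {}

  void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
    assert(align != 0 && (align & (align - 1)) == 0 && "alignment must be a power of two");
    auto base = reinterpret_cast<std::uintptr_t>(buffer_.get());
    std::uintptr_t cur = base + offset_;
    // Round the current position up to the requested alignment.
    std::uintptr_t aligned = (cur + align - 1) & ~(std::uintptr_t(align) - 1);
    std::size_t new_offset = static_cast<std::size_t>(aligned - base) + size;
    if (new_offset > capacity_) throw std::bad_alloc{};
    offset_ = new_offset;
    return reinterpret_cast<void*>(aligned);
  }

  // Individual frees are no-ops; memory is reclaimed by reset() or release_to().
  void deallocate(void*, std::size_t) noexcept {}

  // Stack-style use: remember a position, later roll back to it in LIFO order.
  std::size_t mark() const noexcept { return offset_; }
  void release_to(std::size_t m) noexcept { assert(m <= offset_); offset_ = m; }

  void reset() noexcept { offset_ = 0; }
  std::size_t used() const noexcept { return offset_; }

 private:
  std::unique_ptr<std::byte[]> buffer_;
  std::size_t capacity_;
  std::size_t offset_ = 0;
};
```

The trade-off is that memory freed out of order is not reused until the scope ends, which suits short-lived fork-join groups better than long-lived tasks.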

When running with something like `docker run --cpuset-cpus="0-2,4,9"`, which pins the application to specific cores, hwloc correctly detects 5 total cores across the 3 separate L3 caches...
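To make the core count concrete: the `--cpuset-cpus` value uses the Linux CPU-list format of comma-separated CPU ids and inclusive ranges, so `0-2,4,9` names CPUs 0, 1, 2, 4 and 9, i.e. 5 cores. The hypothetical `parse_cpu_list` helper below only expands that string; mapping those CPUs onto L3 caches is what hwloc does from the real topology. On Linux, the effective set is also visible at runtime through `sched_getaffinity(2)`.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: expands a Linux-style CPU list such as "0-2,4,9"
// into the set of CPU ids it names. Ranges are inclusive.
// Minimal validation only: std::stoi throws on non-numeric items.
std::set<int> parse_cpu_list(const std::string& list) {
  std::set<int> cpus;
  std::stringstream ss(list);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (item.empty()) continue;
    std::size_t dash = item.find('-');
    int lo = std::stoi(item.substr(0, dash));
    int hi = (dash == std::string::npos) ? lo : std::stoi(item.substr(dash + 1));
    if (hi < lo) throw std::invalid_argument("descending CPU range: " + item);
    for (int cpu = lo; cpu <= hi; ++cpu) cpus.insert(cpu);
  }
  return cpus;
}
```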

Per this article, https://vsoch.github.io/2023/resources-cgroups-kubernetes/, Kubernetes may allocate a fixed cpuset if certain requirements are met; this is probably worth noting in the documentation.