Carl Pearson

Results 135 comments of Carl Pearson

I dropped it because I interpreted it to mean that it just enabled some assertions and tests, but now I see that the little benchmarks are referred to as "tests"...

Or perhaps just for pageable allocations

## Executive summary

1. `std::lock_guard` does seem to cause a slowdown, though not always
2. Bad AVX-512 codegen in GCC 10 and 11 makes it way worse.

## Long Version...

I tried this, which basically performed the same as `std::lock_guard`/`std::mutex`:

```c++
class GCCSpinLock {
  int lock_var;

public:
  GCCSpinLock() : lock_var(0) {}
  void lock() {
    while (__sync_lock_test_and_set(&lock_var, 1)) {
      // Spin...
```
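The snippet above is cut off, so here is a minimal, self-contained sketch of the same idea for readers who want something compilable. It assumes GCC or Clang (for the `__sync_*` builtins). The names `SpinLockSketch` and `guarded_increment`, and the `unlock()` and `try_lock()` bodies, are assumptions added for illustration and are not taken from the original comment.

```c++
#include <mutex>  // std::lock_guard

// Hypothetical spin lock built on the GCC/Clang __sync builtins.
// Only lock() mirrors the truncated snippet; the rest is assumed.
class SpinLockSketch {
  int lock_var = 0;

public:
  void lock() {
    // Atomically write 1 and get the previous value back;
    // a previous value of 1 means another owner holds the lock.
    while (__sync_lock_test_and_set(&lock_var, 1)) {
      // Spin until the previous value was 0.
    }
  }

  bool try_lock() {
    // Acquired only if the previous value was 0.
    return __sync_lock_test_and_set(&lock_var, 1) == 0;
  }

  void unlock() {
    // Writes 0 with release semantics.
    __sync_lock_release(&lock_var);
  }
};

// Any type with lock()/unlock() works with std::lock_guard,
// so the spin lock can be swapped in where std::mutex was used.
inline void guarded_increment(SpinLockSketch& lock, int& counter) {
  std::lock_guard<SpinLockSketch> guard(lock);
  ++counter;
}
```

Because `std::lock_guard` only requires `lock()` and `unlock()`, a comparison like the one described can be made by changing the mutex type alone and leaving the guarded code untouched.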

What you suggested performs about the same as the `GCCSpinLock` I posted above (and `std::lock_guard`/`std::mutex`)

Some usage of perf has not added much insight (GCC 10.2.0). Sure enough, Kokkos 4.4 takes 4G more cycles to complete the same number of instructions, but none of the...

Valgrind isn't working for me (latest release, Valgrind 3.23.0, GCC 14.2.0) for Kokkos 4.3 or 4.4, compiled with `-march=native -mtune=native`.

```
4.4-patched
==928== Callgrind, a call-graph generating cache profiler
==928==...
```

*edit*: this is because valgrind only supports up through AVX2.

If I do a release build of Kokkos 4.3 without the `native` flags, valgrind runs, but reports an invalid read...

I think there's a bug in the reproducer. Note `i` goes from [`0`...`_iend`):

```c++
for (int i = 0; i < _iend; ++i) {
  // ...
  _output(cl, bf, pt, i)...
```

> My best bet would be that we are missing out on some compiler optimizations due to the lock and that it's not the lock itself that makes the difference....