Tobias Ribizel
We have so far used only 24% of our CPU time and 80% of our GPU time - CPU time is not our limiting factor here, so it doesn't hurt...
I don't think these numbers provide any useful information - they are for shared builds on a login node with something like 8 threads, while a full node has >...
Putting some numbers behind it: building Release with CUDA 11.8 + GCC 11 on Horeka takes ~400 s, and running all tests with ctest resources also takes ~400-430 s. (debug build...
Thanks for the report! Most of the failures are just our test tolerances being a bit too tight. The Coo test failure looks concerning though, and the MPI tests look...
> Any ideas why it may fail?

Probably atomics, since this is to my knowledge the only kernel where we use float atomics in OpenMP (handling overlaps between threads). This...
Are those atomics also supported by the OpenMP implementation? We use `#pragma omp atomic` in an attempt to be portable, and only require 32 bit and 64 bit atomics.
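For anyone who wants to check their OpenMP implementation in isolation, here is a minimal sketch of the pattern in question: several loop iterations add into the same output entry, and each update is guarded by `#pragma omp atomic`. This is not the actual Coo kernel; `scatter_add` and its parameters are hypothetical names made up for illustration. It only needs 32 bit (`float`) and 64 bit (`double`) atomic updates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Ginkgo code): adds vals[i] into out[rows[i]].
// Several entries may target the same row, so every update is guarded by
// `#pragma omp atomic`. If OpenMP is not enabled at compile time, the
// pragmas are ignored and the loop simply runs serially.
template <typename ValueType>
void scatter_add(const std::vector<std::size_t>& rows,
                 const std::vector<ValueType>& vals,
                 std::vector<ValueType>& out)
{
    assert(rows.size() == vals.size());
    const auto n = static_cast<long long>(rows.size());
    const std::size_t* row_data = rows.data();
    const ValueType* val_data = vals.data();
    // Raw pointer so the atomic update acts on a plain scalar lvalue.
    ValueType* out_data = out.data();
#pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
#pragma omp atomic
        out_data[row_data[i]] += val_data[i];
    }
}
```

Instantiating this with `float` and `double` and comparing against a serial sum (with a tolerance, since the summation order is not fixed) should show quickly whether both atomic widths work on a given platform.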
Endianness only matters for the `extended_float` tests, where we assume little endian. Should be an easy fix, though. We rely on CMake's FindOpenMP module adding the right flags, so if...
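On the endianness point, a rough sketch of how a test could avoid depending on byte order: instead of reading bytes at fixed offsets, copy the value into an integer and use shifts, which operate on the numeric value rather than the memory layout. The function names below are hypothetical and not part of the existing tests, and the sketch assumes IEEE 754 `float` with the same byte order as `std::uint32_t`.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32 bit float");

// Hypothetical helper: true if the least significant byte of an integer
// is stored at the lowest address.
inline bool is_little_endian()
{
    const std::uint32_t probe = 1u;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Hypothetical helper: upper 16 bits (sign, exponent, top of the mantissa)
// of a float. The shift works on the integer value, so the result does not
// depend on how the bytes are laid out in memory.
inline std::uint16_t high_half_bits(float value)
{
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}
```

With this, `high_half_bits(1.0f)` gives `0x3F80` on both byte orders, while `is_little_endian()` could be used to skip or adapt any check that really does need a specific memory layout.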
Suggestion to make writing such functionality and passing communication data easier: We should add a `distributed::RowGatherer` (or RowGather? Gatherer is a handful) LinOp that does what `communicate` does inside `distributed::Matrix`....
`glu_experimental` will hopefully be replaced by `develop` soon, so we probably won't be able to spend any time addressing this directly.
It could break existing code either way, since previously the assignment operators were implicitly declared. But we already changed the interface for things that are clearly misusing our interfaces (gko::share...