René Widera
I am currently implementing fast atomic operations for `AtomicOmpCritSec` and found that `atomicMin`, `atomicMax`, and compare-and-swap cannot be implemented to fulfill case 4 :-( without a for...
IMO case 4 is fulfilled for CUDA; I can't find any source that says otherwise. A test is not so easy because to test if case 4 is fulfilled...
If I implement `AtomicOmpCritSec` with `omp atomic capture` it will be faster than the new AtomicStlLock #398, but it only supports fundamental types. btw: CUDA also supports only a handful of types....
Not sure case 3 means that you can mix 64- and 32-bit types, because only the address counts.
Is there anything against the case 4 definition from your side? If not, I will implement the new OMP atomics as case 4, e.g.

```C++
T old;
auto & ref(*addr);
...
```
The CI is running for the dev branch after a merge because a PR can be based on an old development branch. If you merge two PRs in a row...
To reduce the load, I suggest testing fewer combinations per compiler for PRs and merges to development, and performing a full matrix test once per week. This should catch most issues...
Another way to reduce the CI load is to stage the tests: first test a handful of combinations and only on success run more tests. For example,...
For the record: PIConGPU uses a CI job generator to avoid that a combination is tested twice: https://github.com/ComputationalRadiationPhysics/picongpu/blob/dev/share/ci/n_wise_generator.py The generator is spaghetti Python but only runs valid combinations...
To reduce the CI job runtime, we should maybe run the header include check `headerCheck` in a separate job, and only once per compiler, e.g. gcc, clang, hipp,...