BabelStream
STREAM, for lots of devices written in many programming models
This is a new implementation of BabelStream using Fortran. The code uses a Fortran driver that is largely equivalent to the C++ one, with a few exceptions. First, it does...
I kept getting
```
terminate called after throwing an instance of 'std::runtime_error'
what(): Device does not have enough memory for all 3 buffers
```
when running with 1Gi elements on...
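For a sense of scale, the memory that three buffers need at this size can be worked out directly. The sketch below is only illustrative arithmetic: `stream_footprint_bytes` and `kGi` are hypothetical names, not part of BabelStream, and the sketch does not try to reproduce the check that raised the error above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a BabelStream function): bytes needed for
// `arrays` buffers holding `elements` items of `element_size` bytes each.
// 64-bit arithmetic keeps the product from wrapping at these sizes.
constexpr std::uint64_t stream_footprint_bytes(std::uint64_t elements,
                                               std::uint64_t element_size,
                                               std::uint64_t arrays = 3) {
  return arrays * elements * element_size;
}

// 1Gi = 2^30, used both as an element count and as the GiB unit.
constexpr std::uint64_t kGi = std::uint64_t{1} << 30;

// Three buffers of 1Gi elements: 24 GiB for 8-byte elements,
// 12 GiB for 4-byte elements.
static_assert(stream_footprint_bytes(kGi, 8) == 24 * kGi, "24 GiB expected");
static_assert(stream_footprint_bytes(kGi, 4) == 12 * kGi, "12 GiB expected");

// Small runtime check mirroring the compile-time ones above.
inline void check_footprint() {
  assert(stream_footprint_bytes(kGi, 8) / kGi == 24);
}
```

Comparing this figure with the device's reported memory shows whether a given array size can fit at all, before looking for other causes.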
Not a bug, just FYI, but on A100, increasing `DOT_NUM_BLOCKS` noticeably improves performance. I don't see any documentation of the need to tune this. It's possible that...
There are several instances where the double constant `0.0` is used in a way that promotes everything it touches. For example: https://github.com/UoB-HPC/BabelStream/blob/1d423fc70dd573b528ee43f521401277731b443a/src/std-data/STDDataStream.cpp#L85 In this case, the value is used on...
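The promotion effect can be seen in a minimal sketch that is independent of the linked code. It uses `std::inner_product` purely as an illustration (the linked line may use a different algorithm), and `dot_double_init` and `dot_typed_init` are hypothetical names. With a `0.0` initial value the accumulator and result are `double` even for `float` data; with `T{}` they stay in the element type.

```cpp
#include <cassert>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Initial value is the double literal 0.0, so the accumulator (and the
// return type) is double regardless of T.
template <typename T>
auto dot_double_init(const std::vector<T>& a, const std::vector<T>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Initial value is T{}, so accumulation stays in the element type.
template <typename T>
auto dot_typed_init(const std::vector<T>& a, const std::vector<T>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), T{});
}

using FloatVec = std::vector<float>;

// The result types differ even though the inputs are identical.
static_assert(std::is_same<decltype(dot_double_init(std::declval<FloatVec>(),
                                                    std::declval<FloatVec>())),
                           double>::value,
              "0.0 promotes the float reduction to double");
static_assert(std::is_same<decltype(dot_typed_init(std::declval<FloatVec>(),
                                                   std::declval<FloatVec>())),
                           float>::value,
              "T{} keeps the reduction in float");

// Small runtime check: both variants agree on values that are exact in float.
inline void check_dot() {
  const FloatVec a{1.0f, 2.0f};
  const FloatVec b{3.0f, 4.0f};
  assert(dot_double_init(a, b) == 11.0);
  assert(dot_typed_init(a, b) == 11.0f);
}
```

Writing the constant as `T{}` (or `T(0)`) is one common way to keep single-precision runs in single precision; whether that is the right fix for each instance depends on the surrounding code.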
oneTBB works well when used as a CMake FetchContent dependency. By doing this, TBB and the benchmark can be configured and compiled together which allows TBB to make better decisions...
As main memory sizes increase, we are seeing errors for very large input sizes, passed in via the command line argument `--arraysize`. This reads in an `int` which can store...
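A minimal sketch of the limit, assuming the usual 32-bit `int`: an array size of 2^31 elements or more is not representable, while a 64-bit unsigned type holds it comfortably. `parse_array_size` and `fits_in_int` are hypothetical helpers for illustration and are not BabelStream's actual argument parser.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper: parse an --arraysize value into a 64-bit unsigned
// integer. std::stoull throws std::invalid_argument or std::out_of_range
// on malformed or oversized input.
inline std::uint64_t parse_array_size(const std::string& text) {
  return std::stoull(text);
}

// True if the element count can be stored in an int without overflow.
constexpr bool fits_in_int(std::uint64_t n) {
  return n <= static_cast<std::uint64_t>(std::numeric_limits<int>::max());
}

// Small runtime check (assumes a 32-bit int, as on common platforms).
inline void check_array_size() {
  assert(fits_in_int(parse_array_size("2147483647")));   // 2^31 - 1
  assert(!fits_in_int(parse_array_size("2147483648")));  // 2^31
}
```

Note that `std::stoull` accepts a leading minus sign and wraps the value, so a real parser would likely want to reject negative input explicitly.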
[Numba](https://developer.nvidia.com/how-to-cuda-python) seems to be the *Nvidia recognised* way of CUDA programming with Python. Numba supports direct kernel programming similar to how it's done in Julia where the annotated code/method is...
The OpenMP CPU version allocates twice the memory it needs, which restricts the maximum problem size. For offload models, this is OK as the data needs to exist on the device...
So it appears that instead of calling `++` or separate `+` and `=` like in libstdc++ and `-stdpar=gpu`, `-stdpar=multicore` calls `+=`. ``` "/lustre/home/br-wlin/nvhpc_sdk/Linux_x86_64/22.1/compilers/include-stdpar/thrust/system/detail/generic/advance.inl", line 48: error: no operator "+=" matches...
1. SYCL performance on NVIDIA A100 is currently 2-3% worse than native CUDA. Inspection of the PTX generated by SYCL shows extra parameters and instructions due to accessor and buffers....