BabelStream issues

Fortran ports

5

This is a new implementation of BabelStream using Fortran. The code uses a Fortran driver that is largely equivalent to the C++ one, with a few exceptions. First, it does...

jeffhammond

overflow in CUDA

7

I kept getting
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Device does not have enough memory for all 3 buffers
```
when running with 1Gi elements on...

jeffhammond

CUDA dot tuning

5

Not a bug, just FYI, but on A100, increasing `DOT_NUM_BLOCKS` noticeably improves performance. I don't see any documentation of the need to tune this. It's possible that...

jeffhammond

Always use the correct numeric type for all kernels

There are several instances where the double constant `0.0` is used in a way that promotes everything it touches. For example: https://github.com/UoB-HPC/BabelStream/blob/1d423fc70dd573b528ee43f521401277731b443a/src/std-data/STDDataStream.cpp#L85 In this case, the value is used on...

tom91136
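The promotion described in this issue can be illustrated with a minimal sketch. The two helpers below, `dot_promoting` and `dot_typed`, are hypothetical names invented for illustration and are not taken from the BabelStream sources. They assume a dot product written with `std::inner_product`, which may differ from what the linked line actually does.

```cpp
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical helper: the initial value is the double literal 0.0, so the
// accumulator and the return type are double even when T is float.
template <typename T>
auto dot_promoting(const std::vector<T>& a, const std::vector<T>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// Hypothetical helper: the initial value is T{}, so the accumulation stays
// in the element type T.
template <typename T>
auto dot_typed(const std::vector<T>& a, const std::vector<T>& b) {
  return std::inner_product(a.begin(), a.end(), b.begin(), T{});
}

// Compile-time check of the deduced result types for float inputs.
using FloatVec = std::vector<float>;
static_assert(
    std::is_same<decltype(dot_promoting(std::declval<FloatVec>(),
                                        std::declval<FloatVec>())),
                 double>::value,
    "a 0.0 initial value promotes the float reduction to double");
static_assert(
    std::is_same<decltype(dot_typed(std::declval<FloatVec>(),
                                    std::declval<FloatVec>())),
                 float>::value,
    "a T{} initial value keeps the reduction in float");
```

Calling `dot_typed` on two `std::vector<float>` arguments returns a `float`, while `dot_promoting` on the same arguments returns a `double`, so a single-precision benchmark would silently do part of its work in double precision.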

Add oneTBB as a FetchContent dependency

1

oneTBB works well when used as a CMake FetchContent dependency. By doing this, TBB and the benchmark can be configured and compiled together which allows TBB to make better decisions...

tom91136

WIP: support massive input sizes

3

As main memory sizes increase, we are seeing errors for very large input sizes, passed in via the command line argument `--arraysize`. This reads in an `int` which can store...

tomdeakin
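One way to picture the problem with reading `--arraysize` into an `int` is the sketch below. It is not the change made in this PR: `parse_array_size` and `fits_in_int` are hypothetical helpers, and the choice of `std::uint64_t` plus `std::stoull` is an assumption made for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper, not BabelStream's actual parser: read an element
// count into a 64-bit unsigned type instead of an int.
inline std::uint64_t parse_array_size(const std::string& text) {
  // Accept only strings that start with a digit, so that signs and leading
  // whitespace (which std::stoull would otherwise accept) are rejected.
  if (text.empty() || text[0] < '0' || text[0] > '9') {
    throw std::invalid_argument("array size must be a positive integer");
  }
  std::size_t consumed = 0;
  // std::stoull throws std::out_of_range if the value does not fit.
  const unsigned long long value = std::stoull(text, &consumed);
  if (consumed != text.size() || value == 0) {
    throw std::invalid_argument("array size must be a positive integer");
  }
  return static_cast<std::uint64_t>(value);
}

// Hypothetical helper: true if the count could have been stored in an int.
inline bool fits_in_int(std::uint64_t n) {
  return n <= static_cast<std::uint64_t>(std::numeric_limits<int>::max());
}
```

With a 32-bit `int`, `fits_in_int(parse_array_size("4294967296"))` is false, which is the kind of input that an `int`-based parser cannot represent.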

Add Python implementation

[Numba](https://developer.nvidia.com/how-to-cuda-python) seems to be the *Nvidia recognised* way of CUDA programming with Python. Numba supports direct kernel programming similar to how it's done in Julia where the annotated code/method is...

tom91136

enhancement

Update verification check to save memory

1

The OpenMP CPU version allocates twice the memory it needs, restricting the maximum problem size. For offload models, this is OK as the data needs to exist on the device...

tomdeakin

NVHPC needs the `+=` operator to be implemented for std-indices on stdpar=multicore

So it appears that instead of calling `++` or separate `+` and `=` like in libstdc++ and `-stdpar=gpu`, `-stdpar=multicore` calls `+=`. ``` "/lustre/home/br-wlin/nvhpc_sdk/Linux_x86_64/22.1/compilers/include-stdpar/thrust/system/detail/generic/advance.inl", line 48: error: no operator "+=" matches...

tom91136

Additional SYCL USM (device pointer explicit copy) and CUDA tuning for DOT

8

1. SYCL performance on NVIDIA A100 is currently 2-3% worse than native CUDA. Inspection of the PTX generated by SYCL shows extra parameters and instructions due to accessor and buffers....

lfmeadow
