Szilárd Páll issues

Results: 10 issues of Szilárd Páll

add documentation

Documentation of the config file as well as full man pages are missing.

**Describe the motivation for the feature request** Currently, setting LD_LIBRARY_PATH before launching an application that uses hipSYCL is required. **Describe the solution you'd like** Not have to set LD_LIBRARY_PATH to...

enhancement

clBuildProgram segv

The following change, which only refactors the GROMACS OpenCL kernels, causes the OpenCL compiler to crash: https://gerrit.gromacs.org/#/c/7810/19/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl_kernel_utils.clh The culprit has been isolated to the linked changes on...

clFFT unit tests fail with ROCm 1.9

Multiple clFFT tests fail on both Vega10 and Fiji with ROCm 1.9. Repro ingredients ROCm 1.9 ``` $ dpkg -l | grep rocm-opencl ii rocm-opencl 1.2.0-2018090737 amd64 OpenCL/ROCm ii rocm-opencl-dev...

compilation fails with -cl-opt-disable

When compiling with `-cl-opt-disable`, I get the following errors (one for each kernel function): ``` : error: can't create dynamic relocation R_AMDGPU_REL32_LO against symbol: norm2 in readonly segment; recompile object...

possible deadlock in clFinish

GROMACS runs that previously seemed fine have stalled and failed to complete since the last ROCm update. Symptoms: with small inputs that run ~100s of microseconds per iteration (one clFinish per...

[RFE] allow using rocFFT kernels with `clCreateProgramWithBinary`

As rocFFT does not have OpenCL bindings, a relatively easy way (as suggested [here](https://github.com/ROCmSoftwarePlatform/rocFFT/issues/120#issuecomment-380488475)) would be to load rocFFT binaries with `clCreateProgramWithBinary` to be able to use them in an...

[RFC] atomic operation support on 32-bit floating point

Ideally, I would like to have atomic_add() for 32-bit floats; I'm assuming the hardware supports resolving conflicts.

gmres_device_solve performance bottlenecks

During `gmres_device_solve` there are idle gaps in the GPU utilization due to (IIUC): * global reductions (following glsc3_reduce_kernel) and * a small CPU kernel https://github.com/ExtremeFLOW/neko/blob/1753fa9e89bd83e52a704523acfa103c0fb0cbc3/src/krylov/bcknd/device/gmres_device.F90#L435 Both of these are preceded...

GPU

performance

non-overlapped D2D memcopies and memsets

I see a number of memsets and device-to-device memcopies, none of which is overlapped with compute. Based on a Leonardo TGV 256k run, up to ~3% of wall-time is spent...

GPU

performance