KernelAbstractions.jl
Heterogeneous programming in Julia
Hi, I'm trying to use KA for the first time and I'm wondering about the performance I get for a simple kernel copying two 2D matrices of Float32 (I know...
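For reference, a minimal 2D copy kernel can be written like this. This is a sketch assuming the KernelAbstractions v0.9 API; the name `copy_kernel!` is illustrative.

```julia
using KernelAbstractions

# Copy every element of `src` into `dst` using a Cartesian global index,
# so the same kernel works for any 2D (or n-D) array shape.
@kernel function copy_kernel!(dst, @Const(src))
    I = @index(Global, Cartesian)
    @inbounds dst[I] = src[I]
end

A = rand(Float32, 1024, 1024)
B = similar(A)
backend = get_backend(A)            # CPU() for a plain Array
kernel! = copy_kernel!(backend)
kernel!(B, A; ndrange = size(A))    # launch over the full index space
KernelAbstractions.synchronize(backend)
```

On the CPU backend, performance for such memory-bound kernels is usually dominated by memory bandwidth rather than the kernel body itself.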
I use `JuliaFormatter` to format my source code. This adds explicit `return` statements. This leads to ``` ERROR: LoadError: Return statement not permitted in a kernel function sum2_kernel! ``` even...
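One way around this, assuming the formatter's `always_use_return` option is what inserts the explicit `return`, is to disable it in the project's formatter configuration so `@kernel` bodies stay valid:

```toml
# .JuliaFormatter.toml — keep the formatter from appending `return`
# statements, which KernelAbstractions rejects inside @kernel functions
always_use_return = false
```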
Apparently, DiffEqGPU is failing on v1.10 on Metal: https://buildkite.com/julialang/diffeqgpu-dot-jl/builds/1006#018cf9e1-e6db-42da-b270-1afbf733a6d4
Hi! Thank you very much for this project. I'm working on a kernel where I need to do a "max", for example `a = max(1, 2)`, but I'm getting this...
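A common cause of `max` failing inside GPU kernels is mixing literal types with the array's element type. A hedged sketch (assumed KA v0.9 API; `clamp_kernel!` is an illustrative name) that keeps the literal in Float32:

```julia
using KernelAbstractions

# Element-wise max against a literal; the 0f0 literal matches the
# Float32 element type, avoiding a type-unstable mixed-type max.
@kernel function clamp_kernel!(a, @Const(b))
    i = @index(Global)
    @inbounds a[i] = max(b[i], 0f0)
end

b = randn(Float32, 16)
a = similar(b)
backend = get_backend(b)
clamp_kernel!(backend)(a, b; ndrange = length(b))
KernelAbstractions.synchronize(backend)
```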
It would be great if, in a multi-device system, the device id that runs a KA kernel could be set through a function call. cc: @vchuravy
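KA itself does not currently expose such a call (hence the request); with the CUDA.jl backend, device selection is done per task through the backend package before launching. A hedged sketch:

```julia
using CUDA, KernelAbstractions

# Select the second GPU for the current task; KA kernels launched with
# CUDABackend() afterwards run on the currently active device.
CUDA.device!(1)
backend = CUDABackend()
# ... instantiate and launch KA kernels against `backend` as usual
```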
I noticed in Stencils.jl that when I'm using a fast stencil (e.g. 3x3 window summing over a `Matrix{Bool}`) that the indexing in `__thread_run` takes longer than actually reading and summing...
Together with @weymouth we are trying to create a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly manage to do...
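One hedged way to express "apply `f` to each element of an n-dimensional array" with a single kernel, assuming the KA v0.9 API (the name `map_kernel!` is illustrative; the function is passed as an ordinary kernel argument):

```julia
using KernelAbstractions

# Apply `f` element-wise; the Cartesian global index makes the kernel
# dimension-agnostic, so the same code covers any ndims.
@kernel function map_kernel!(f, out, @Const(A))
    I = @index(Global, Cartesian)
    @inbounds out[I] = f(A[I])
end

A = rand(Float32, 4, 4, 4)
out = similar(A)
backend = get_backend(A)
map_kernel!(backend)(x -> 2x, out, A; ndrange = size(A))
KernelAbstractions.synchronize(backend)
```

On GPU backends the passed function must be GPU-compatible (no captured CPU-only state), but plain functions and simple closures generally work.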
I am unsure why my previous PR was closed, but here are the changes. - I added docs - I added tests It was my first time writing tests, and they...
On CPU always use `NoDynamicCheck()`, just finish the last partial workgroup with `DynamicCheck()`
Given that `DynamicCheck()` breaks SIMD this can be an order of magnitude faster for some inexpensive tasks. I'll write up a better MWE, but this is the scale of it...
I had a request from a user to use warp-level semantics from CUDA: `sync_warp`, `warpsize`, and stuff here: https://cuda.juliagpu.org/stable/api/kernel/#Warp-level-functions. They seem to be available here: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html, but I don't know...