Valentin Churavy
I am a bit uncertain about HistoricalStdlibVersions.jl. It currently has:
```
UUID("0dad84c5-d112-42e6-8d28-ef12dabb789f") => ("ArgTools", v"1.1.1"),
UUID("4af54fe1-eca0-43a8-85a7-787d91b784e3") => ("LazyArtifacts", nothing),
```
Both are registered. The version we have for ArgTools on...
I lean towards providing a more performant default, so `ch4` gets my vote.
#10249 switched to `ch4`
Can you post a profile (https://cuda.juliagpu.org/stable/development/profiling/#Integrated-profiler) so that we can determine whether the overhead is in the kernel or in the kernel launch?
If you changed the problem size, then you also need to change the number of blocks:
```
julia> CUDA.@profile batched_dot_cuda!(o, x, y; threads=32, blocks=round(Int, length(o)/32))
```
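As a side note on choosing the block count: a rough sketch, assuming the problem size is not always a multiple of the thread count. `launch_blocks` is a hypothetical helper, not part of CUDA.jl, and the size `n = 1_000` is made up for illustration. Ceiling division with `cld` avoids dropping the tail that `round` can lose.

```julia
# Hypothetical helper (not a CUDA.jl function): smallest number of blocks
# such that threads * blocks >= n, via ceiling division.
launch_blocks(n::Integer, threads::Integer) = cld(n, threads)

n = 1_000        # assumed problem size, deliberately not a multiple of 32
threads = 32

# Rounding the quotient can round down and leave elements uncovered.
blocks_rounded = round(Int, n / threads)
covered_rounded = blocks_rounded * threads

# Ceiling division always covers all n elements.
blocks_ceil = launch_blocks(n, threads)
covered_ceil = blocks_ceil * threads

println("round: ", blocks_rounded, " blocks cover ", covered_rounded, " of ", n)
println("cld:   ", blocks_ceil, " blocks cover ", covered_ceil, " of ", n)
```

With ceiling division the launch may cover more indices than there are elements, so the kernel would presumably need a bounds check on its index; whether `batched_dot_cuda!` already has one is not known from this thread.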
OK, that is still surprising to me. I expect some overhead, but nothing that should scale like that.
What does `CUDA.versioninfo()` report? Running this locally on a `Quadro RTX 4000`: ``` Device-side activity: GPU was busy for 1.98 ms (10.55% of the trace) ┌──────────┬────────────┬───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │ Time (%) │ Total...
Okay, that makes it even more curious. We are looking at the same generation of GPUs; mine should be about 2x slower than yours, which matches. Could you run the...
Yeah, `Const` ends up as `ldg`, but it's fascinating that this leads to a performance delta on prosumer chips. You can also verify this by using `Const` and CUDA directly...
> for small inputs KA becomes progressively slower in comparison. just curious why.

KA adds some additional integer operations for the index calculations and defaults to `Int64`. Reducing that overhead...