Philip Turner

340 comments of Philip Turner

Apple is the primary victim of this latency bottleneck. Their drivers have very high latencies and little opportunity to run multiple GPU commands in parallel. AMD is in the middle...

Nvidia has Multi-Instance GPU (MIG), but none of the other vendors offer an equivalent.

There is one way to test the theory quickly: run multiple instances of the simulation in parallel, utilizing multiple CPU cores. I hypothesize that running two simulations will double the...
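A minimal sketch of that experiment, using OpenMM's C++ API: two host threads, each driving its own `Context` on the same GPU. The argon-like toy system, the particle count, and the thread count are illustrative assumptions, not the `benchmark.py` systems. If the aggregate throughput roughly doubles with two threads, the GPU was idling on driver latency rather than saturated with work.

```cpp
// Latency probe: N identical simulations run concurrently, one per CPU
// thread, each with its own OpenMM Context and GPU command stream.
#include "OpenMM.h"
#include <cstdio>
#include <thread>
#include <vector>

static void runOneSimulation(int numSteps) {
    OpenMM::System system;
    OpenMM::NonbondedForce* nonbond = new OpenMM::NonbondedForce();
    system.addForce(nonbond); // the System takes ownership of the force
    std::vector<OpenMM::Vec3> positions;
    for (int i = 0; i < 1000; ++i) {
        system.addParticle(39.948);               // argon-like mass (amu)
        nonbond->addParticle(0.0, 0.3350, 0.996); // charge, sigma (nm), epsilon (kJ/mol)
        positions.push_back(OpenMM::Vec3(
            0.5 * (i % 10), 0.5 * ((i / 10) % 10), 0.5 * (i / 100)));
    }
    OpenMM::VerletIntegrator integrator(0.002); // 2 fs time step
    OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("OpenCL");
    OpenMM::Context context(system, integrator, platform);
    context.setPositions(positions);
    integrator.step(numSteps); // each thread submits its own GPU commands
}

int main() {
    OpenMM::Platform::loadPluginsFromDirectory(
        OpenMM::Platform::getDefaultPluginsDirectory());
    const int numSimulations = 2; // compare wall time against a single instance
    std::vector<std::thread> workers;
    for (int i = 0; i < numSimulations; ++i)
        workers.emplace_back(runOneSimulation, 100000);
    for (std::thread& w : workers)
        w.join();
    std::printf("finished %d concurrent simulations\n", numSimulations);
    return 0;
}
```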

How about we narrow down the problem? For example, remove one of the kernels that isn't causing a gap and see whether the gap persists. Repeat the procedure; you get the...

> I commented out the enqueueReadBuffer and the code that relied on it. Performance is through the roof on gbsa, 410 ns/day -> 3030 ns/day!

You commented out the part...

I suggested a better way to examine the problem: run two simulations in parallel. This requires a more complex setup than `benchmark.py`. It's particularly important to my use case because...

`gbsa`, `rf`, and `amoebapme` seem messed up. Elsewhere, there was a minor performance boost. Faster simulations (where latency has a greater impact) received a larger speedup, indicating it came from not...

> Can this be known ahead of time in more situations? If so, wrapping the enqueueReadBuffer in an if statement would be a good first step.

Is this also the...
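A minimal sketch of that guard, written against the OpenCL C++ bindings. The function name `downloadInteractionCount` and the flag `countMayHaveChanged` are hypothetical stand-ins for whatever host-side knowledge proves the download is unnecessary.

```cpp
#include <CL/cl.hpp> // or <CL/cl2.hpp>, depending on the bindings in use
#include <cstdint>

// Skip the blocking GPU -> CPU round trip whenever the cached value is
// still valid; the stall from enqueueReadBuffer is what drains the queue.
void downloadInteractionCount(cl::CommandQueue& queue,
                              cl::Buffer& countBuffer,
                              uint32_t& hostCount,
                              bool countMayHaveChanged) {
    if (countMayHaveChanged) {
        queue.enqueueReadBuffer(countBuffer, CL_TRUE /*blocking*/, 0,
                                sizeof(uint32_t), &hostCount);
    }
    // Otherwise reuse hostCount from the previous step and keep the
    // GPU's command queue full.
}
```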

@egallicc would you mind comparing power consumption to ns/day on HIP like I did, but without erasing `downloadCountEvent`? Then we can investigate whether HIP reached a genuine 1000 ns/day on...

Is this the data transfer from GPU -> CPU? Perhaps it's avoidable using indirect command buffers in Metal, HIP, CUDA, or OpenCL 2.0: have the GPU encode any commands depending on...
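For the OpenCL 2.0 case, a sketch of what GPU-driven encoding could look like with device-side enqueue. The kernel `launchFromCount`, the helper `processInteraction`, and the `interactionCount` buffer are hypothetical; the host would need to compile with `-cl-std=CL2.0` and create a queue with `CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT`. The count never leaves the GPU.

```cpp
// OpenCL C 2.0 device code, embedded as a string in the C++ host program.
static const char* deviceEnqueueSource = R"CLC(
// Plain helper executed by each work-item of the child launch.
void processInteraction(global const float* data, int n) {
    int i = get_global_id(0);
    if (i < n) {
        // ... consume interaction i without any host involvement ...
    }
}

kernel void launchFromCount(global const int* interactionCount,
                            global const float* data) {
    if (get_global_id(0) == 0) {
        int n = *interactionCount; // value stays on the GPU
        // The GPU enqueues the dependent kernel itself, sized by n,
        // so no enqueueReadBuffer round trip is needed on the CPU.
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D((size_t)n),
                       ^{ processInteraction(data, n); });
    }
}
)CLC";
```

Metal's indirect command buffers and CUDA's dynamic parallelism play the analogous role: the GPU sizes and launches the dependent work itself instead of reporting a count back to the CPU.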