Philip Turner

340 comments of Philip Turner

Apple is the primary victim of this latency bottleneck. Their drivers have very high latencies and little opportunity to run multiple GPU commands in parallel. AMD is in the middle...

Nvidia has Multi-Instance GPU (MIG), but none of the other vendors offer an equivalent.

There is one way to test the theory quickly: run multiple instances of the simulation in parallel, utilizing multiple CPU cores. I hypothesize that running two simulations will double the...
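A minimal sketch of that experiment, using OpenMM's C++ API: two host threads, each driving its own `Context` on the same GPU. The argon-like toy system, the particle count, and the thread count are illustrative assumptions, not the `benchmark.py` systems. If the aggregate throughput roughly doubles with two threads, the GPU was idling on driver latency rather than saturated with work.

```cpp
// Latency probe: N identical simulations run concurrently, one per CPU
// thread, each with its own OpenMM Context and GPU command stream.
#include "OpenMM.h"
#include <cstdio>
#include <thread>
#include <vector>

static void runOneSimulation(int numSteps) {
    OpenMM::System system;
    OpenMM::NonbondedForce* nonbond = new OpenMM::NonbondedForce();
    system.addForce(nonbond); // the System takes ownership of the force
    std::vector<OpenMM::Vec3> positions;
    for (int i = 0; i < 1000; ++i) {
        system.addParticle(39.948);               // argon-like mass (amu)
        nonbond->addParticle(0.0, 0.3350, 0.996); // charge, sigma (nm), epsilon (kJ/mol)
        positions.push_back(OpenMM::Vec3(
            0.5 * (i % 10), 0.5 * ((i / 10) % 10), 0.5 * (i / 100)));
    }
    OpenMM::VerletIntegrator integrator(0.002); // 2 fs time step
    OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("OpenCL");
    OpenMM::Context context(system, integrator, platform);
    context.setPositions(positions);
    integrator.step(numSteps); // each thread submits its own GPU commands
}

int main() {
    OpenMM::Platform::loadPluginsFromDirectory(
        OpenMM::Platform::getDefaultPluginsDirectory());
    const int numSimulations = 2; // compare wall time against a single instance
    std::vector<std::thread> workers;
    for (int i = 0; i < numSimulations; ++i)
        workers.emplace_back(runOneSimulation, 100000);
    for (std::thread& w : workers)
        w.join();
    std::printf("finished %d concurrent simulations\n", numSimulations);
    return 0;
}
```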

How about we narrow down the problem? For example, remove one of the kernels that isn't causing a gap and see whether the gap persists. Repeat the procedure; you get the...

> I commented out the enqueueReadBuffer and the code that relied on it. Performance is through the roof on gbsa, 410 ns/day -> 3030 ns/day!

You commented out the part...

I suggested a better way to examine the problem: run two simulations in parallel. This requires a more complex setup than `benchmark.py`. It's particularly important to my use case because...

`gbsa`, `rf`, and `amoebapme` seem messed up. Elsewhere, there was a minor performance boost. Faster simulations (where latency has a greater impact) received a larger speedup, indicating it came from not...

> Can this be known ahead of time in more situations? If so, wrapping the enqueueReadBuffer in an if statement would be a good first step.

Is this also the...
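A minimal sketch of that guard, written against the OpenCL C++ bindings. The function name `downloadInteractionCount` and the flag `countMayHaveChanged` are hypothetical stand-ins for whatever host-side knowledge proves the download is unnecessary.

```cpp
#include <CL/cl.hpp> // or <CL/cl2.hpp>, depending on the bindings in use
#include <cstdint>

// Skip the blocking GPU -> CPU round trip whenever the cached value is
// still valid; the stall from enqueueReadBuffer is what drains the queue.
void downloadInteractionCount(cl::CommandQueue& queue,
                              cl::Buffer& countBuffer,
                              uint32_t& hostCount,
                              bool countMayHaveChanged) {
    if (countMayHaveChanged) {
        queue.enqueueReadBuffer(countBuffer, CL_TRUE /*blocking*/, 0,
                                sizeof(uint32_t), &hostCount);
    }
    // Otherwise reuse hostCount from the previous step and keep the
    // GPU's command queue full.
}
```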

@egallicc would you mind comparing power consumption to ns/day on HIP like I did, but without erasing `downloadCountEvent`? Then we can investigate whether HIP reached a genuine 1000 ns/day on...

Is this the data transfer from GPU -> CPU? Perhaps it's avoidable using indirect command buffers in Metal, HIP, CUDA, or OpenCL 2.0: have the GPU encode any commands depending on...
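For the OpenCL 2.0 case, a sketch of what GPU-driven encoding could look like with device-side enqueue. The kernel `launchFromCount`, the helper `processInteraction`, and the `interactionCount` buffer are hypothetical; the host would need to compile with `-cl-std=CL2.0` and create a queue with `CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT`. The count never leaves the GPU.

```cpp
// OpenCL C 2.0 device code, embedded as a string in the C++ host program.
static const char* deviceEnqueueSource = R"CLC(
// Plain helper executed by each work-item of the child launch.
void processInteraction(global const float* data, int n) {
    int i = get_global_id(0);
    if (i < n) {
        // ... consume interaction i without any host involvement ...
    }
}

kernel void launchFromCount(global const int* interactionCount,
                            global const float* data) {
    if (get_global_id(0) == 0) {
        int n = *interactionCount; // value stays on the GPU
        // The GPU enqueues the dependent kernel itself, sized by n,
        // so no enqueueReadBuffer round trip is needed on the CPU.
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D((size_t)n),
                       ^{ processInteraction(data, n); });
    }
}
)CLC";
```

Metal's indirect command buffers and CUDA's dynamic parallelism play the analogous role: the GPU sizes and launches the dependent work itself instead of reporting a count back to the CPU.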