[RFC] Reducing RPC Latency
ARTIQ Feature Request
Problem this request addresses
For small RPCs, latency is the dominant contributor to round-trip time. Reducing it would likely improve system performance for typical use cases.
Describe the solution you'd like
Here are several approaches that I can think of:
- OS parameter tuning. This is outside our control and is not discussed further here.
- Busy polling: keep the task runnable so the OS never puts it to sleep, avoiding wakeup and context-switch costs.
- Rewrite the `comm_kernel` module in C/Rust, to avoid the slowness of the Python interpreter.
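To illustrate the contrast between the first two approaches, here is a minimal sketch (not ARTIQ code; a local socket pair stands in for the core-device link) of a blocking read versus a busy-polling read:

```python
import socket

# A connected pair of local sockets stands in for the core-device link.
server, client = socket.socketpair()
client.sendall(b"\x01")  # pretend the kernel sent an RPC header byte

# Blocking read: the OS may put the task to sleep until data arrives,
# adding wakeup/context-switch latency on top of the wire latency.
header = server.recv(1)
assert header == b"\x01"

# Busy polling: keep retrying a non-blocking read so the thread stays
# runnable, trading one fully loaded core for lower wakeup latency.
client.sendall(b"\x02")
server.setblocking(False)
while True:
    try:
        header = server.recv(1)
        break
    except BlockingIOError:
        continue  # spin; in real code this pins a CPU core at 100%
assert header == b"\x02"
```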
For busy polling, I've already implemented it in #1681 and there is some improvement for our benchmark. However, there are some issues with this approach:
- Busy polling would cause high CPU usage (for one CPU core), which may heat up the CPU.
- As busy polling occupies the current thread, other Python threads may be starved (due to the GIL).
- The benchmark doesn't really reflect use cases where RPC calls are sparse. In those cases, the latency would not be reduced if we use a blocking read for headers. It would be lower if we kept polling without blocking, but that causes high CPU usage even when there are no RPC calls.
For example, for the following test case:
```python
from artiq.experiment import *
import numpy as np


class LatencyCheck(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.n = 1
        self.rounds = 10
        self.t = np.array([0.0] * self.rounds)

    @rpc
    def something(self, x):
        pass

    @kernel
    def run(self):
        self.core.reset()
        n = self.n
        rounds = self.rounds
        t = self.t
        for i in range(rounds):
            total = 0.0
            for j in range(n):
                t0 = self.core.get_rtio_counter_mu()
                self.something(1)
                t1 = self.core.get_rtio_counter_mu()
                total += self.core.mu_to_seconds(t1 - t0)
            t[i] = total * 1000000 / n
            # simulate other work...
            for j in range(1000000):
                pass
        avg = 0.0
        std = 0.0
        for j in range(rounds):
            avg += t[j]
        avg /= rounds
        for j in range(rounds):
            std += (t[j] - avg) ** 2
        std = np.sqrt(std / rounds)
        print("Avg:", avg, "µs, std:", std, "µs")
```
The result is around 455 µs with the current master and 324 µs with busy polling (without the adaptive blocking in #1681). If we use adaptive blocking, the result is similar to the current master. Busy polling all the time is effective in reducing the latency, but we would have 100% CPU usage (for one core) during kernel execution.
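For reference, the adaptive-blocking idea can be sketched as follows (a simplification of the approach in #1681 based on the description above; the helper name and spin budget are hypothetical): spin on a non-blocking read for a bounded time, then fall back to a blocking read so sparse RPC traffic does not pin a core.

```python
import socket
import time


def read_header_adaptive(sock, spin_budget_s=0.0005):
    """Busy-poll for up to spin_budget_s, then fall back to a blocking
    recv so sparse RPC traffic does not pin a core (hypothetical helper)."""
    sock.setblocking(False)
    deadline = time.monotonic() + spin_budget_s
    while time.monotonic() < deadline:
        try:
            return sock.recv(1)  # fast path: data arrived while spinning
        except BlockingIOError:
            pass
    sock.setblocking(True)
    return sock.recv(1)          # slow path: let the OS put us to sleep


# demo with a local socket pair standing in for the core-device link
a, b = socket.socketpair()
b.sendall(b"\x5a")
assert read_header_adaptive(a) == b"\x5a"
```

The trade-off matches the benchmark above: when RPCs arrive back-to-back the fast path wins, but when they are sparse every read burns the full spin budget and then blocks anyway, so the latency ends up similar to plain blocking reads.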
Another possible direction is to rewrite the module in Rust; without an interpreter, there is no interpreter-induced latency. This also addresses the second problem with busy polling, since we could probably poll in a separate thread without holding the GIL. However, this would take some time, and I have no data to show the potential improvement.
It would be helpful to know whether users want lower latency or lower CPU usage, and whether there are other possible directions that we could try.
For me, exposing messaging as a primitive to kernel code directly (that is, instead of only as a synchronous request/response pair) would be more important than a latency improvement hard-earned through fiddling with busy-polling/…, as currently, quite a lot of the latency-critical RPCs logically check for the presence/absence of a message, where no message is the common case (e.g. `check_pause()`).

(The units should be µs, by the way.)
> For me, exposing messaging as a primitive to kernel code directly (that is, instead of only as a synchronous request/response pair) would be more important than a latency improvement hard-earned through fiddling with busy-polling/…, as currently, quite a lot of the latency-critical RPCs logically check for the presence/absence of a message, where no message is the common case (e.g. `check_pause()`).
It should be possible, but you still need a thread to wait for the messages, and I don't think you can do it with low latency in Python due to the GIL? (I guess)
> (The units should be µs, by the way.)
Fixed.
> It should be possible, but you still need a thread to wait for the messages, and I don't think you can do it with low latency in Python due to the GIL? (I guess)
This is true. However, the impetus would be more to provide an option to avoid the latency sensitivity in the first place. At present, all messages from host to kernel are synchronous (in the sense of blocking on the kernel side), thus making them necessarily latency-sensitive. While this is a good match for cases where e.g. parameters for some Bayesian inference scheme need to be recomputed on the host on the fly, there are other cases where there is no actual "data dependence". For instance, for termination/interruption requests from the host to the kernel (`check_pause`), a "message box" model is much more natural.
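To make the "message box" idea concrete, here is a host-side sketch using a plain queue (the `MessageBox` class and its `post`/`check` names are illustrative, not an ARTIQ API): the host posts control messages asynchronously, and the kernel-facing side checks for them without blocking, so the common no-message case costs no synchronous round trip.

```python
import queue


class MessageBox:
    """Mailbox sketch: the host posts control messages (e.g. a pause
    request) and the kernel-facing side polls without blocking."""

    def __init__(self):
        self._q = queue.Queue()

    def post(self, msg):
        # Host side: deposit a message; returns immediately.
        self._q.put(msg)

    def check(self):
        # Kernel-facing side: non-blocking check. "No message" is the
        # common, cheap case, so it needs no synchronous round trip.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


box = MessageBox()
assert box.check() is None       # common case: nothing pending
box.post("pause")
assert box.check() == "pause"    # kernel sees the request on next check
```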