
[RFC] Reducing RPC Latency

Open pca006132 opened this issue 3 years ago • 4 comments

ARTIQ Feature Request

Problem this request addresses

For small RPCs, latency is the dominant contributor to the round-trip time of an RPC call. Reducing it can probably improve system performance for typical use cases.

Describe the solution you'd like

Here are several approaches that I can think of:

  1. OS parameter tuning. This is not something we can do on our side, so it is not discussed further here.
  2. Busy polling: keep the OS from putting our task to sleep, avoiding the context-switch and wake-up cost (see the sketch after this list).
  3. Rewriting the comm_kernel module in C/Rust, to avoid the overhead of the Python interpreter.
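
As a rough illustration of the busy-polling idea (option 2), here is a minimal sketch; it is not the actual comm_kernel code, and the helper name and fixed-size read are assumptions made for the example:

import socket

def busy_poll_recv(sock: socket.socket, size: int) -> bytes:
    """Receive exactly `size` bytes by spinning on a non-blocking socket.

    Spinning keeps the OS from putting the task to sleep, so we avoid the
    wake-up/context-switch latency, at the cost of fully loading one CPU
    core while we wait.
    """
    sock.setblocking(False)
    buf = b""
    while len(buf) < size:
        try:
            chunk = sock.recv(size - len(buf))
            if not chunk:
                raise ConnectionError("connection closed")
            buf += chunk
        except BlockingIOError:
            continue  # no data yet, keep spinning
    return buf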

For busy polling, I've already implemented it in #1681 and there is some improvement for our benchmark. However, there are some issues with this approach:

  1. Busy polling causes high CPU usage (one core fully loaded), which may heat up the CPU.
  2. Because busy polling occupies the current thread, other Python threads may be starved (due to the GIL).
  3. The benchmark doesn't really reflect use cases where RPC calls are sparse. In those cases, latency is not reduced if we use a blocking read for headers; it would be lower if we kept polling without blocking, but then CPU usage stays high even when there are no RPC calls (see the sketch below for a middle ground).
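
One way to balance the two extremes is an adaptive scheme along the lines of #1681: spin for a bounded budget in case a reply is already in flight, then fall back to a blocking read so that sparse RPC traffic does not pin a core. The following is only a sketch under those assumptions, not the actual implementation; the names and the spin budget are illustrative:

import socket

def adaptive_recv(sock: socket.socket, size: int, spin_budget: int = 10000) -> bytes:
    """Spin briefly for low latency, then block to keep idle CPU usage low."""
    buf = b""
    # Phase 1: bounded busy polling, useful when RPC replies arrive quickly.
    sock.setblocking(False)
    for _ in range(spin_budget):
        if len(buf) >= size:
            break
        try:
            chunk = sock.recv(size - len(buf))
            if not chunk:
                raise ConnectionError("connection closed")
            buf += chunk
        except BlockingIOError:
            pass
    # Phase 2: give up spinning and let the OS wake us up, so that an idle
    # kernel (no pending RPC) does not burn a CPU core.
    sock.setblocking(True)
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("connection closed")
        buf += chunk
    return buf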

For example, with the following test case:

from artiq.experiment import *
import numpy as np

class LatencyCheck(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.n = 1
        self.rounds = 10
        self.t = np.array([0.0] * self.rounds)

    @rpc
    def something(self, x):
        # Empty RPC: the round-trip time itself is what we measure.
        pass

    @kernel
    def run(self):
        self.core.reset()
        n = self.n
        rounds = self.rounds
        t = self.t

        for i in range(rounds):
            total = 0.0
            for j in range(n):
                # Time one empty RPC round trip using the RTIO counter.
                t0 = self.core.get_rtio_counter_mu()
                self.something(1)
                t1 = self.core.get_rtio_counter_mu()
                total += self.core.mu_to_seconds(t1 - t0)
            t[i] = total * 1000000 / n  # average per-call latency in µs

            # simulate other work...
            for j in range(1000000):
                pass

        # Mean and standard deviation of the per-round latencies.
        avg = 0.0
        std = 0.0
        for j in range(rounds):
            avg += t[j]
        avg /= rounds
        for j in range(rounds):
            std += (t[j] - avg) ** 2
        std = np.sqrt(std / rounds)
        print("Avg:", avg, "µs, std:", std, "µs")

The result is around 455 µs with the current master and around 324 µs with busy polling (without the adaptive blocking in #1681). If we use adaptive blocking, the result is similar to the current master. Busy polling all the time is effective at reducing latency, but it means 100% CPU usage (on one core) during kernel execution.

Another possible direction is to rewrite the module in Rust; with no interpreter involved, there is no interpreter-induced latency. It would also address the second busy-polling problem, since the polling could probably be done in a separate thread without holding the GIL. However, this would take some time and I have no data to show the potential improvement.

It would be helpful to know whether users prefer lower latency or lower CPU usage, and whether there are other possible directions we could try.

pca006132 avatar May 30 '21 15:05 pca006132

For me, exposing messaging as a primitive to kernel code directly (that is, not only as a synchronous request/response pair) would be more important than a latency improvement hard-earned through fiddling with busy-polling/…, since currently quite a lot of the latency-critical RPCs logically just check for the presence or absence of a message, where no message is the common case (e.g. check_pause()).

dnadlinger avatar May 30 '21 18:05 dnadlinger

(The units should be µs, by the way.)

dnadlinger avatar May 31 '21 00:05 dnadlinger

> For me, exposing messaging as a primitive to kernel code directly (that is, not only as a synchronous request/response pair) would be more important than a latency improvement hard-earned through fiddling with busy-polling/…, since currently quite a lot of the latency-critical RPCs logically just check for the presence or absence of a message, where no message is the common case (e.g. check_pause()).

It should be possible, but you still need a thread to wait for the messages, and I don't think you can do that with low latency in Python due to the GIL? (I guess)

> (The units should be µs, by the way.)

Fixed.

pca006132 avatar May 31 '21 04:05 pca006132

> It should be possible, but you still need a thread to wait for the messages, and I don't think you can do that with low latency in Python due to the GIL? (I guess)

This is true. However, the impetus would be more to provide an option to avoid the latency sensitivity in the first place. At present, all messages from host to kernel are synchronous (in the sense of blocking on the kernel side), thus making them necessarily latency-sensitive. While this is a good match for cases where e.g. parameters for some Bayesian inference scheme need to be recomputed on the host on the fly, there are other cases where there is no actual "data dependence". For instance, for termination/interruption requests from the host to the kernel (check_pause), a "message box" model is much more natural.
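
To make that idea concrete, a host-side "message box" could conceptually look like the sketch below; Mailbox, post() and poll() are hypothetical names for illustration, not existing ARTIQ API. The point is that the kernel-facing check returns immediately whether or not a message is present, so check_pause-style queries stop being latency-critical synchronous round trips:

import queue

class Mailbox:
    # Hypothetical host-side message box: the host posts control messages
    # (e.g. a pause or termination request) and the kernel-facing side polls
    # for them without blocking when the box is empty -- the common case.
    def __init__(self):
        self._messages = queue.Queue()

    def post(self, message):
        # Host side: deposit a message without waiting for the kernel.
        self._messages.put(message)

    def poll(self):
        # Kernel-facing side: return a pending message, or None immediately
        # if there is none -- no synchronous round trip required.
        try:
            return self._messages.get_nowait()
        except queue.Empty:
            return None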

dnadlinger avatar Jun 07 '21 11:06 dnadlinger