
Allocation for RPC results slow, times out core device connection

Open dnadlinger opened this issue 5 years ago • 6 comments

The comm <-> kernel roundtrip to allocate extra memory for an RPC result has extremely high latency, to the point where it is easy to make the master-to-core-device TCP connection time out.

For instance, have a look at the following innocent program:

from artiq.experiment import *

class TestListRPC(EnvExperiment):
    def build(self):
        self.setattr_device("core")

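    # Host-side RPC called from the kernel; the nested return value requires an allocation roundtrip for each inner list.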
    def make_data(self) -> TList(TList(TInt32)):
        return [[0]] * 1000

    @kernel
    def run(self):
        t0 = self.core.get_rtio_counter_mu()
        data = self.make_data()
        print(self.core.get_rtio_counter_mu() - t0)
        print("Got data, first element:", data[0])

Running this on latest ARTIQ fails with a TimeoutError, as the repeated roundtrips between comm and kernel CPU to allocate memory for the list elements take long enough for the TCP timeout to elapse. (The exact point at which this starts to fail might vary depending on setup details, but of course this will be a performance problem even before complete failure.)

Any idea why this takes several milliseconds? (Profiling this is a bit messy…)

dnadlinger avatar May 19 '19 17:05 dnadlinger

Profiling this is a bit messy…

@whitequark had started developing a sampling profiler, but I'm not sure what the status is.

sbourdeauducq avatar May 20 '19 02:05 sbourdeauducq

The profiler does work (it should be functional in master); unfortunately, there are two issues with it:

  • A libunwind patch is required, but upstream wants Linux tests for it, which is why it hasn't actually landed upstream. (It seems pretty hard to write those tests. Maybe I'll try again some day.) The patch is D46971.
  • The profiler only collects hits and edges, and reconstructing graphs from edges is an inherently lossy process. It outputs what are as far as I can tell completely valid cachegrind files, but kcachegrind's interpretation of them can be really confusing.

@dnadlinger I can try to refresh my memory and write some introduction for using the profiler.

whitequark avatar May 20 '19 17:05 whitequark

The slowness is due to the allocation request/reply per element, which we can hardly do anything about. (Some optimization on zynq could bring the allocation request per element down to 8 microseconds, but that would still take 9 ms for the test case above.) The recent PR #1510 added an optimization for commonly used lists/arrays (elements of type Bool/Int32/Int64/Float); the performance for those should be much faster. However, there is not much improvement for the test case above.
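For illustration, a possible workaround under that fast path is to return the data as a single flat TList(TInt32) and do the indexing on the kernel side. This is only a sketch of the idea, not a measured benchmark; the class name TestFlatListRPC and the row-major indexing scheme are assumptions for illustration, not part of the original report.

from artiq.experiment import *

class TestFlatListRPC(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def make_data_flat(self) -> TList(TInt32):
        # Same 1000 values as before, but as one flat list of Int32,
        # which the #1510 fast path can transfer without a per-element allocation roundtrip.
        return [0] * 1000

    @kernel
    def run(self):
        t0 = self.core.get_rtio_counter_mu()
        data = self.make_data_flat()
        print(self.core.get_rtio_counter_mu() - t0)
        # Element (i, j) of the original 1000x1 nested layout becomes data[i * 1 + j].
        print("Got data, first element:", data[0])

The trade-off is that only a single flat buffer has to be allocated on the core device, at the cost of doing the index arithmetic manually in kernel code.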

pca006132 avatar Sep 03 '20 03:09 pca006132

In my mind, the solution here is really just for the comms CPU (core0) not to interpret any of the payload, but ship it to the kernel CPU (core1) instead, where everything can be done locally, without any locking.

dnadlinger avatar Sep 04 '20 17:09 dnadlinger

In my mind, the solution here is really just for the comms CPU (core0) not to interpret any of the payload, but ship it to the kernel CPU (core1) instead, where everything can be done locally, without any locking.

Not sure about or1k, but for zynq, the async runtime expects to run on a single core, so we cannot really move that to core1.

One way to solve the problem is to modify the protocol to include the length of the return buffer at the start, read the whole buffer on core0, and copy it to core1, but I am not sure about the performance of this approach. I don't think the overhead for sync_channel without a context switch is that big.
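To make the framing idea concrete, here is a minimal Python sketch of a length-prefixed reply (the 4-byte little-endian header and the function names are assumptions for illustration; the actual firmware is not written in Python): the total payload size is sent first, so core0 can read the entire buffer in one pass and hand it over without interpreting its contents.

import struct

def encode_reply(payload: bytes) -> bytes:
    # Prepend the total payload length so the comms CPU can receive
    # the entire reply with a single read before handing it to core1.
    return struct.pack("<I", len(payload)) + payload

def decode_reply(stream) -> bytes:
    # Read the 4-byte length header, then the full payload in one pass,
    # instead of interpreting the contents element by element.
    (length,) = struct.unpack("<I", stream.read(4))
    return stream.read(length)

With this framing, core0 only needs to know how many bytes to forward; all per-element interpretation and allocation would then happen locally on core1.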

pca006132 avatar Sep 05 '20 05:09 pca006132

Just for the record, the above test case still takes about 0.5 s on 1f40f3ce157f518ef03474d1408b08cd7c50e994/Kasli – but hey, at least it doesn't time out anymore.

dnadlinger avatar Mar 20 '21 01:03 dnadlinger