core device execution time variation
Bug Report
Issue Details
I observe large variations in the execution time of kernels on the core device (Kasli v1.x). While the standard deviation of the execution time may be small, a small number of cases show anomalously long processing delays. That is, the execution time distribution has a long tail, implying non-Gaussian statistics. This can be observed in the code examples below.
In my lab's experiment codebase we make heavy use of abstraction. We write modular kernel methods with sufficient built-in slack to reliably run without RTIOUnderflow (e.g. Doppler cooling, gates, detection). We combine these modular elements to produce large monolithic kernels. Unfortunately, the slack that works for the atomic modules often proves insufficient when many modules are called in succession. We suspect the slack performs a random walk; given a sufficient number of reps it goes negative and we hit RTIOUnderflow.
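The random-walk picture can be illustrated with a standalone Monte Carlo sketch (plain NumPy, not ARTIQ; the log-normal overhead distribution and all numbers here are illustrative assumptions, not measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

def underflow_probability(n_modules, slack_ns, n_shots=10_000):
    """Estimate the chance that a chain of modules hits RTIOUnderflow.

    Each module contributes `slack_ns` of timeline headroom but costs a
    random CPU overhead drawn from a long-tailed (log-normal) distribution.
    An underflow occurs if the cumulative overhead ever exceeds the
    cumulative slack.  Illustrative model only.
    """
    overhead = rng.lognormal(mean=np.log(300.0), sigma=0.5,
                             size=(n_shots, n_modules))
    net = slack_ns - overhead           # per-module net change in slack
    running_slack = np.cumsum(net, axis=1)
    return float(np.mean((running_slack < 0).any(axis=1)))

# Slack that is ample for a single module can still fail when many
# modules run back to back:
p_single = underflow_probability(1, slack_ns=600.0)
p_chain = underflow_probability(100, slack_ns=600.0)
print(p_single, p_chain)
```

Even with positive average headroom per module, the long tail makes the failure probability grow with chain length.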
Actual (undesired) Behavior
Below are tables showing the execution time in ns for 80 successive kernel calls for a variety of kernel methods. At the end of each table are summary statistics that exclude the first two reps. The first two are excluded from the statistics because they are known to be reproducibly slower than successive reps.
The example code relies on get_rtio_counter_mu() to put a bound on kernel execution time. I note that the non-determinism of get_rtio_counter_mu() is documented. In my tests, test_dummy() puts a bound on this contribution; the balance of the uncertainty seems attributable to the core device microprocessor.
Another manifestation of this is in https://github.com/m-labs/artiq/issues/1987.
$ artiq_run repository/core_device_process_durations_jwb.py
### test_dummy ###
0: 1296, 1: 312, 2: 320, 3: 320, 4: 320, 5: 320, 6: 320, 7: 320,
8: 320, 9: 320, 10: 320, 11: 312, 12: 320, 13: 320, 14: 312, 15: 320,
16: 320, 17: 320, 18: 320, 19: 320, 20: 320, 21: 320, 22: 320, 23: 312,
24: 320, 25: 320, 26: 320, 27: 320, 28: 320, 29: 320, 30: 312, 31: 320,
32: 320, 33: 312, 34: 320, 35: 320, 36: 320, 37: 320, 38: 320, 39: 312,
40: 312, 41: 320, 42: 320, 43: 320, 44: 320, 45: 312, 46: 320, 47: 320,
48: 320, 49: 320, 50: 320, 51: 320, 52: 320, 53: 320, 54: 320, 55: 320,
56: 320, 57: 320, 58: 320, 59: 320, 60: 320, 61: 320, 62: 312, 63: 320,
64: 320, 65: 320, 66: 320, 67: 320, 68: 320, 69: 320, 70: 320, 71: 320,
72: 320, 73: 320, 74: 320, 75: 312, 76: 320, 77: 320, 78: 312, 79: 320,
mean: 319, std: 3, max deviation: 8, max frac deviation: 0.025
### test_ttl_pulse ###
0: 4320, 1: 920, 2: 904, 3: 896, 4: 896, 5: 896, 6: 904, 7: 904,
8: 904, 9: 896, 10: 904, 11: 896, 12: 896, 13: 904, 14: 896, 15: 896,
16: 896, 17: 904, 18: 896, 19: 896, 20: 904, 21: 896, 22: 896, 23: 904,
24: 896, 25: 904, 26: 904, 27: 896, 28: 896, 29: 904, 30: 904, 31: 896,
32: 904, 33: 904, 34: 904, 35: 896, 36: 896, 37: 904, 38: 904, 39: 896,
40: 896, 41: 896, 42: 896, 43: 896, 44: 896, 45: 904, 46: 904, 47: 896,
48: 896, 49: 904, 50: 904, 51: 896, 52: 896, 53: 896, 54: 904, 55: 896,
56: 896, 57: 904, 58: 904, 59: 896, 60: 904, 61: 904, 62: 904, 63: 896,
64: 896, 65: 904, 66: 904, 67: 904, 68: 896, 69: 896, 70: 896, 71: 904,
72: 904, 73: 896, 74: 896, 75: 904, 76: 904, 77: 896, 78: 896, 79: 896,
mean: 900, std: 4, max deviation: 8, max frac deviation: 0.009
### test_reset ###
0: 5814680, 1: 4844920, 2: 4875376, 3: 4876640, 4: 4929560, 5: 5809232, 6: 4876648, 7: 4878136,
8: 4876536, 9: 5859448, 10: 4915488, 11: 4825176, 12: 4876984, 13: 4879168, 14: 5910032, 15: 4876536,
16: 4825952, 17: 4877464, 18: 4930624, 19: 5858080, 20: 4878960, 21: 4875568, 22: 4826144, 23: 5911616,
24: 4863920, 25: 4872912, 26: 4882088, 27: 4876640, 28: 5859728, 29: 4876536, 30: 4878256, 31: 4876456,
32: 4930040, 33: 5807816, 34: 4876368, 35: 4878208, 36: 4877904, 37: 5859864, 38: 4863872, 39: 4877456,
40: 4877024, 41: 4876944, 42: 5859152, 43: 4879648, 44: 4876496, 45: 4877056, 46: 4928392, 47: 5808752,
48: 4878328, 49: 4876344, 50: 4877624, 51: 5858760, 52: 4913240, 53: 4879072, 54: 4876544, 55: 4826096,
56: 5911192, 57: 4876440, 58: 4878592, 59: 4877904, 60: 4876592, 61: 5859632, 62: 4876592, 63: 4877936,
64: 4861432, 65: 4945680, 66: 5807392, 67: 4876568, 68: 4877800, 69: 4878480, 70: 5857784, 71: 4862752,
72: 4878096, 73: 4877064, 74: 4878816, 75: 5858208, 76: 4878056, 77: 4876640, 78: 4877704, 79: 4930336,
mean: 5079931, std: 395087, max deviation: 1086440, max frac deviation: 0.214
### test_simple_float ###
0: 1704, 1: 336, 2: 320, 3: 320, 4: 320, 5: 320, 6: 320, 7: 320,
8: 320, 9: 320, 10: 312, 11: 320, 12: 320, 13: 312, 14: 312, 15: 320,
16: 320, 17: 312, 18: 312, 19: 320, 20: 320, 21: 320, 22: 320, 23: 312,
24: 320, 25: 320, 26: 320, 27: 320, 28: 320, 29: 320, 30: 320, 31: 320,
32: 320, 33: 320, 34: 320, 35: 320, 36: 312, 37: 320, 38: 312, 39: 312,
40: 312, 41: 320, 42: 320, 43: 320, 44: 320, 45: 320, 46: 312, 47: 312,
48: 320, 49: 312, 50: 312, 51: 320, 52: 320, 53: 312, 54: 312, 55: 320,
56: 320, 57: 312, 58: 312, 59: 320, 60: 312, 61: 312, 62: 320, 63: 312,
64: 312, 65: 320, 66: 320, 67: 320, 68: 320, 69: 320, 70: 320, 71: 320,
72: 320, 73: 320, 74: 312, 75: 320, 76: 320, 77: 312, 78: 312, 79: 320,
mean: 317, std: 4, max deviation: 8, max frac deviation: 0.025
### test_long_float ###
0: 5027304, 1: 4846088, 2: 4843176, 3: 4857208, 4: 4851056, 5: 4840888, 6: 4848184, 7: 4847440,
8: 4845248, 9: 4852192, 10: 4846528, 11: 4843688, 12: 4853344, 13: 4848872, 14: 4845480, 15: 4847336,
16: 4845144, 17: 4858360, 18: 4848984, 19: 4845440, 20: 4847328, 21: 4845112, 22: 4848144, 23: 4843696,
24: 4850608, 25: 4956864, 26: 4847328, 27: 4845336, 28: 4848136, 29: 4845808, 30: 4848472, 31: 4841272,
32: 4841992, 33: 4846704, 34: 4851592, 35: 4846552, 36: 4843512, 37: 4858800, 38: 4855152, 39: 4842928,
40: 4849728, 41: 4847440, 42: 4845256, 43: 4852216, 44: 4846520, 45: 4843552, 46: 4853344, 47: 4848984,
48: 4842704, 49: 4849600, 50: 4845136, 51: 4858352, 52: 4848976, 53: 4845440, 54: 4847280, 55: 4845328,
56: 4848144, 57: 4843832, 58: 4850608, 59: 4841728, 60: 4842024, 61: 4846704, 62: 4851632, 63: 4846552,
64: 4843640, 65: 4858816, 66: 4854456, 67: 4852744, 68: 4846424, 69: 4857208, 70: 4851056, 71: 4840880,
72: 4848184, 73: 4847368, 74: 4851296, 75: 4843832, 76: 4850736, 77: 4841720, 78: 4842232, 79: 4850928,
mean: 4849314, std: 13044, max deviation: 115984, max frac deviation: 0.024
### test_urukul ###
0: 24992, 1: 5424, 2: 5328, 3: 5976, 4: 5352, 5: 5304, 6: 5368, 7: 5264,
8: 5344, 9: 5304, 10: 5224, 11: 5336, 12: 5608, 13: 5992, 14: 5288, 15: 5312,
16: 5608, 17: 5328, 18: 5336, 19: 5360, 20: 5512, 21: 5328, 22: 5400, 23: 5176,
24: 5416, 25: 5336, 26: 5640, 27: 5184, 28: 5224, 29: 5288, 30: 5384, 31: 5400,
32: 5456, 33: 5392, 34: 5336, 35: 5376, 36: 5256, 37: 5384, 38: 5280, 39: 5472,
40: 5192, 41: 5336, 42: 5208, 43: 5368, 44: 5536, 45: 5352, 46: 5208, 47: 5240,
48: 5248, 49: 5592, 50: 5296, 51: 5600, 52: 5192, 53: 5272, 54: 5320, 55: 5464,
56: 5296, 57: 5288, 58: 5368, 59: 5376, 60: 5360, 61: 5360, 62: 5456, 63: 5208,
64: 5376, 65: 5248, 66: 5616, 67: 5208, 68: 5344, 69: 5288, 70: 5272, 71: 5336,
72: 5360, 73: 5208, 74: 5584, 75: 5248, 76: 5824, 77: 5368, 78: 5256, 79: 5264,
mean: 5369, std: 158, max deviation: 816, max frac deviation: 0.152
### test_dataset ###
0: 116560, 1: 72136, 2: 47848, 3: 45720, 4: 44864, 5: 44584, 6: 43184, 7: 43272,
8: 44168, 9: 44856, 10: 43952, 11: 45192, 12: 44880, 13: 45272, 14: 45744, 15: 44544,
16: 43680, 17: 45656, 18: 44592, 19: 44016, 20: 44368, 21: 44520, 22: 43920, 23: 44840,
24: 45016, 25: 44096, 26: 44592, 27: 45736, 28: 44520, 29: 45104, 30: 43736, 31: 43952,
32: 45632, 33: 44816, 34: 44376, 35: 45416, 36: 44520, 37: 46088, 38: 43912, 39: 44576,
40: 44456, 41: 44248, 42: 44072, 43: 44136, 44: 44640, 45: 46440, 46: 45312, 47: 44840,
48: 44936, 49: 44624, 50: 44640, 51: 43704, 52: 43896, 53: 44416, 54: 45656, 55: 44840,
56: 45256, 57: 44448, 58: 45616, 59: 43960, 60: 43696, 61: 44544, 62: 44520, 63: 44064,
64: 44056, 65: 74952, 66: 44480, 67: 44624, 68: 43856, 69: 43984, 70: 43624, 71: 43496,
72: 44632, 73: 44696, 74: 44776, 75: 45136, 76: 45856, 77: 45104, 78: 43992, 79: 45376,
mean: 45043, std: 3493, max deviation: 31768, max frac deviation: 0.705
Repeat these tests 800 times per loop and only print the statistics.
$ artiq_run repository/core_device_process_durations_jwb.py
### test_dummy ###
mean: 318, std: 3, max deviation: 8, max frac deviation: 0.025
### test_ttl_pulse ###
mean: 901, std: 4, max deviation: 8, max frac deviation: 0.009
### test_reset ###
mean: 5088269, std: 398924, max deviation: 1284992, max frac deviation: 0.253
### test_simple_float ###
mean: 318, std: 4, max deviation: 8, max frac deviation: 0.025
### test_long_float ###
mean: 4844143, std: 10057, max deviation: 122656, max frac deviation: 0.025
### test_urukul ###
mean: 5365, std: 151, max deviation: 1016, max frac deviation: 0.189
### test_dataset ###
mean: 44621, std: 836, max deviation: 10920, max frac deviation: 0.245
It's common to observe that tests showing small variability show larger variability a minute later. Compare the maximum fractional deviation for test_dataset above with 80 reps versus 800 reps.
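This is what long-tailed statistics predict: the sample range, and hence the max fractional deviation reported above, keeps growing as the number of reps increases. A quick sketch with an illustrative distribution (not measured data):

```python
import numpy as np

rng = np.random.default_rng(1)

def max_frac_deviation(samples):
    """(max - min) / mean -- the statistic reported in the tables above."""
    return (samples.max() - samples.min()) / samples.mean()

# Draw 800 illustrative durations; treat the first 80 as the short run.
# Because the short run is a subset, its range can never exceed the full
# run's range, and in practice the 800-rep spread is noticeably larger.
durations = rng.lognormal(mean=np.log(44_600.0), sigma=0.02, size=800)
mfd_80 = max_frac_deviation(durations[:80])
mfd_800 = max_frac_deviation(durations)
print(mfd_80, mfd_800)
```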
Steps to Reproduce
from artiq.experiment import *
import numpy as np


class CoreDeviceProcessDurationsJWB(EnvExperiment):
    kernel_invariants = {"ftw0", "asf0", "mindelay"}

    def build(self):
        self.trials = 800
        self.setattr_device("core")
        self.setattr_device("ttl7")
        self.ttl = self.ttl7
        self.urukul_channel = self.get_device("urukul0_ch0")
        self.urukul_switch = self.get_device("ttl_urukul0_sw0")
        # self.zotino = self.get_device("zotino0")
        self.ftw0 = self.urukul_channel.frequency_to_ftw(100*MHz)
        self.asf0 = self.urukul_channel.amplitude_to_asf(0.85)
        self.mindelay = 1*us

    def prepare(self):
        self.durations = np.full(self.trials, 0, dtype=np.int64)

    def print_results(self, test_name, durations):
        print("### {} ###".format(test_name))
        # Exclude the first two reps: they are reproducibly slower.
        first = 2
        mean = np.mean(durations[first:])
        std = np.std(durations[first:])
        maxdev = np.max(durations[first:]) - np.min(durations[first:])
        maxfdev = maxdev / mean
        s = ""
        for i in range(len(durations)):
            s += "{:3}: {:9.0f}, ".format(i, durations[i])
            if (i + 1) % 8 == 0:
                s += "\n"
        # print(s)  # uncomment to print the per-rep durations
        s = "mean: {:9.0f}, std: {:9.0f}, max deviation: {:9.0f}, max frac deviation: {:.3f}\n".format(
            mean, std, maxdev, maxfdev)
        print(s)

    def run(self):
        tests = ["test_dummy", "test_ttl_pulse", "test_reset",
                 "test_simple_float", "test_long_float",
                 "test_urukul", "test_dataset"]
        for test in tests:
            testf = getattr(self, test)
            durations = self.timer(testf)
            self.print_results(test, durations)

    @kernel
    def timer(self, kernel_method):
        self.core.reset()
        delay(1*ms)
        for i in range(self.trials):
            delay(1*ms)
            t1 = self.core.get_rtio_counter_mu()
            td = kernel_method()
            delay(td)  # default td=self.mindelay; required so kernel_method() is not optimized away
            t2 = self.core.get_rtio_counter_mu()
            self.durations[i] = t2 - t1
            self.core.wait_until_mu(now_mu())
        return self.durations

    @kernel
    def test_dummy(self):
        return self.mindelay

    @kernel
    def test_ttl_pulse(self):
        self.ttl.pulse(self.mindelay)
        return self.mindelay

    @kernel
    def test_reset(self):
        self.core.reset()
        return self.mindelay

    @kernel
    def test_simple_float(self):
        td = np.cos(3.14159/2) + np.sin(0.0001) + np.log(1.001)  # ~0
        return self.mindelay + td*ns

    @kernel
    def test_long_float(self):
        td = 0.0
        for _ in range(10):
            td -= np.cos(3.14159/2) + np.sin(td) + np.log(1.00001)  # ~0
            td += np.cos(3.14159/2) + np.sin(td) + np.log(1.00001)  # ~0
        return self.mindelay + td*ns

    @kernel
    def test_urukul(self):
        delay(1*ms)
        self.urukul_channel.set_mu(ftw=self.ftw0, asf=self.asf0)
        self.urukul_switch.pulse(10*us)
        return self.mindelay

    @kernel
    def test_dataset(self):
        # Note that set_dataset is decorated with @rpc(flags={"async"})
        self.set_dataset("ds0", 1.1, broadcast=True)
        return self.mindelay
Expected Behavior
- Core device execution time variation is manageable if it is bounded. In ARTIQ 7 it does not appear to be well bounded.
- Presently we add huge delays to each experiment repetition to patch over this class of problem. This workaround comes at the cost of a much reduced experiment duty cycle.
- Subjectively this problem seems more prevalent in ARTIQ 7 than previous ARTIQ releases. Is it related to the switch to VexRiscv from mor1kx? Are there benchmarks for execution variation in the ARTIQ codebase? It would be helpful to track this in order to detect performance regressions.
Questions
- Why are the first one or two reps of each loop so much slower than subsequent reps?
- Would a bare VexRiscv microprocessor have deterministic execution?
- What are the leading factors that cause the ARTIQ MiSOC to suffer from non-determinism?
System
- Operating System: Ubuntu
- ARTIQ version: latest commit of tag version 7
- Environment is nix shell
- Hardware involved: Kasli v1.x
The only way to guarantee fully deterministic CPU execution times is to use SRAM only; as soon as you use SDRAM and caches then you have this sort of issue. Zynq is even worse though the much faster speed makes up for it.
SRAM is a very limited resource on FPGAs and it tends to be slow for large blocks. Can you use ARTIQ_DUMP_ELF to get the binary for the largest kernel in your experiment and see if it could fit?
I assume you mean BRAM, not SRAM.
Can you use ARTIQ_DUMP_ELF to get the binary for the largest kernel in your experiment and see if it could fit?
The constraints for code compiled to ELF are extreme:
- raise ValueError("Experiment entry point must be a kernel")
- raise ValueError("Experiment must not use RPC")

I can't refactor my codebase subject to these constraints; I depend too heavily on many higher-level ARTIQ features.
I can see this is a problem inherent to SDRAM (link). Have you looked into using RLDRAM? The arXiv paper describes a memory controller design that predictably manages accesses to the RLDRAM.
BRAM is a particular form of SRAM. The constraints you mention are specific to artiq_compile and not ARTIQ_DUMP_ELF, the latter has no restrictions. The problem is inherent to all DRAM, the paper only mentions a reduction of the latency variation, not an elimination. And the vast majority of boards don't support RLDRAM.
The constraints you mention are specific to artiq_compile and not ARTIQ_DUMP_ELF, the latter has no restrictions.
I'm not familiar with this feature of ARTIQ. Please give an example of how to use it to do an ELF dump. There are a lot of options for the Target class...
The problem is inherent to all DRAM, the paper only mentions a reduction of the latency variation, not an elimination. And the vast majority of boards don't support RLDRAM.
A 10x reduction in latency variation would be a big win. Sinara builds its own boards, so that's not necessarily a roadblock. I guess it would entail writing another memory controller. @gkasprow Have you worked with RLDRAM?
Please give an example of how to use it to do an ELF dump.
Just set the ARTIQ_DUMP_ELF environment variable to a string (which will become the file name), either when calling artiq_run or when starting artiq_master (the latter, IIRC, will just write to the result file directory).
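Concretely, something like this (the dump file name and experiment path are illustrative; run inside the ARTIQ environment):

```shell
# ARTIQ_DUMP_ELF's value becomes the output file name (illustrative names).
ARTIQ_DUMP_ELF=artiq_dump.elf artiq_run repository/core_device_process_durations_jwb.py

# Inspect the size of the dumped kernel binary.
ls -lh artiq_dump.elf
```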
A 10x reduction in latency variation would be a big win. Sinara builds its own boards, so that's not necessarily a roadblock.
Just put SRAM on the board then. But SRAM is expensive and we need to support the existing fleet as well as new boards for which SRAM isn't necessarily a good option. Preloading of kernel or kernel parts into existing on-chip SRAM (e.g. FPGA BRAM) might be better.
I followed the advice of @dnadlinger for a typical experiment in my lab. Here's the size.
$ ls -lh
total 1.8M
-rw-r--r-- 1 iirabi iirabi 33K Nov 14 17:38 artiq_dump.elf
How much BRAM is typically available on a Kasli running ARTIQ? Is there an established path for fully loading a kernel into BRAM?
33K would fit in BRAM but needs gateware, firmware and perhaps (minor) compiler modifications.
Is there a way to do an initial test and see how the latency variation improves? Has the latency variation for Zynq been evaluated yet? It may be that both core device variants would benefit from use of BRAM for sufficiently simple kernels.
Zynq would use OCM not BRAM.
Has the latency variation for Zynq been evaluated yet? I'm curious how the test I wrote above would perform.