Coreanalyzer finding unknown exception messages

Open cjbe opened this issue 3 years ago • 8 comments

Bug Report

One-Line Summary

The core analyzer tool is reporting unknown exception types - this suggests data corruption somewhere on the core device.

Issue Details

Across several different Artiq setups, we have observed that the core analyzer frequently fails with an exception along the lines of ValueError: 188 is not a valid ExceptionType (where the exception number changes between runs).

Adding some diagnostic prints, from one run we see a variety of exception types:

Unknown exception type 188 = 0b10111100
Unknown exception type  34 = 0b00100010
Unknown exception type  60 = 0b00111100

From the gateware here it looks like there should only be two valid exception types, underflow (0b00010100) and overflow (0b00100001). Hence it is concerning that we are getting these "impossible" exception messages.
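A minimal sketch of the kind of diagnostic print described above, assuming a stand-in enum that contains only the two codes quoted in this issue. The class here is hypothetical and redefined locally for illustration (the real ARTIQ enum may define more members), and `describe_exception_type` is a made-up helper name, not part of ARTIQ.

```python
from enum import Enum


class ExceptionType(Enum):
    # Stand-in enum: only the two codes cited in this issue (assumption).
    o_underflow = 0b00010100
    i_overflow = 0b00100001


def describe_exception_type(code):
    """Return the member name, or a diagnostic string for an unknown code."""
    try:
        return ExceptionType(code).name
    except ValueError:
        # Enum lookup by value raises ValueError for values with no member.
        return "Unknown exception type {:3d} = 0b{:08b}".format(code, code)


# The two valid codes, followed by the three unknown codes reported above.
for code in (0b00010100, 0b00100001, 188, 34, 60):
    print(describe_exception_type(code))
```

Catching the ValueError instead of letting it propagate keeps decoding going, so every unexpected code in a run can be collected rather than only the first one.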

From the collective experience at Oxford, it seems like we see this on all our systems, and it has been around for a while:

  1. is this something everybody else sees?
  2. is it possible this was introduced with some of the RTIO changes a year or so ago?

Your System (omit irrelevant parts)

I am working from the current Artiq master, using a Kasli 2.0. I have seen this over at least 3 separate core devices, running different builds of gateware, over several versions of Artiq, and I believe on Kasli 1 as well as Kasli 2.

All our systems have at least one DRTIO satellite, and use SUServo.

cjbe avatar Aug 04 '21 15:08 cjbe

May or may not be related: I've seen it now three times that a bitstream was broken to the extent that it did not boot at all (no messages on the terminal). Reproducibly broken across rebuilds of the same bitstream. But random gateware changes or in at least two cases just reordering device instantiations in the bitstream made it work. No hardware changes, Kasli 2 in all cases, no RTIO, SU-Servo involved. There is always the chance that Vivado does something wrong but it seems a bit too frequent.

jordens avatar Aug 04 '21 16:08 jordens

> May or may not be related: I've seen it now three times that a bitstream was broken to the extent that it did not boot at all (no messages on the terminal). Reproducibly broken across rebuilds of the same bitstream. But random gateware changes or in at least two cases just reordering device instantiations in the bitstream made it work. No hardware changes, Kasli 2 in all cases, no RTIO, SU-Servo involved. There is always the chance that Vivado does something wrong but it seems a bit too frequent.

I had that issue a lot with Sayma. Haven't hit it with Kasli myself though.

hartytp avatar Aug 04 '21 16:08 hartytp

@cjbe Can you dump the entire raw analyzer buffer when the problem occurs? There may be more corruption than the exception number, but the rest is more silent.

And just checking - did your bitstream meet timing?

sbourdeauducq avatar Aug 05 '21 01:08 sbourdeauducq

> Can you dump the entire raw analyzer buffer when the problem occurs? There may be more corruption than the exception number, but the rest is more silent.

Here is a raw dump that includes some corruption: dump.zip. You are right - there is some additional misbehaviour. As well as 2 unknown Exception types, there are 4 Stopped messages.
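For anyone wanting to eyeball such a dump, here is a rough sketch that splits a raw byte buffer into fixed-size records and prints each one as hex. The helper name `hexdump_records` is hypothetical, and the record size is only an illustrative parameter, not a documented property of the analyzer format.

```python
def hexdump_records(raw, record_size=32):
    """Format a raw byte buffer as one hex line per fixed-size record.

    record_size is an assumption for illustration; adjust it to the
    actual message size of the dump being inspected.
    """
    lines = []
    for offset in range(0, len(raw), record_size):
        record = raw[offset:offset + record_size]
        hex_bytes = " ".join("{:02x}".format(b) for b in record)
        lines.append("{:08x}  {}".format(offset, hex_bytes))
    return lines


# Synthetic data only; a real buffer would be read from the dump file.
sample = bytes(range(40))
for line in hexdump_records(sample, record_size=16):
    print(line)
```

Printing one record per line makes it easier to spot whether corruption is confined to a single field or spread across whole messages.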

> And just checking - did your bitstream meet timing?

Yes! I have seen this on dozens of different builds, with different Artiq versions, and different gateware targets, all of which have met timing. Hence it seems likely that this is a true logic bug somewhere, rather than a miscompilation. I believe I have seen this with different versions of Vivado as well (but I am not sure about this).

cjbe avatar Aug 05 '21 14:08 cjbe

One thing to check: we are seeing that (since, with Kasli 2, the external ref clock now goes to the FPGA first and not the Si5324) even a slightly high reference clock (powers that are below the usual ref clock power in a lab) will reliably cause very hard-to-trace failures (e.g. the Ethernet link not working sometimes, probably depending on the skew w.r.t. the Si5324 clock). Plausible, since the ESD diodes will happily forward the reference clock to the supply rails. @SingularitySurfer is filing an issue with some more details.

jordens avatar Aug 05 '21 16:08 jordens

@jordens good to know! I don't think that's the issue here - I have seen this on lots of systems without external clocks.

cjbe avatar Aug 05 '21 19:08 cjbe

@jordens: Just to double-check, what should roughly be the limit for "safe" powers at the input (wiki states +10 dBm)? Not sure I'm looking at the right FPGA pin docs.

dnadlinger avatar Aug 09 '21 09:08 dnadlinger

I'm not sure. 10 dBm sounds high. The wiki may well be referring to v1.

jordens avatar Aug 09 '21 09:08 jordens