opentitan icon indicating copy to clipboard operation
opentitan copied to clipboard

[opentitantool] Opentitantool is very slow on handling console output, writes 1 char at a time, cause data drops.

Open vsukhoml opened this issue 9 months ago • 12 comments

Description

I was trying to dump plenty of data using UART console, such as dumps lasted more than 666s (funny constant, but it was minimal when I noticed it. I was dumping lines of 64 hex characters and then noticed that symbols started to drop. In one place I counted that 158 (mod 256) chars were dropped. And in some cases uart printing just hang.

It seems that UART.STATUS may be not set in the correct value, so I tried to modify UART to use UART.FIFO_STATUS_TXLVL as indicator of non-empty FIFO, but it didn't help either. Tried several other than dif_uart_byte_send_polled() implementations, but all resulted in same behavior. Adding reset of FIFO seemed to help though to avoid hangs.

Original code:

static void print_uart(const char *buf, size_t len) {
 for (size_t i = 0; i<len; i++) {
    while (dif_uart_byte_send_polled(uart, buf[i])!=kDifOk) ;
  }
}

Then

static void print_uart(const char *buf, size_t len) {
  size_t sent = 0;

  while (len && dif_uart_bytes_send(uart, (const uint8_t *)buf, len, &sent) !=
                    kDifOk) {
    len -= sent;
    sent = 0;
  }
  return;
}

and finally:

// Custom UART stack for debugging
static bool uart_tx_full(const dif_uart_t *uart) {
  uint32_t reg = mmio_region_read32(uart->base_addr, UART_STATUS_REG_OFFSET);
  return bitfield_bit32_read(reg, UART_STATUS_TXFULL_BIT);
}

static bool uart_tx_idle(const dif_uart_t *uart) {
  uint32_t reg = mmio_region_read32(uart->base_addr, UART_STATUS_REG_OFFSET);
  return bitfield_bit32_read(reg, UART_STATUS_TXIDLE_BIT);
}

static uint32_t uart_tx_level(const dif_uart_t *uart) {
  uint32_t reg =
      mmio_region_read32(uart->base_addr, UART_FIFO_STATUS_REG_OFFSET);
  return bitfield_field32_read(reg, UART_FIFO_STATUS_TXLVL_FIELD);
}

static void uart_tx_fifo_write(const dif_uart_t *uart, uint8_t byte) {
  uint32_t reg = bitfield_field32_write(0, UART_WDATA_WDATA_FIELD, byte);
  mmio_region_write32(uart->base_addr, UART_WDATA_REG_OFFSET, reg);
}

static void uart_fifo_reset(const dif_uart_t *uart) {
  // Write to the relevant bits clears the FIFOs.
  uint32_t reg = mmio_region_read32(uart->base_addr, UART_FIFO_CTRL_REG_OFFSET);
  reg = bitfield_bit32_write(reg, UART_FIFO_CTRL_RXRST_BIT, true);
  reg = bitfield_bit32_write(reg, UART_FIFO_CTRL_TXRST_BIT, true);
  mmio_region_write32(uart->base_addr, UART_FIFO_CTRL_REG_OFFSET, reg);
}

static void print_uart(const char *buf, size_t len) {
  while (len) {
    if (uart_tx_level(uart) < 16) {
      uart_tx_fifo_write(uart, *buf++);
      len--;
    }
  }
  // Busy wait for the TX FIFO to be drained and for HW to finish processing
  // the last byte.
  while (!uart_tx_idle(uart)) {
  }
  uart_fifo_reset(uart);
  return;
}

All resulted in the same thing.

Test code:

static size_t print_hex_uart(const uint8_t *buf, size_t len, size_t counter) {
  char out[3];
  for (size_t i = 0; i < len; ++i) {
    uint8_t l = buf[i] & 0xf;
    uint8_t h = buf[i] >> 4;
    l += (l < 10) ? '0' : 'A' - 10;
    h += (h < 10) ? '0' : 'A' - 10;
    out[0] = h;
    out[1] = l;
    out[2] = '\n';
    counter += 2;
    if ((counter & 63) == 0)
      print_uart(out, 3);
    else
      print_uart(out, 2);
  }
  return counter;
}

status_t uart_dump(void) {
  int32_t rounds = 100000;
  uint8_t buf[256];
  uint32_t uart_counter = 0;
  uint64_t elapsed = 0;

  for (size_t i = 0; i < ARRAYSIZE(buf); i++)
    buf[i] = (uint8_t)i;

  ibex_timeout_t tmo = ibex_timeout_init(UINT32_MAX);
  while (rounds) {
    uart_counter = print_hex_uart(buf, sizeof(buf), uart_counter);
    rounds--;
  }
  elapsed += ibex_timeout_elapsed(&tmo);
  uart_counter += (uart_counter / 64);  // account for '\n'
  // Make sure the FIFO did not overflow.
  uint64_t freq =
      udiv64_slow((uint64_t)(uart_counter) * (uint64_t)1000000, elapsed, NULL);
  LOG_INFO("done in %ums (~ %usamples/s)",
           (uint32_t)udiv64_slow(elapsed, 1000, NULL), (uint32_t)freq);

  return OK_STATUS();
}

bool test_main(void) {
  // This is a hack to quickly get UART access and avoid slow base_printf()
  uart = (dif_uart_t *) ottf_console_get();
  CHECK_DIF_OK(dif_uart_watermark_tx_set(uart, kDifUartWatermarkByte64));
  status_t test_result = OK_STATUS();
  EXECUTE_TEST(test_result, uart_dump);
  return status_ok(test_result);
}

vsukhoml avatar May 03 '24 16:05 vsukhoml

Normal prints would look like:

000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F
404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F
606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F
808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F
A0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF
C0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF
E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF

And example for broken:

202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F
404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F
606162636465666768696A6B6C6D6E6F7071727374757677786C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDF
E0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F
404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F
606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F
808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9F
A0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBF
C0C1C112131415161718191A1B1C1D1E1F
202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F

Normal output was first 665s, then it randomly started to make ragged lines with dropped characters.

vsukhoml avatar May 03 '24 16:05 vsukhoml

We've been noticing this as well during testing. In the past we had a test that could reliably cause characters to be dropped by changing the parity expected by the HyperDebug board. The characters were still present on the line between the device and the HyperDebug (confirmed with a logic analyzer) but HyperDebug itself was dropping them.

If I remember correctly, we didn't fix the bug in HyperDebug we just worked around it.

jwnrt avatar May 03 '24 17:05 jwnrt

Yes, I suspected Hyperdebug as well, but can't connect why it happens after certain time and then repeats ever after. Re-running test "resets" it. And hang happened on OT side - looks like some "infinite" loop waiting for TX idle or TX empty w/o timeouts. This suggests that issue may be originated on OT.

For drivers - I'd add a timeout in uart_putchar() and similar stuff, and this timeout can be derived from UART speed. e.g. a time to send 32 chars or so. At least this would prevent hangs.

vsukhoml avatar May 03 '24 17:05 vsukhoml

Baseless speculation but there could be a problem with the XON/XOFF flow control. Maybe OpenTitan sends a character while HyperDebug's buffers are full before it can ask it to stop. That might explain why it happens after a longer time running if there's more chance of the buffers filling up.

jwnrt avatar May 03 '24 17:05 jwnrt

I connected FTDI and dumped output. Output from OT UART is correct, nothing hanged, nothing lost.

But Opentitantool behavior is bad - it is extremely slow. strace on it gives:

write(1, "A", 1)                        = 1
write(1, "F", 1)                        = 1
write(1, "E", 1)                        = 1
write(1, "7", 1)                        = 1
write(1, "F", 1)                        = 1
write(1, "C", 1)                        = 1
write(1, "B", 1)                        = 1
write(1, "A", 1)                        = 1
write(1, "\r", 1)                       = 1
write(1, "\n", 1)                       = 1
write(1, "6", 1)                        = 1
write(1, "2", 1)                        = 1
write(1, "1", 1)                        = 1
write(1, "C", 1)                        = 1
write(1, "E", 1)                        = 1
write(1, "D", 1)                        = 1
write(1, "D", 1)                        = 1
write(1, "4", 1)                        = 1
write(1, "3", 1)                        = 1
write(1, "B", 1)                        = 1
write(1, "C", 1)                        = 1
write(1, "3", 1)                        = 1
write(1, "6", 1)                        = 1

It is practically stuck in writing event log. What I saw as a hang - was some issue at opentitan tool side.

vsukhoml avatar May 03 '24 18:05 vsukhoml

@jwnrt so, I looked into it and reclassified issue to be opentitantool related. Basically it is very inefficient on storing test logs into file and can't come up with full 115200 bit/s output because of use of slow syscalls in an unbuffered way. Looking more at strace dump I noticed:

write(1, "1", 1)                        = 1
ppoll([{fd=8, events=POLLIN}], 1, {tv_sec=0, tv_nsec=1000000}, [], 8) = 1 ([{fd=8, revents=POLLIN}], left {tv_sec=0, tv_nsec=997676})
read(8, "8CC9D1B513654D65EB8BE1A90BE81074"..., 256) = 256
write(1, "8", 1)                        = 1
write(1, "C", 1)                        = 1

Which means it reads data in chunks of 256 bytes, but writes as 1 byte 256 times in between, causing delays. These delays are likely result in overflowing buffers at Hyperdebug (out of memory). What I thought was a "hang" was actually bazel never receiving test result from hyperdebug since it was dropped.

vsukhoml avatar May 03 '24 19:05 vsukhoml

https://github.com/lowRISC/opentitan/commit/f38b01f4d453f6a618482b2897009938ba392680 seems to be a culprit.

reverting let mut buf = [0u8; 1]; back to let mut buf = [0u8; 256]; helps with output.

I tried this implementation which also removes 1-byte output on console:

    fn uart_read<T>(
        &mut self,
        device: &T,
        timeout: Duration,
        stdout: &mut Option<&mut dyn Write>,
    ) -> Result<bool>
    where
        T: ConsoleDevice + ?Sized,
    {
        let mut buf = [0u8; 256];
        let len = device.console_read(&mut buf, timeout)?;
        if len == 0 {
            return Ok(false);
        }
        if self.timestamp {
            let mut last_new_line: usize = 0;
            for i in 0..len {
                if self.newline {
                    // Print since previous newline
                    stdout
                        .as_mut()
                        .map_or(Ok(()), |out| out.write_all(&buf[last_new_line..i + 1]))?;
                    let t = humantime::format_rfc3339_millis(SystemTime::now());
                    stdout.as_mut().map_or(Ok(()), |out| {
                        out.write_fmt(format_args!("[{}  console]", t))
                    })?;
                    self.newline = false;
                }
                if buf[i] == b'\n' {
                    self.newline = true;
                    last_new_line = i + 1;
                }
            }
            stdout
                .as_mut()
                .map_or(Ok(()), |out| out.write_all(&buf[last_new_line..len]))?;
        } else {
            stdout
                .as_mut()
                .map_or(Ok(()), |out| out.write_all(&buf[..len]))?;
        }
        stdout.as_mut().map_or(Ok(()), |out| out.flush())?;

        // If we're logging, save it to the logfile.
        self.logfile
            .as_mut()
            .map_or(Ok(()), |f| f.write_all(&buf[..len]))?;
        self.append_buffer(&buf[..len]);
        Ok(true)
    }

I'm not sure what was the root cause of an issue that required this change, so not sure about side-effects on test flakiness. @nbdd0121, @timothytrippel what do you think?

vsukhoml avatar May 03 '24 22:05 vsukhoml

The commit you point out was made on purpose to mitigate test flakiness; opentitanlib was overconsuming uart data and dropping it since Uart contexts were getting created, buffering data, then dropping it. I think we need to revisit the architecture of this, but we do not want to revert back to reading more than one byte at a time until we fix the overall architecture.

timothytrippel avatar May 03 '24 22:05 timothytrippel

sure, this is where I'm trying to understand the root cause of the issue. but it sounds real issue is that there should be a layer between opentitanlib that would buffer output for later consumption, so uart data won't be lost, but I'm not very familiar with where exactly issue take place. The uart_read is confusing though as it just dumps to stdout and log file, so whatever depends on it seems to be on the other side of the pipe? This makes it even slower and more fragile. I'd rather turn uart_read in what the name suggests - just read from UART, introduce console/log on top, and may be an intermediate buffer in between.

Anyway, for my needs I can just use local patch.

vsukhoml avatar May 04 '24 15:05 vsukhoml

The issue is primarily with UartConsole::wait_for().

https://github.com/lowRISC/opentitan/issues/20580#issuecomment-1862107661

The UartConsole wants to search a buffer for matches, and when the search "reads" from the underlying UART buffer, the read is destructive to the original buffer. The architecture doesn't properly allow multiple consumers of the UART stream, and when doing the search, logging happens as a side effect of wait_for().

Prior to the switch to read just 1 character at a time, the search would greedily pull all available characters, even those beyond the matching regex. Then, that data would get dropped, as it would go out of scope once UartConsole::wait_for() returned. The change to a character at a time is quite inefficient, but addressing this tech debt has not been prioritized.

a-will avatar May 04 '24 15:05 a-will

FWIW I would much prefer if the mechanics of reading from the TTY (or whatever) were decoupled from process of searching through the contents for regex matches.

I think we should be continuously pulling data as fast as possible from the TTY into a buffer and concurrently logging the contents of that buffer. UartConsole should have its own cursor into that buffer that it uses to track wait_for matches. This would also let us do Pass!|Fail! matching in parallel with a separate cursor. Currently, tests that Fail! will usually just sit idle until Bazel times them out because most wait_for calls aren’t explicitly looking for and handling failures.

jwnrt avatar May 04 '24 18:05 jwnrt

@moidx

johngt avatar May 07 '24 10:05 johngt