riscv-openocd icon indicating copy to clipboard operation
riscv-openocd copied to clipboard

Improve stability of riscv-tests for targets with low remotetimeout value

Open aap-sc opened this issue 1 year ago • 5 comments

After https://github.com/riscv-software-src/riscv-tests/pull/553 was merged in @en-sc expressed the concern that this is a rather radical change and it reduces coverage of the test suite.

During a very heated discussion we've decided that:

  1. It would be nice if we could bring smaller timeouts back (for all targets)
  2. However, in order to do that we should get rid of the sporadic failures when these timeouts are small. Most likely the presence of these timeout errors is a result of missing keep_alive packets that OpenOCD should send to GDB - this needs to be investigated and fixed (if needed)

@en-sc , @TommyMurphyTM1234 FYI

aap-sc avatar May 02 '24 12:05 aap-sc

@aap-sc Please, what is the relationship between the timeouts in the testsuite and the coverage? In what way does the extended timeout reduce the coverage? Thank you for explanation.

JanMatCodasip avatar May 03 '24 06:05 JanMatCodasip

I presumed that the reasoning here was that (subject to further investigation/analysis) some of the unexpected test failures/exceptions might point to problems with OpenOCD itself - in particular not sending sufficient "keep alive" packets to GDB to avoid anomalous timeouts?

FWIW at the moment I'm encountering several problems when running tests:

  • DisconnectTest: https://github.com/riscv-collab/riscv-openocd/issues/1058 - this one seems to be a non-timeout/keep alive related bug in OpenOCD.
  • MemorySampleSingle against Spike: https://github.com/riscv-collab/riscv-openocd/issues/1049#issuecomment-2088584628
  • Several other sporadic failures against Spike for which I'm currently trying to collect some data in order to see what's up.

TommyMurphyTM1234 avatar May 03 '24 07:05 TommyMurphyTM1234

@JanMatCodasip

what is the relationship between the timeouts in the testsuite and the coverage? In what way does the extended timeout reduce the coverage?

One may have an impression that the current testsuite from riscv-tests is a set of directed tests. It is not. It is more like a set of end-to-end tests that involve a rather complex communication between gdb, openocd, spike/jtag adapter.

Typical GDB command should never fail due to a timeout value even if the target is relatively slow (there are limits of course). Most ordinary users won't touch (and should not touch) remotetimeout settings.

"Reducing coverage" in this context means that after I've increased the remotetimeout for all targets we avoid a situation where OpenOCD has to send a "keep_alive" packet for operation to be successful. This just hides an underlying issues with some operations (and this was the intention till the test suite is stabilized - this is what this ticket is about).

For example, in context of https://github.com/riscv-collab/riscv-openocd/issues/1049 ( (sporadic DownloadTest failure part) the underlying issue is that when OpenOCD processes qCRC request it does not send keep_alive packets. Here is the patch to address this: https://review.openocd.org/c/openocd/+/8227

If we always have these timeouts increased, situations like the one presented above won't happen. Means it won't be tested -> coverage is reduced.

aap-sc avatar May 03 '24 07:05 aap-sc

I presumed that the reasoning here was that (subject to further investigation/analysis) some of the unexpected test failures/exceptions might point to problems with OpenOCD itself - in particular not sending sufficient "keep alive" packets to GDB to avoid anomalous timeouts?

Yes, this is precisely the reason.

aap-sc avatar May 03 '24 07:05 aap-sc

@aap-sc @TommyMurphyTM1234 Understood, thank you for explanation!

JanMatCodasip avatar May 03 '24 13:05 JanMatCodasip