increase spinquic watchdog timeout

Open ProjectsByJackHe opened this issue 2 weeks ago • 6 comments

Description

As discussed in issue #5491, the logs show that the watchdog assert is firing. For now, let's increase the timeout by 100%.

Testing

CI

Documentation

N/A

ProjectsByJackHe, Dec 09 '25 21:12

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 85.64%. Comparing base (4e84609) to head (18ddaa0). :warning: Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5647      +/-   ##
==========================================
- Coverage   86.34%   85.64%   -0.71%     
==========================================
  Files          60       60              
  Lines       18663    18663              
==========================================
- Hits        16114    15983     -131     
- Misses       2549     2680     +131     

codecov[bot], Dec 09 '25 22:12

I am not that familiar with the spin test, but isn't this only going to cause the spin test to run for a longer time? From a quick look at the sources, the time you changed controls the time spent spinning, and there is a WATCHDOG_WIGGLE_ROOM that gives the watchdog a bit of extra time.
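To make the distinction concrete, here is a minimal sketch of the pattern, not the actual SpinQuic code. It assumes the watchdog deadline is the spin time plus a fixed wiggle room; the `Watchdog` class, `WatchdogSlack` helper, and all values are hypothetical and only illustrate why raising the spin time alone does not give the watchdog more slack.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

using std::chrono::milliseconds;

// Hypothetical watchdog (not SpinQuic's implementation): the work is
// expected to finish within run_time + wiggle_room.
class Watchdog {
public:
    Watchdog(milliseconds run_time, milliseconds wiggle_room)
        : deadline_(run_time + wiggle_room) {}

    // Called by the worker once it has finished spinning and cleaning up.
    void SignalDone() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
    }

    // Returns true if the worker signaled before the deadline,
    // false if the watchdog would have fired.
    bool WaitForCompletion() {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, deadline_, [this] { return done_; });
    }

    milliseconds deadline() const { return deadline_; }

private:
    milliseconds deadline_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Extra time the watchdog grants beyond the configured spin time.
// Under the assumption above, this depends only on the wiggle room.
inline milliseconds WatchdogSlack(milliseconds run_time,
                                  milliseconds wiggle_room) {
    return Watchdog(run_time, wiggle_room).deadline() - run_time;
}
```

Under that assumption, doubling `run_time` makes the test spin twice as long, but `WatchdogSlack` stays the same, so work that overruns the end of the spin phase would still trip the watchdog.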

guhetier, Dec 09 '25 22:12

> I am not that familiar with the spin test, but isn't this only going to cause the spin test to run for a longer time? From a quick look at the sources, the time you changed controls the time spent spinning, and there is a WATCHDOG_WIGGLE_ROOM that gives the watchdog a bit of extra time.

Yes! Good catch.

ProjectsByJackHe, Dec 10 '25 01:12

Did you investigate, based on the traces, what was pending when the timeout fired? 2 / 3 seconds is already quite a lot. It is possible that something was delayed on a slow VM, but it is also possible that a softlock / deadlock was happening in MsQuic.

guhetier, Dec 10 '25 17:12

> Did you investigate, based on the traces, what was pending when the timeout fired? 2 / 3 seconds is already quite a lot. It is possible that something was delayed on a slow VM, but it is also possible that a softlock / deadlock was happening in MsQuic.

Based on the ETL trace from the link I added in the issue, I couldn't find any deadlocks happening. There are comments in SpinQuic itself noting that certain code paths will lead to deadlocks, but those paths are all disabled.

ProjectsByJackHe, Dec 10 '25 19:12

OK. This might help, but I suspect going from 2 sec to 3 sec won't be a definitive fix. We should make sure dumps are collected so that next time we can check the state of the pending threads.

guhetier, Dec 10 '25 21:12