
test: mark `test-net-write-fully-async-buffer` as flaky

aduh95 opened this issue 1 year ago • 17 comments

Refs: https://github.com/nodejs/node/issues/47428

aduh95 avatar May 12 '24 10:05 aduh95

Did we try to investigate? It is not flaky on my machines.

lpinca avatar May 13 '24 19:05 lpinca

On macOS, I actually get a failure rate of about 1 ‰ when running:

python3 tools/test.py -J --repeat=1000 test/parallel/test-net-write-fully-async-buffer.js

and about 16 ‰ when running:

python3 tools/test.py -J --repeat=1000 test/parallel/test-net-write-fully-async-hex-string.js

lpinca avatar May 13 '24 20:05 lpinca

Something is fishy here. I get no failures if I remove the common module.

lpinca avatar May 14 '24 15:05 lpinca

> Something is fishy here. I get no failures if I remove the common module.

https://github.com/nodejs/node/issues/52964#issuecomment-2106357191 also reported that removing the `../common` import removes the flakiness of a different test (i.e. the test no longer times out).

aduh95 avatar May 15 '24 08:05 aduh95

Bisecting points to 1d29d81c69a5e03d99a3d3e597bc0eeed433e47d as the first bad commit.
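
For reference, the workflow is roughly the usual `git bisect` loop (a sketch only; the good/bad endpoints and repeat count below are illustrative, not the exact commands that were run):

```sh
# Sketch of the bisect loop; endpoints and repeat count are illustrative.
git bisect start
git bisect bad HEAD
git bisect good <known-good-commit>
# At each step, rebuild and stress the test, then mark the result:
python3 tools/test.py -J --repeat=1000 test/parallel/test-net-write-fully-async-buffer.js
git bisect good   # or `git bisect bad`, depending on whether failures occurred
```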

lpinca avatar May 16 '24 06:05 lpinca

I think there is a recent bug in V8 or Node.js behind the flakiness of some tests. This is one of those. I think we should not land this.

lpinca avatar May 16 '24 14:05 lpinca

> Something is fishy here. I get no failures if I remove the common module.

> #52964 (comment) also reported that removing the `../common` import would remove the flakiness of a different test (i.e. the test no longer times out)

Maybe something to do with the exit hooks common installs?
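
For context, a minimal sketch of the kind of exit hook `common` installs (illustrative only, not the actual code in `test/common/index.js`):

```js
// Illustrative sketch: common registers hooks along these lines so that
// expectations (e.g. common.mustCall()) are verified when the process exits.
const pendingChecks = [];

function mustCall(fn = () => {}) {
  let called = false;
  pendingChecks.push(() => {
    if (!called) throw new Error('mustCall() callback was never invoked');
  });
  return (...args) => {
    called = true;
    return fn(...args);
  };
}

process.on('exit', () => {
  // Runs synchronously during shutdown; this is the JavaScript that gets
  // executed right before Node.js exits.
  for (const check of pendingChecks) check();
});
```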

richardlau avatar May 16 '24 14:05 richardlau

@richardlau I tried commenting out almost everything in common and failures still occur. Note that, as per git bisect, no failures occur at 07f481cfcf1153336534dc5d3cd4c5c9770a72be, which is the parent commit of https://github.com/nodejs/node/commit/1d29d81c69a5e03d99a3d3e597bc0eeed433e47d.

lpinca avatar May 16 '24 15:05 lpinca

I meant that it's possible the V8 update caused something to change during shutdown. common registers JavaScript code to be run prior to Node.js exiting. If you've commented out the registering of those hooks, then it's probably something else.

richardlau avatar May 16 '24 15:05 richardlau

If you are referring to process.on('exit', handler), then yes, I've commented those out.

lpinca avatar May 16 '24 15:05 lpinca

The behavior described in https://github.com/nodejs/node/issues/52964#issuecomment-2106317993 also occurs for this test. The test finishes correctly and the server emits the 'close' event, but the process does not exit.
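
One way to see what keeps the event loop alive at that point (a diagnostic sketch, not part of the test; it assumes `server` is the `net.Server` created by the test and requires Node.js ≥ 17.3.0 for process.getActiveResourcesInfo()):

```js
// Diagnostic sketch: after the server has closed, log which resources are
// still keeping the event loop (and therefore the process) alive.
server.on('close', () => {
  console.log('active resources:', process.getActiveResourcesInfo());
  setImmediate(() => {
    console.log('still active:', process.getActiveResourcesInfo());
  });
});
```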

lpinca avatar May 16 '24 15:05 lpinca

Adding the --jitless flag fixes the issue.
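
For anyone who wants to reproduce that, a sketch of how to stress the test with the flag (assuming a locally built `node` in the repository root):

```sh
# Sketch: run the test repeatedly with JIT compilation disabled.
for i in $(seq 1 1000); do
  ./node --jitless test/parallel/test-net-write-fully-async-buffer.js || echo "failed on run $i"
done
```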

lpinca avatar May 16 '24 19:05 lpinca

Commit Queue failed
- Loading data for nodejs/node/pull/52959
✔  Done loading data for nodejs/node/pull/52959
----------------------------------- PR info ------------------------------------
Title      test: mark `test-net-write-fully-async-buffer` as flaky (#52959)
   ⚠  Could not retrieve the email or name of the PR author's from user's GitHub profile!
Branch     aduh95:test-net-write-fully-async-buffer-flaky -> nodejs:main
Labels     test, needs-ci, commit-queue-squash
Commits    2
 - test: mark `test-net-write-fully-async-buffer` as flaky
 - Update test/parallel/parallel.status
Committers 2
 - Antoine du Hamel 
 - GitHub 
PR-URL: https://github.com/nodejs/node/pull/52959
Refs: https://github.com/nodejs/node/issues/47428
Reviewed-By: Benjamin Gruenbaum 
Reviewed-By: Ulises Gascón 
------------------------------ Generated metadata ------------------------------
PR-URL: https://github.com/nodejs/node/pull/52959
Refs: https://github.com/nodejs/node/issues/47428
Reviewed-By: Benjamin Gruenbaum 
Reviewed-By: Ulises Gascón 
--------------------------------------------------------------------------------
   ℹ  This PR was created on Sun, 12 May 2024 10:10:41 GMT
   ✔  Approvals: 2
   ✔  - Benjamin Gruenbaum (@benjamingr) (TSC): https://github.com/nodejs/node/pull/52959#pullrequestreview-2052433668
   ✔  - Ulises Gascón (@UlisesGascon): https://github.com/nodejs/node/pull/52959#pullrequestreview-2060894926
   ✔  Last GitHub CI successful
   ✘  No Jenkins CI runs detected
--------------------------------------------------------------------------------
   ✔  Aborted `git node land` session in /home/runner/work/node/node/.ncu
https://github.com/nodejs/node/actions/runs/9424968514

nodejs-github-bot avatar Jun 08 '24 01:06 nodejs-github-bot

Is this test still flaky? If so, is this landable?

avivkeller avatar Aug 08 '24 02:08 avivkeller

Yes, it is still flaky, and so is test-net-write-fully-async-hex-string. I would still prefer not to land this, so that the issue is not forgotten, unless the failure rate becomes intolerable.

lpinca avatar Aug 08 '24 05:08 lpinca

I have been investigating this issue as it has been failing a lot recently; here are my two quick findings:

  • I think this may not be a recent V8 bug, as the flakiness was first reported in April 2023.
  • 1000 runs may not be sufficient; I start to see reliable flakiness when I bump the runs to 10k (see the command below).
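
For example (same invocation as earlier in the thread, with a higher repeat count):

```sh
python3 tools/test.py -J --repeat=10000 test/parallel/test-net-write-fully-async-buffer.js
```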

jakecastelli avatar Aug 08 '24 12:08 jakecastelli

Testing locally I spotted an interesting result. Posted this originally over on #54809 ...

Running some local tests on test-net-write-fully-async-hex-string.js, here are the results I'm getting locally.

If I change the allocation size of the Buffer that is written, the likelihood of hitting the timeout changes.

| Allocation Size | Average Timeouts per 10000 repetitions | Runs | Timeout |
| --- | --- | --- | --- |
| 65600 | 8 | 10 | 60s |
| 65580 | 4 | 10 | 60s |
| 65570 | 0 | 10 | 60s |
| 65560 | 0 | 10 | 60s |
| 65550 | 0 | 10 | 60s |

So far this is the only variable that I've been able to identify that reliably alters the rate of timeout on this particular test.

The reason for choosing 65550 is that it is fairly close to (but still over) the default high water mark for the stream. Setting the highWaterMark to 65600 appears to reduce the rate of timeouts when using 65600 as the allocation size, but does not eliminate them.

This makes me think that the tests that are timing out are hitting some intersection between V8 GC and streams backpressure management -- I'm unable to reproduce any timeouts for allocations smaller than the default high water mark size.
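
To make the backpressure angle concrete, here is a small standalone sketch (not the actual test) that writes a chunk slightly larger than the socket's high water mark and observes `write()` signalling backpressure; the 65600 figure mirrors the table above and is illustrative:

```js
// Standalone sketch (not the actual test): write a buffer slightly larger
// than the socket's high water mark so that write() reports backpressure,
// the code path suspected of interacting badly with GC here.
const net = require('net');

const CHUNK_SIZE = 65600; // just over the high water mark referenced above

const server = net.createServer((socket) => {
  socket.resume(); // drain incoming data
  socket.on('end', () => server.close());
});

server.listen(0, () => {
  const client = net.connect(server.address().port, () => {
    console.log('writableHighWaterMark:', client.writableHighWaterMark);
    const ok = client.write(Buffer.alloc(CHUNK_SIZE, 'x'));
    console.log('write() returned', ok); // typically false => backpressure engaged
    if (ok) {
      client.end();
    } else {
      client.once('drain', () => client.end());
    }
  });
});
```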

/cc @mcollina @ronag @anonrig for visibility

jasnell avatar Sep 09 '24 13:09 jasnell