test: mark `test-net-write-fully-async-buffer` as flaky
Refs: https://github.com/nodejs/node/issues/47428
Did we try to investigate? It is not flaky on my machines.
On macOS, I actually get a 1 ‰ failure rate when running:
python3 tools/test.py -J --repeat=1000 test/parallel/test-net-write-fully-async-buffer.js
and a 16 ‰ failure rate when running:
python3 tools/test.py -J --repeat=1000 test/parallel/test-net-write-fully-async-hex-string.js
Something is fishy here. I get no failures if I remove the `common` module.
https://github.com/nodejs/node/issues/52964#issuecomment-2106357191 also reported that removing the `../common` import would remove the flakiness of a different test (i.e. the test no longer times out).
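As an aside, a quick way to see what loading `../common` sets up as a side effect is to count the process-wide hooks before and after requiring it. This is purely an illustrative sketch (run it from a file inside `test/parallel` so the relative path resolves; the delta may well be zero if the hooks are only registered lazily):

```js
'use strict';
// Illustrative probe (not part of any test): count the process-wide 'exit' hooks
// that loading test/common installs as a side effect. Place this file inside
// test/parallel so the relative require path resolves. The delta may be zero if
// the hooks are only registered lazily (e.g. on first mustCall() use).
const before = process.listenerCount('exit');
require('../common');
const after = process.listenerCount('exit');
console.log(`'exit' listeners added by ../common: ${after - before}`);
```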
Bisecting points to 1d29d81c69a5e03d99a3d3e597bc0eeed433e47d as the first bad commit.
I think there is a recent bug in V8 or Node.js behind the flakiness of some tests. This is one of those. I think we should not land this.
> Something is fishy here. I get no failures if I remove the `common` module. [#52964 (comment)](https://github.com/nodejs/node/issues/52964#issuecomment-2106357191) also reported that removing the `../common` import would remove the flakiness of a different test (i.e. the test no longer times out)
Maybe something to do with the exit hooks common installs?
@richardlau I tried commenting out almost everything in `common` and failures still occur. Note that, per `git bisect`, no failures occur at 07f481cfcf1153336534dc5d3cd4c5c9770a72be, which is the parent commit of https://github.com/nodejs/node/commit/1d29d81c69a5e03d99a3d3e597bc0eeed433e47d.
I meant that it's possible that the V8 update has caused something to change during shutdown. `common` registers JavaScript code to be run just before Node.js exits. If you've commented out the registration of those hooks, then it's probably something else.
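For anyone following along, here is a minimal sketch of the kind of hook being discussed. This is only an approximation of the `mustCall()` bookkeeping; the real implementation in `test/common/index.js` is more elaborate:

```js
'use strict';
// Approximate sketch of the mustCall()-style exit hook that test/common installs.
// The point is that JavaScript runs during the 'exit' phase of shutdown.
const mustCallChecks = [];

function mustCall(fn = () => {}, expected = 1) {
  const context = { actual: 0, expected, stack: new Error().stack };
  mustCallChecks.push(context);
  return function(...args) {
    context.actual++;
    return Reflect.apply(fn, this, args);
  };
}

process.on('exit', () => {
  // Executed right before the process exits: fail if any wrapped callback
  // was invoked fewer times than expected.
  for (const { actual, expected, stack } of mustCallChecks) {
    if (actual < expected) {
      console.error(`mustCall() callback invoked ${actual}/${expected} times\n${stack}`);
      process.exitCode = 1;
    }
  }
});

module.exports = { mustCall };
```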
If you are referring to `process.on('exit', handler)`, then yes, I've commented those out as well.
The behavior described in https://github.com/nodejs/node/issues/52964#issuecomment-2106317993 also occurs for this test. The test finishes correctly and the server emits the 'close' event, but the process does not exit.
Adding the `--jitless` flag fixes the issue.
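In case anyone else wants to poke at the hang, here is a small standalone sketch of one possible diagnostic (illustrative only, not the actual test): `process.getActiveResourcesInfo()` is the documented API for listing what keeps the event loop alive, and the same file can be run with and without `--jitless` for comparison.

```js
'use strict';
// Standalone diagnostic sketch (not the actual test): close a trivial TCP server,
// then report which resource types are still keeping the event loop alive.
const net = require('net');

const server = net.createServer((socket) => {
  socket.resume();                       // discard whatever the client sends
  socket.on('end', () => socket.end());  // half-close back once the client is done
});

server.listen(0, () => {
  const client = net.connect(server.address().port);
  client.end('hello');                   // send a little data, then half-close
  client.on('close', () => server.close());
});

server.on('close', () => {
  setImmediate(() => {
    // Documented API (Node.js >= 17.3.0): names of resources keeping the loop alive.
    console.log('still active after server close:', process.getActiveResourcesInfo());
  });
});
```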
Commit Queue failed
    - Loading data for nodejs/node/pull/52959
       ✔  Done loading data for nodejs/node/pull/52959
    ----------------------------------- PR info ------------------------------------
    Title      test: mark `test-net-write-fully-async-buffer` as flaky (#52959)
       ⚠  Could not retrieve the email or name of the PR author's from user's GitHub profile!
    Branch     aduh95:test-net-write-fully-async-buffer-flaky -> nodejs:main
    Labels     test, needs-ci, commit-queue-squash
    Commits    2
     - test: mark `test-net-write-fully-async-buffer` as flaky
     - Update test/parallel/parallel.status
    Committers 2
     - Antoine du Hamel
     - GitHub
    PR-URL: https://github.com/nodejs/node/pull/52959
    Refs: https://github.com/nodejs/node/issues/47428
    Reviewed-By: Benjamin Gruenbaum
    Reviewed-By: Ulises Gascón
    ------------------------------ Generated metadata ------------------------------
    PR-URL: https://github.com/nodejs/node/pull/52959
    Refs: https://github.com/nodejs/node/issues/47428
    Reviewed-By: Benjamin Gruenbaum
    Reviewed-By: Ulises Gascón
    --------------------------------------------------------------------------------
       ℹ  This PR was created on Sun, 12 May 2024 10:10:41 GMT
       ✔  Approvals: 2
       ✔  - Benjamin Gruenbaum (@benjamingr) (TSC): https://github.com/nodejs/node/pull/52959#pullrequestreview-2052433668
       ✔  - Ulises Gascón (@UlisesGascon): https://github.com/nodejs/node/pull/52959#pullrequestreview-2060894926
       ✔  Last GitHub CI successful
       ✘  No Jenkins CI runs detected
    --------------------------------------------------------------------------------
       ✔  Aborted `git node land` session in /home/runner/work/node/node/.ncu

https://github.com/nodejs/node/actions/runs/9424968514
Is this test still flaky? If so, is this landable?
Yes, it is still flaky, and so is `test-net-write-fully-async-hex-string`. I would still prefer not to land this, so we don't forget about the underlying issue, unless the failure rate becomes intolerable.
I have been investigating this issue as it has been failing a lot recently; here are two quick findings:
- I think this may not be a recent V8 bug, as the flakiness was first reported in April 2023
- 1000 runs may not be sufficient; I start to see reliable flakiness when I bump the repetitions to 10k
Testing locally, I spotted an interesting result. I posted this originally over on #54809.
Running some local tests on `test-net-write-fully-async-hex-string.js`, here are the results I'm getting:
Changing the allocation size of the Buffer that is written changes the likelihood of hitting the timeout.
| Allocation size (bytes) | Avg. timeouts per 10,000 repetitions | Runs (of 10,000 repetitions each) | Test timeout |
|---|---|---|---|
| 65600 | 8 | 10 | 60s |
| 65580 | 4 | 10 | 60s |
| 65570 | 0 | 10 | 60s |
| 65560 | 0 | 10 | 60s |
| 65550 | 0 | 10 | 60s |
So far this is the only variable that I've been able to identify that reliably alters the rate of timeout on this particular test.
The reason for choosing 65550 is that it is fairly close to (but still over) the default high water mark for the stream. Setting the `highWaterMark` to 65600 appears to reduce the rate of timeouts when using 65600 as the allocation size, but does not eliminate them.
This makes me think that the timeouts are due to some interaction between V8 GC and the stream's backpressure management; I'm unable to reproduce any timeouts with allocations smaller than the default high water mark.
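To make that hypothesis more concrete, here is a rough standalone approximation of the suspected pattern. This is not the real test file; it assumes a default writable `highWaterMark` of 65536 bytes (which matches the numbers above) and simply relies on the write callback firing:

```js
'use strict';
// Rough approximation of the suspected pattern (not the real test): a single write
// just above the assumed 65536-byte default writable highWaterMark, so completion
// of the write callback depends on the backpressure/drain machinery.
const net = require('net');

const ALLOCATION_SIZE = 65600; // just over the default high water mark

const server = net.createServer((socket) => {
  socket.resume();             // discard incoming data
  socket.on('end', () => {
    socket.end();
    server.close();
  });
});

server.listen(0, () => {
  const client = net.connect(server.address().port);
  const chunk = Buffer.alloc(ALLOCATION_SIZE, 'x');
  client.write(chunk, () => {
    // If this callback never runs, the result is exactly the test-style timeout.
    console.log('write completed');
    client.end();
  });
});
```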
/cc @mcollina @ronag @anonrig for visibility