net: defer self.destroy calls to nextTick
Fixes: https://github.com/nodejs/node/issues/48771
What is the problem being solved?
#48771 reports that the request object returned from http.request cannot catch 'error' events emitted when there is an immediate failure connecting to an address returned from the DNS lookup.
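A rough sketch of the reported failure mode (not the exact repro from the issue; whether the connect fails synchronously depends on the address the lookup resolves to and on the platform):

const http = require('node:http');

const req = http.request('http://example.com');
// The 'error' listener is attached right after http.request() returns, yet
// the report is that an immediate connect failure (the OS rejecting the
// resolved address straight away) emits 'error' before this handler exists,
// so the process dies with an unhandled 'error' instead of running it.
req.on('error', (err) => console.error('caught:', err.code));
req.end();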
Solution
#51038 implemented the changes suggested in https://github.com/nodejs/node/issues/48771#issuecomment-1661858958. However, #51038 could not be merged due to a lack of tests (https://github.com/nodejs/node/pull/51038#pullrequestreview-1761719500). In this PR, I apply the same fix but with tests added.
Testing Considerations
All process.nextTick(() => self.destroy()) calls are hit by the provided tests except one. Below is the self.destroy() call that is not covered:
https://github.com/nodejs/node/blob/9404d3aaaf0a8264ee08ba169ec24685bf8d487d/lib/net.js#L1113
I am not tagging this PR as "DRAFT" since the piece of code that isn't tested is for a connection timeout case.
@mcollina Please let me know if these tests are sufficient or not.
Review requested:
- [ ] @nodejs/net
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 88.48%. Comparing base (4396cf2) to head (793c722).
:warning: Report is 20 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #54857 +/- ##
==========================================
+ Coverage 88.46% 88.48% +0.02%
==========================================
Files 703 703
Lines 207368 207540 +172
Branches 39983 40020 +37
==========================================
+ Hits 183438 183639 +201
+ Misses 15933 15871 -62
- Partials 7997 8030 +33
| Files with missing lines | Coverage Δ | |
|---|---|---|
| lib/_http_client.js | 97.36% <100.00%> (+0.02%) | :arrow_up: |
CI: https://ci.nodejs.org/job/node-test-pull-request/62213/
nice. thank you and congrats on your first contribution!
Thank you, so excited to become a part of the family! @anonrig
What are the next steps then? Do I have to do anything to initiate the merge? I see that my branch is behind a few commits, should I rebase and push again?
What are the next steps then?
Running the CI and getting it to pass. We'll take care of running it; in case there are related failures, it would be up to you to fix them.
Do I have to do anything to initiate the merge? I see that my branch is behind a few commits, should I rebase and push again?
no need unless there are conflicts.
CI: https://ci.nodejs.org/job/node-test-pull-request/62310/
Hey @mcollina , thanks for your reply!
So the CI is currently failing. When I look at the details, I see two kinds of failures:
- parallel.test-http-client-immediate-error.js - The file I work on
Stack trace shows that my tests time out for some reason.
---
duration_ms: 120011.463
exitcode: -15
severity: fail
stack: |-
timeout
(node:376527) internal/test/binding: These APIs are for internal testing only. Do not use them.
(Use `node --trace-warnings ...` to show where the warning was created)
...
- parallel/test-runner-watch-mode-complex.mjs
For this file, I have no idea why it is failing. Here's the stack trace:
duration_ms: 6031.101
exitcode: 1
severity: fail
stack: |-
  ▶ test runner watch mode with more complex setup
    ✖ should run tests when a dependency changed after a watched test file being deleted (4747.503807ms)
      AssertionError [ERR_ASSERTION]: The input did not match the regular expression /tests 2/. Input:

      '✔ second test has ran (3.238638ms)\n' +
      '✔ first test has ran (4.048804ms)\n' +
      '✔ first test has ran (12.434145ms)\n' +
      'ℹ tests 3\n' +
      'ℹ suites 0\n' +
      'ℹ pass 3\n' +
      'ℹ fail 0\n' +
      'ℹ cancelled 0\n' +
      'ℹ skipped 0\n' +
      'ℹ todo 0\n' +
      'ℹ duration_ms 1312.515689\n'

          at TestContext.<anonymous> (file:///home/iojs/build/workspace/node-test-commit-linux-containered/test/parallel/test-runner-watch-mode-complex.mjs:99:12)
          at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
          at async Test.run (node:internal/test_runner/test:888:9)
          at async Promise.all (index 0)
          at async Suite.run (node:internal/test_runner/test:1268:7)
          at async startSubtestAfterBootstrap (node:internal/test_runner/harness:283:3) {
        generatedMessage: true,
        code: 'ERR_ASSERTION',
        actual: '✔ second test has ran (3.238638ms)\n✔ first test has ran (4.048804ms)\n✔ first test has ran (12.434145ms)\nℹ tests 3\nℹ suites 0\nℹ pass 3\nℹ fail 0\nℹ cancelled 0\nℹ skipped 0\nℹ todo 0\n...',
        expected: /tests 2/,
        operator: 'match'
      }

  ▶ test runner watch mode with more complex setup (4776.910571ms)
  ℹ tests 1
  ℹ suites 1
  ℹ pass 0
  ℹ fail 1
  ℹ cancelled 0
  ℹ skipped 0
  ℹ todo 0
  ℹ duration_ms 4833.662639

  ✖ failing tests:

  test at test/parallel/test-runner-watch-mode-complex.mjs:53:3
  ✖ should run tests when a dependency changed after a watched test file being deleted (4747.503807ms)
    AssertionError [ERR_ASSERTION]: The input did not match the regular expression /tests 2/. Input:

    '✔ second test has ran (3.238638ms)\n' +
    '✔ first test has ran (4.048804ms)\n' +
    '✔ first test has ran (12.434145ms)\n' +
    'ℹ tests 3\n' +
    'ℹ suites 0\n' +
    'ℹ pass 3\n' +
    'ℹ fail 0\n' +
    'ℹ cancelled 0\n' +
    'ℹ skipped 0\n' +
    'ℹ todo 0\n' +
    'ℹ duration_ms 1312.515689\n'

        at TestContext.<anonymous> (file:///home/iojs/build/workspace/node-test-commit-linux-containered/test/parallel/test-runner-watch-mode-complex.mjs:99:12)
        at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
        at async Test.run (node:internal/test_runner/test:888:9)
        at async Promise.all (index 0)
        at async Suite.run (node:internal/test_runner/test:1268:7)
        at async startSubtestAfterBootstrap (node:internal/test_runner/harness:283:3) {
      generatedMessage: true,
      code: 'ERR_ASSERTION',
      actual: '✔ second test has ran (3.238638ms)\n✔ first test has ran (4.048804ms)\n✔ first test has ran (12.434145ms)\nℹ tests 3\nℹ suites 0\nℹ pass 3\nℹ fail 0\nℹ cancelled 0\nℹ skipped 0\nℹ todo 0\n...',
      expected: /tests 2/,
      operator: 'match'
    }
There are other failing tests, but their severity is flaky, so I assume they are not a problem for my change.
I rebased my branch on top of the current main and `make -j4 test` passed. I could use suggestions on how to debug/reproduce these failures happening in CI 🤔
- parallel.test-http-client-immediate-error.js - The file I work on
Stack trace shows that my tests time out for some reason.
That's a red flag; we don't want to merge flaky tests into our codebase. A timeout probably means there's a race condition somewhere that you forgot to take into account, and it results in the test never exiting. To try to reproduce the flakiness locally, you can try running:
tools/test.py --repeat 9999 -t 2 test/parallel/test-http-client-immediate-error.js
Once you have a repro, you can attempt getting a fix ready.
Hey @aduh95, thanks for clarifying 🙏
Unfortunately the command you suggested did not reproduce the problem for me. I'm on OSX; can you think of anything else that I can try?
@mcollina FYI, this adds overhead to net to fix a bug in the "legacy" HTTP client. I don't mind per se, but it's not optimal since the bug is not actually in net.
CI is failing
Yep, it is @mcollina. Any suggestions on how to reproduce it? Any chance I can shell into the machine (or container) facing problems? Or are there any snapshots/images I can pull onto my machine and try to spin up?
In other words, what are the usual steps that you guys take when you face a problem in CI?
In this specific case, it seems that any Linux box on top of any virtualization software would do.
For more exotic systems @nodejs/build can provide access.
Could anyone approve the CI workflow please? I've rebased my work on top of latest main.
cc @ShogunPanda @anonrig
@anilhelvaci Done :)
CI: https://ci.nodejs.org/job/node-test-pull-request/67495/
The failure is related: https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel8-s390x/50094/testReport/junit/(root)/parallel/test_http_client_immediate_error/.
cc @nodejs/platform-s390 it only fails on s390 it seems
CI: https://ci.nodejs.org/job/node-test-pull-request/68149/
I don't think this is the right place to solve it. Not blocking since I don't have time to provide more meaningful feedback.
I don't think this is the right place to solve it. Not blocking since I don't have time to provide more meaningful feedback.
What do you mean by this not being the right place to solve it? Are there any other debugging processes I can follow? @ronag
Seems more like a bug in the http.request module; this is just a workaround/hack in net to solve something that is not part of net. Deferring calls to destroy is not part of the API requirements.
Seems more like a bug in the http.request module; this is just a workaround/hack in net to solve something that is not part of net. Deferring calls to destroy is not part of the API requirements.
Do you think it's possible that the deferred self.destroy calls somehow accumulate and increase the loading time of net, so that the test times out on a less powerful machine? @ronag
On second thought, this was supposedly designed to avoid this. In fact, that's why destroy() adds a microtick internally. I think there is an underlying problem here. Why does it need 2 microticks? Will we add 3 in the future?
Oh, I didn't know destroy already did that. Will look into this one. Meanwhile, I've added some console.logs to the failing test to determine where it hangs. Will squash that commit later. Can I get a request-ci please?
CI: https://ci.nodejs.org/job/node-test-pull-request/68334/
Manually started Jenkins CI for @anilhelvaci since the last commit does not have an approval (I took a look at the test and I think it's safe to test them. I'd defer to other reviewers to decide whether the changes are themselves good to land).
On second thought, this was supposedly designed to avoid this. In fact, that's why destroy() adds a microtick internally. I think there is an underlying problem here. Why does it need 2 microticks? Will we add 3 in the future?
I think the reason why we need an additional tick is (as you've spotted before @mcollina) https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/_http_client.js#L939-L943
The ClientRequest registers the listeners in the next tick, so if internalConnect (or internalConnectMultiple) fails immediately, there would be no listeners on the socket yet. I believe deferring calls to the next tick is fine, since there are multiple examples of that in the net module (a short, self-contained illustration of the ordering issue follows the links below):
If the lookup is skipped, internalConnect is deferred
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1351-L1363
If lookup fails, self.destroy is invoked in the next tick
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1417-L1430
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1557-L1559
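To make the ordering issue concrete, here is a minimal sketch with a plain EventEmitter (names are made up for illustration):

const { EventEmitter } = require('node:events');

const source = new EventEmitter();

// The consumer (like ClientRequest) registers its handlers on the next tick.
process.nextTick(() => {
  source.on('error', (err) => console.log('caught:', err.message));
});

// Emitting synchronously at this point would be an unhandled 'error',
// because the registration above has not run yet:
// source.emit('error', new Error('immediate failure'));

// Deferring the emit by one tick lets the already-queued registration run
// first, so the error is caught:
process.nextTick(() => source.emit('error', new Error('immediate failure')));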
But to @ronag's point above, I've tried to insert a temporary error handler in Socket.connect like:
if (cb !== null) {
  this.once('connect', cb);
  this.once('error', err => {
    console.log('Error caught');
    this._hadError = true;
    cb(err, this);
  });
}
The above code ended up emitting the error event more than once upon failures, so I added this._hadError = true;. It didn't quite resolve the issues I ran into across all http tests, but I believe we can work around them. The question is: is having a callback in the socket for immediate errors a valid approach or not? That, I assume, would solve the problem for other modules dealing with sockets, not just http. What do you guys think?
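To make the question concrete, here is a hedged sketch of how such callback semantics could look from a consumer's point of view (this is not the current net API; today the connect callback is only invoked on a successful connection):

const net = require('node:net');

// Hypothetical semantics: the connect callback would also receive an error
// for immediate failures, instead of an 'error' event that can fire before
// any listener exists.
const socket = new net.Socket();
socket.connect({ host: 'localhost', port: 8080 }, (err) => {
  if (err) {
    console.error('immediate connection failure:', err.message);
    return;
  }
  console.log('connected');
  socket.end();
});
// Asynchronous failures would still surface as regular 'error' events.
socket.on('error', (err) => console.error('async failure:', err.message));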
cc @joyeecheung
I still am of the opinion that there is nothing wrong with net and any change here should be done in the http client code.
I still am of the opinion that there is nothing wrong with net and any change here should be done in the http client code.
Are there any API specifications I can read, other than docs/api, to help me draw a clearer line between the modules in my head?
And one more thing, why is there process.nextTick here? @ronag
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/_http_client.js#L939-L943
The problem with adding one more nextTick is that you can always solve problems by adding "one more." Then we'll get three nextTicks to wait for another module, etc. destroy() already defers emitting the error after a nextTick, so one is already there because it has to be there due to our APIs.
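For reference, destroy(err) already defers the 'error' emit by one tick, so a listener attached synchronously right after the destroy() call still catches it; a minimal sketch:

const net = require('node:net');

const socket = new net.Socket();
socket.destroy(new Error('boom'));
// 'error' is emitted on a later tick, so this listener, attached after the
// destroy() call but within the same synchronous block, still catches it.
socket.on('error', (err) => console.log('caught:', err.message));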
The problem is in http.