net: defer self.destroy calls to nextTick
Fixes: https://github.com/nodejs/node/issues/48771
What is the problem being solved?
#48771 reports that the request object returned from http.request cannot catch 'error' events emitted when there is an immediate failure connecting to an address returned from the DNS lookup.
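A rough sketch of the reported failure mode (not the exact repro from the issue; whether the connect fails synchronously depends on the address the lookup resolves to and on the platform):

const http = require('node:http');

const req = http.request('http://example.com');
// The 'error' listener is attached right after http.request() returns, yet
// the report is that an immediate connect failure (the OS rejecting the
// resolved address straight away) emits 'error' before this handler exists,
// so the process dies with an unhandled 'error' instead of running it.
req.on('error', (err) => console.error('caught:', err.code));
req.end();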
Solution
#51038 implemented the changes suggested in https://github.com/nodejs/node/issues/48771#issuecomment-1661858958. However, #51038 could not be merged due to a lack of tests (https://github.com/nodejs/node/pull/51038#pullrequestreview-1761719500). In this PR, I apply the same fix but with tests added.
Testing Considerations
All process.nextTick(() => self.destroy()) calls are hit by the provided tests except one. Below is the self.destroy() call that is not covered:
https://github.com/nodejs/node/blob/9404d3aaaf0a8264ee08ba169ec24685bf8d487d/lib/net.js#L1113
I am not tagging this PR as "DRAFT" since the piece of code that isn't tested is for a connection timeout case.
@mcollina Please let me know if these tests are sufficient or not.
Review requested:
- [ ] @nodejs/net
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 88.48%. Comparing base (4396cf2) to head (793c722).
:warning: Report is 20 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #54857 +/- ##
==========================================
+ Coverage 88.46% 88.48% +0.02%
==========================================
Files 703 703
Lines 207368 207540 +172
Branches 39983 40020 +37
==========================================
+ Hits 183438 183639 +201
+ Misses 15933 15871 -62
- Partials 7997 8030 +33
| Files with missing lines | Coverage Δ | |
|---|---|---|
| lib/_http_client.js | 97.36% <100.00%> (+0.02%) | :arrow_up: |
CI: https://ci.nodejs.org/job/node-test-pull-request/62213/
nice. thank you and congrats on your first contribution!
Thank you, so excited to become a part of the family! @anonrig
What are the next steps then? Do I have to do anything to initiate the merge? I see that my branch is behind a few commits, should I rebase and push again?
What are the next steps then?
Running the CI and getting it to pass. We'll take care of running it; in case there are related failures, it would be up to you to fix them.
Do I have to do anything to initiate the merge? I see that my branch is behind a few commits, should I rebase and push again?
no need unless there are conflicts.
CI: https://ci.nodejs.org/job/node-test-pull-request/62310/
Hey @mcollina , thanks for your reply!
So the CI is currently failing. When I look at the details, I see two kinds of failures:
- parallel.test-http-client-immediate-error.js - The file I work on
Stack trace shows that my tests time out for some reason.
---
duration_ms: 120011.463
exitcode: -15
severity: fail
stack: |-
timeout
(node:376527) internal/test/binding: These APIs are for internal testing only. Do not use them.
(Use `node --trace-warnings ...` to show where the warning was created)
...
- parallel/test-runner-watch-mode-complex.mjs
For this file, I have no idea why it is failing. Here's the stack trace:
duration_ms: 6031.101
exitcode: 1
severity: fail
stack: |-
  ▶ test runner watch mode with more complex setup
    ✖ should run tests when a dependency changed after a watched test file being deleted (4747.503807ms)
      AssertionError [ERR_ASSERTION]: The input did not match the regular expression /tests 2/. Input:

      '✔ second test has ran (3.238638ms)\n' +
      '✔ first test has ran (4.048804ms)\n' +
      '✔ first test has ran (12.434145ms)\n' +
      'ℹ tests 3\n' +
      'ℹ suites 0\n' +
      'ℹ pass 3\n' +
      'ℹ fail 0\n' +
      'ℹ cancelled 0\n' +
      'ℹ skipped 0\n' +
      'ℹ todo 0\n' +
      'ℹ duration_ms 1312.515689\n'

          at TestContext.<anonymous> (file:///home/iojs/build/workspace/node-test-commit-linux-containered/test/parallel/test-runner-watch-mode-complex.mjs:99:12)
          at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
          at async Test.run (node:internal/test_runner/test:888:9)
          at async Promise.all (index 0)
          at async Suite.run (node:internal/test_runner/test:1268:7)
          at async startSubtestAfterBootstrap (node:internal/test_runner/harness:283:3) {
        generatedMessage: true,
        code: 'ERR_ASSERTION',
        actual: '✔ second test has ran (3.238638ms)\n✔ first test has ran (4.048804ms)\n✔ first test has ran (12.434145ms)\nℹ tests 3\nℹ suites 0\nℹ pass 3\nℹ fail 0\nℹ cancelled 0\nℹ skipped 0\nℹ todo 0\n...',
        expected: /tests 2/,
        operator: 'match'
      }

  ▶ test runner watch mode with more complex setup (4776.910571ms)
  ℹ tests 1
  ℹ suites 1
  ℹ pass 0
  ℹ fail 1
  ℹ cancelled 0
  ℹ skipped 0
  ℹ todo 0
  ℹ duration_ms 4833.662639

  ✖ failing tests:

  test at test/parallel/test-runner-watch-mode-complex.mjs:53:3
  ✖ should run tests when a dependency changed after a watched test file being deleted (4747.503807ms)
    AssertionError [ERR_ASSERTION]: The input did not match the regular expression /tests 2/. Input:

    '✔ second test has ran (3.238638ms)\n' +
    '✔ first test has ran (4.048804ms)\n' +
    '✔ first test has ran (12.434145ms)\n' +
    'ℹ tests 3\n' +
    'ℹ suites 0\n' +
    'ℹ pass 3\n' +
    'ℹ fail 0\n' +
    'ℹ cancelled 0\n' +
    'ℹ skipped 0\n' +
    'ℹ todo 0\n' +
    'ℹ duration_ms 1312.515689\n'

        at TestContext.<anonymous> (file:///home/iojs/build/workspace/node-test-commit-linux-containered/test/parallel/test-runner-watch-mode-complex.mjs:99:12)
        at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
        at async Test.run (node:internal/test_runner/test:888:9)
        at async Promise.all (index 0)
        at async Suite.run (node:internal/test_runner/test:1268:7)
        at async startSubtestAfterBootstrap (node:internal/test_runner/harness:283:3) {
      generatedMessage: true,
      code: 'ERR_ASSERTION',
      actual: '✔ second test has ran (3.238638ms)\n✔ first test has ran (4.048804ms)\n✔ first test has ran (12.434145ms)\nℹ tests 3\nℹ suites 0\nℹ pass 3\nℹ fail 0\nℹ cancelled 0\nℹ skipped 0\nℹ todo 0\n...',
      expected: /tests 2/,
      operator: 'match'
    }
There are other failing tests, but their severity is flaky, so I assume they are not a problem for my change.
I rebased my branch on top of the current main and `make -j4 test` passed. I could use suggestions on how to debug/reproduce these failures happening in CI 🤔
- parallel.test-http-client-immediate-error.js - The file I work on
Stack trace shows that my tests time out for some reason.
That's a red flag; we don't want to merge flaky tests into our codebase. A timeout probably means there's a race condition somewhere that you forgot to take into account, and it results in the test never exiting. To try to reproduce the flakiness locally, you can try running:
tools/test.py --repeat 9999 -t 2 test/parallel/test-http-client-immediate-error.js
Once you have a repro, you can attempt getting a fix ready.
Hey @aduh95, thanks for clarifying 🙏
Unfortunately the command you suggested did not reproduce the problem for me. I'm on OSX; can you think of anything else that I can try?
@mcollina FYI, this adds overhead to net to fix a bug in the "legacy" HTTP client. I don't mind per se, but it's not optimal since the bug is not actually in net.
CI is failing
Yep, it is @mcollina. Any suggestions on how to reproduce it? Any chance I can shell into the machine (or container) facing problems? Or are there any snapshots/images I can pull onto my machine and try to spin up?
In other words, what are the usual steps that you guys take when you face a problem in CI?
In this specific case, it seems that any Linux box on top of any virtualization software would do.
For more exotic systems @nodejs/build can provide access.
Could anyone approve the CI workflow please? I've rebased my work on top of latest main.
cc @ShogunPanda @anonrig
@anilhelvaci Done :)
CI: https://ci.nodejs.org/job/node-test-pull-request/67495/
The failure is related: https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel8-s390x/50094/testReport/junit/(root)/parallel/test_http_client_immediate_error/.
cc @nodejs/platform-s390 it only fails on s390 it seems
CI: https://ci.nodejs.org/job/node-test-pull-request/68149/
I don't think this is the right place to solve it. Not blocking since I don't have time to provide more meaningful feedback.
I don't think this is the right place to solve it. Not blocking since I don't have time to provide more meaningful feedback.
What do you mean by this not being the right place to solve it? Are there any other debugging processes I can follow? @ronag
Seems more like a bug in the http.request module; this is just a workaround/hack in net to solve something that is not part of net. Deferring calls to destroy is not part of the API requirements.
Seems more like a bug in the http.request module; this is just a workaround/hack in net to solve something that is not part of net. Deferring calls to destroy is not part of the API requirements.
Do you think it's possible that the deferred self.destroy calls somehow accumulate and increase the loading time of net, so that the test times out on a less powerful machine? @ronag
On second thought, this was supposedly designed to avoid this. In fact, that's why destroy() adds a microtick internally. I think there is an underlying problem here. Why does it need 2 microticks? Will we add 3 in the future?
Oh, I didn't know destroy already did that. Will look into this one. Meanwhile, I've added some console.logs to the failing test to determine where it hangs. Will squash that commit later. Can I get a request-ci please?
CI: https://ci.nodejs.org/job/node-test-pull-request/68334/
Manually started Jenkins CI for @anilhelvaci since the last commit does not have an approval (I took a look at the test and I think it's safe to test them. I'd defer to other reviewers to decide whether the changes are themselves good to land).
On second thought, this was supposedly designed to avoid this. In fact, that's why destroy() adds a microtick internally. I think there is an underlying problem here. Why does it need 2 microticks? Will we add 3 in the future?
I think the reason why we need an additional tick is (as you've spotted before @mcollina) https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/_http_client.js#L939-L943
The ClientRequest registers the listeners in the next tick, so if internalConnect (or internalConnectMultiple) fails immediately, there would be no listeners on the socket yet. I believe deferring calls to the next tick is fine, since there are multiple examples of that in the net module (a short, self-contained illustration of the ordering issue follows the links below):
If the lookup is skipped, internalConnect is deferred
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1351-L1363
If lookup fails, self.destroy is invoked in the next tick
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1417-L1430
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/net.js#L1557-L1559
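To make the ordering issue concrete, here is a minimal sketch with a plain EventEmitter (names are made up for illustration):

const { EventEmitter } = require('node:events');

const source = new EventEmitter();

// The consumer (like ClientRequest) registers its handlers on the next tick.
process.nextTick(() => {
  source.on('error', (err) => console.log('caught:', err.message));
});

// Emitting synchronously at this point would be an unhandled 'error',
// because the registration above has not run yet:
// source.emit('error', new Error('immediate failure'));

// Deferring the emit by one tick lets the already-queued registration run
// first, so the error is caught:
process.nextTick(() => source.emit('error', new Error('immediate failure')));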
But to @ronag's point above, I've tried to insert a temporary error handler in Socket.connect like:
if (cb !== null) {
  this.once('connect', cb);
  this.once('error', err => {
    console.log('Error caught');
    this._hadError = true;
    cb(err, this);
  });
}
The above code ended up emitting the error event more than once upon failures, so I added this._hadError = true;. It didn't quite resolve the issues I ran into across all http tests, but I believe we can work around them. The question is: is having a callback in the socket for immediate errors a valid approach or not? That, I assume, would solve the problem for other modules dealing with sockets, not just http. What do you guys think?
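To make the question concrete, here is a hedged sketch of how such callback semantics could look from a consumer's point of view (this is not the current net API; today the connect callback is only invoked on a successful connection):

const net = require('node:net');

// Hypothetical semantics: the connect callback would also receive an error
// for immediate failures, instead of an 'error' event that can fire before
// any listener exists.
const socket = new net.Socket();
socket.connect({ host: 'localhost', port: 8080 }, (err) => {
  if (err) {
    console.error('immediate connection failure:', err.message);
    return;
  }
  console.log('connected');
  socket.end();
});
// Asynchronous failures would still surface as regular 'error' events.
socket.on('error', (err) => console.error('async failure:', err.message));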
cc @joyeecheung
I still am of the opinion that there is nothing wrong with net and any change here should be done in the http client code.
I still am of the opinion that there is nothing wrong with net and any change here should be done in the http client code.
Are there any API specifications I can read, other than docs/api, to help me draw a clearer line between the modules in my head?
And one more thing, why is there process.nextTick here? @ronag
https://github.com/nodejs/node/blob/ff11b5956954e96e7b0917c83ef0d901adf62dcf/lib/_http_client.js#L939-L943
The problem with adding one more nextTick is that you can always solve problems by adding "one more." Then we'll get three nextTicks to wait for another module, etc. destroy() already defers emitting the error after a nextTick, so one is already there because it has to be there due to our APIs.
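For reference, destroy(err) already defers the 'error' emit by one tick, so a listener attached synchronously right after the destroy() call still catches it; a minimal sketch:

const net = require('node:net');

const socket = new net.Socket();
socket.destroy(new Error('boom'));
// 'error' is emitted on a later tick, so this listener, attached after the
// destroy() call but within the same synchronous block, still catches it.
socket.on('error', (err) => console.log('caught:', err.message));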
The problem is in http.