node icon indicating copy to clipboard operation
node copied to clipboard

Revert "net: enable autoSelectFamily by default"

Open mcollina opened this issue 2 years ago • 38 comments

This reverts commit 8b51c1a8696ae263141d18b39cbd144d2866f7c7.

I think we were too eager to turn this on. In the Fastify community, we have been trying to get our tests to pass on Node v20 for a month and a half with minimal luck. The only solution was to use NODE_OPTIONS=no-network-family-autoselection.

Note that tests spuriously fail even with v20.3.0, so the fixes are not complete. I was not able to reproduce the problem individually. My 2 cents is that there is a race condition somewhere triggered by many open sockets in a short period of time.

mcollina avatar Jun 09 '23 12:06 mcollina

Review requested:

  • [ ] @nodejs/net

nodejs-github-bot avatar Jun 09 '23 12:06 nodejs-github-bot

cc @nodejs/releasers @nodejs/tsc

This is technically semver-major but it's useful only if it is backported to Node v20.

mcollina avatar Jun 09 '23 12:06 mcollina

We might want to land this only to v20.x.

mcollina avatar Jun 09 '23 12:06 mcollina

CI: https://ci.nodejs.org/job/node-test-pull-request/52179/

nodejs-github-bot avatar Jun 09 '23 13:06 nodejs-github-bot

For context, there are related discussions in https://github.com/nodejs/node/issues/47644, #48028, and https://github.com/nodejs/node/issues/48145. @silverwind suggested reverting in https://github.com/nodejs/node/issues/48145, and other collaborators have mentioned that they'd have preferred not to enable Happy Eyeballs in the first place. However, according to @ShogunPanda, all known bugs (aside from deviating from the RFC) are fixed in 20.3.0. Do the fastify issues persist with 20.3.0 @mcollina?

tniessen avatar Jun 09 '23 13:06 tniessen

Yes, it persist with 20.3.0. But it seems to be a different (possibly partially unrelated) issue as none of the known bugs seems to surface up. I'm debugging this as we speak.

ShogunPanda avatar Jun 09 '23 14:06 ShogunPanda

About having this enabled by default: I'm a bit conflicted on this. Regardless of the implementation, I wonder if will cause more harm than good to users.

To clarify, it is good that we had lot of bugs reports and we were able to fix them. But I start thinking a feature like this has too many network implications that probably we should have it disable by default and always have user explicitly opt-in when they need it.

ShogunPanda avatar Jun 09 '23 14:06 ShogunPanda

My preference would be to revert only for v20 and working out the remaining issues on v21+ (https://github.com/nodejs/node/issues/48145 would be a must-have for me).

silverwind avatar Jun 09 '23 14:06 silverwind

I don't understand the parallel connection request. According to Happy Eyeballs RFC 8305, section 5, connections attempts SHOULD NOT be made in parallel. Why do we want to deviate from this?

ShogunPanda avatar Jun 09 '23 14:06 ShogunPanda

I think you are misinterpreting the RFC. It just says they should not start in parallel (meaning true concurrency, e.g. multiple CPUs), but they can and should be pending at the same time. Essentially the connections are raced against each other. At least that is the whole point of Happy Eyeballs to my understanding and Wikipedia seems to agree.

silverwind avatar Jun 09 '23 14:06 silverwind

I'm not quite sure about it. The RFC complete sentence is:

In order to avoid unreasonable network load, connection attempts SHOULD NOT be made simultaneously.

So it's not referring to just resources on the originating machine. If we implement like that, each new connection will result in at least two connection attempts. This will affect source machine, routers, ISP and so forth.

ShogunPanda avatar Jun 09 '23 14:06 ShogunPanda

I'm not quite sure about it. The RFC complete sentence is:

In order to avoid unreasonable network load, connection attempts SHOULD NOT be made simultaneously.

I think the RFC authors mean that multiple connection attempts should not be started simultaneously. A good implementation should base the delay between attempts on the SYN retransmission, so at most two SYN packets would be sent at a time in a proper parallel implementation.

@silverwind is right. The important sentence from the RFC is this:

Starting a new connection attempt does not affect previous attempts, as multiple connection attempts may occur in parallel.

tniessen avatar Jun 09 '23 16:06 tniessen

The discussion around parallel connections should happen in https://github.com/nodejs/node/issues/48145. As far as this PR goes, there are test failures, so presumably either more commits need to be reverted or additional commits to fix the tests.

richardlau avatar Jun 09 '23 16:06 richardlau

@richardlau Sure, we will move the discussion there.

@mcollina I checked your fastify failure and I made several tests using main branch. The problems seems not to be in family autoselection (as I manually disabled in the code) but in the presence of NODE_OPTIONS being set to any value. I know it makes no sense and I kinda hope somebody proves me wrong. As a proof of this, you can have a green run by running NODE_OPTIONS=whatever tap

ShogunPanda avatar Jun 10 '23 05:06 ShogunPanda

I confirmed in private with @mcollina. The fastify failures are unrelated to family auto selection and it seems to be a weird combination between the tap ecosystem, Node and NODE_OPTIONS.

Do we want to close this PR or do we still want to proceed?

ShogunPanda avatar Jun 10 '23 10:06 ShogunPanda

I don't have a strong opinion. The current implementation is flawed, apparently both due to technical reasons and because of disagreement w.r.t. how the RFC should be interpreted, and most importantly, it does introduce new network problems. On the other hand, it also solves some historic (and somehow still relevant) network problems. Myself and others would have preferred autoSelectFamily to be opt-in instead of opt-out to keep the stable and predictable pre-20 behavior by default, but others approved #46790 so 🤷.

tniessen avatar Jun 10 '23 11:06 tniessen

According to https://github.com/KararTY/dank-twitch-irc/issues/13#issuecomment-1589118702, the ERR_INTERNAL_ASSERTION bug still exists in 20.3.0 despite repeated claims that this release would fix all known bugs. Every day, users open new duplicate bug reports (three within the last two hours alone). @ShogunPanda suggested in https://github.com/nodejs/node/issues/47644#issuecomment-1579958420:

The best course of action, IMVHO, would be to wait for 20.3.0 to settle a little bit and see if bug reports stop. If not we disable it.

Does this imply that you are in favor of disabling the feature @ShogunPanda?

tniessen avatar Jun 13 '23 11:06 tniessen

@tniessen To be honest no.

I checked https://github.com/KararTY/dank-twitch-irc/issues/13#issuecomment-1589118702 and I could not reproduce the issue. All the other bug reports you mentioned are lacking the version so it's impossible to see whether they were in 20.3.0 or not.

Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this?

ShogunPanda avatar Jun 13 '23 13:06 ShogunPanda

Other than the annoying duplicate bug reports, may I ask you why are you so eager in disabling this?

Because each "annoying duplicate bug report", as you put it, represents at least one but potentially even a large number of negatively affected users, whereas we have no reliable metrics for the benefits of the feature. (To be fair, I am no expert in this matter so I tend to rely on others' expertise, and I have yet to meet someone more knowledgeable than Ben, for example, who was against enabling this by default.)

tniessen avatar Jun 13 '23 14:06 tniessen

I see. In my opinion we should try to solve the bugs around the issue rather than "hiding it under the carpet" (which is what I fear will happen if we disabled it by default, as I'm not able to reproduce these issues locally and thus those report are really valuable).

ShogunPanda avatar Jun 13 '23 14:06 ShogunPanda

In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0. But in retrospect I think we should've reverted this a month ago. When we notice any large-scale regression, it's usually never really worth waiting for a non-trivial fix. Reverting and do a patch release ASAP and then wait for a fix to reland is almost always better. The more we wait the more users get frustrated and will have to add workarounds to their code/environment which are meant to go away.

joyeecheung avatar Jun 13 '23 14:06 joyeecheung

I couldn't agree more @joyeecheung. We should have reverted much earlier and postponed to Node.js 21, which also won't become LTS, or perhaps indefinitely.

In this case I think it might still be okay to wait for maybe a week to confirm that this still reproduces on v20.3.0

It has been confirmed, see https://github.com/KararTY/dank-twitch-irc/issues/13#issuecomment-1591251579. But I guess the intention is to fix yet another bug and hope that that finally resolves the issue.

tniessen avatar Jun 14 '23 13:06 tniessen

@tniessen I already fixed that bug. AFAIK there is only https://github.com/npm/cli/issues/6409 (which I'm working on at the moment) and then I have nothing else reported. Am I missing any bug?

ShogunPanda avatar Jun 14 '23 14:06 ShogunPanda

I'm talking about the bug discussed in https://github.com/KararTY/dank-twitch-irc/issues/13, which was supposed to be fixed in 20.3.0 :)

tniessen avatar Jun 14 '23 14:06 tniessen

Lol, yes. I know. Coincidentally, two bugs failed in the same assertion. I fixed the first one and now I'm taking care of the second one.

ShogunPanda avatar Jun 14 '23 14:06 ShogunPanda

If we still have bugs that are pending a fix, I think the best thing to do is have a release ASAP that turns it off by default again. The more we wait the more users are going to be bothered by regressions. Improving the implementation of a new feature is not a good justification for regressions, IMO.

(Do we really need a full revert like what this PR does though? It seems to me we just need to set the CLI option to false by default again)

joyeecheung avatar Jun 14 '23 14:06 joyeecheung

@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately.

Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore.

ShogunPanda avatar Jun 14 '23 14:06 ShogunPanda

I don't have a strong opinion either, but It's unlikely to have this PR ready for this week and have it released due to the upcoming security release. Considering the main purpose of this issue was to prevent connectivity issues within IPV6, I feel this is something we'll need to have at some moment.

As stated earlier in this thread, if we had to revert, we should have rolled back weeks ago. Reverting it now into a breaking change isn't great IMO.

RafaelGSS avatar Jun 14 '23 15:06 RafaelGSS

@joyeecheung The option was renamed when I turned it on so yes, we need a full revert unfortunately.

Anyway, just for context, I'd like to emphasize that the option existed until the full course of 19 and no bug reports came in. I'm afraid that if we turn this off we will never be able to turn it on reliably anymore.

How about only disabling in on the v20 branch? I guess v21 can accept a bit more instability as you have seen that v19 barely got any real testing, and that will give plenty of time to fix the issues before v22.

silverwind avatar Jun 14 '23 15:06 silverwind

That might be an acceptable solution, but I'm still pretty skeptical about the approach. I think we might be able to solve all the (major) issues in few weeks. So far, AFAIK, I only have two left, one already solved locally.

I totally agree that we should have reverted this long ago probably (and I apologize about this), but now the pro and cons of a breaking change lean towards the cons.

ShogunPanda avatar Jun 14 '23 17:06 ShogunPanda