karma icon indicating copy to clipboard operation
karma copied to clipboard

socket `reconnect` from ping timeout does not work correctly

Open johnjbarton opened this issue 3 years ago • 14 comments

This is a long standing issue I've tried to fix in the past.

A user in my company hit a case that failed 30% of the time or so and he ran down the reason:

  • the client takes a long hunk of time (parsing or running a test)
  • the server hits the pingTimeout, disconnects, waits for reconnect
  • the client sends results << These results are lost
  • the client gets pingTimeout, tries reconnect (unsure)
  • server sees reconnect as a new socket and connects.
  • server never sees results, unpredictable result.

As far as I can tell, any data that the client sends after the server thinks the socket is disconnected is lost. The socket.io layer has no replay and neither does karma.

The problem is fundamental. Either we implement replay or we work to avoid disconnect but fail if we hit it.

Replay is not super hard since we know that tests are finite in duration.

But I think that, for the most part, disconnect is self-inflicted. The client is parsing a huge chunk of JS or it is off doing a long test: it fails to respond to a ping. The browser is there and the runtime is live, but we are starving the ping reply. If we increase the pingTimeout we can avoid the disconnect.

The other ways that the pingTimeout can be hit is browser/tab crash or network partition. With a large pingTimeout we wait a long time to discover the tab crash.

I can't come up with a scenario where automatic socket reconnect makes sense for us. Any time we lose connection, we lose data and the test is invalid. We could say that, eg logging up until the test starts is "optional" so we could reconnect if we lost connection before start. But if we lose connection before start the thing we need to figure out why is exactly those logs.

I think we should flush the entire reconnect logic, all of the timeouts other than pingTimeout, and retry logic in favor error messages directing users to look into the issue and then increase the pingTimeout.

johnjbarton avatar Feb 09 '21 00:02 johnjbarton

We have logic to relaunch browsers (and thus retry the test) if the socket disconnects during a test.

Is this worth having? It's not fundamentally wrong: if the test passes its fine. It seems marginally wrong in that the most common case would be a flaky test that is close to the pingTimeout (too large JS or too long test). The retry might pass if the first one fails, but it means we don't address the underlying problem (pingTimeout too small).

Can the network could be legitimately flaky in some cases these days?

johnjbarton avatar Feb 09 '21 23:02 johnjbarton

There is some replay code: https://github.com/karma-runner/karma/blob/master/lib/server.js#L293 This supports the case where the browser connects then registers and in between it sends some other events. I can't see any way this can matter. Until the execute event is sent from the server to the client, none of those event can happen. And if they did which ones did we miss before the connect event?

johnjbarton avatar Feb 10 '21 00:02 johnjbarton

A co-worker and myself as new users of karma ran into this issue (Disconnected reconnect failed before timeout of 2000ms (ping timeout)) many times during testing. We pulled John's PR and after some pretty exhaustive testing, I can say it has 100% resolved our problems. Thank you! Looking forward to this getting merged.

trevorgithub avatar Feb 20 '21 01:02 trevorgithub

@johnjbarton, any update on this? The reason my team recently adopted Karma is so that we could leverage the integration with Istanbul to generate coverage reports for our suite of Jasmine tests. When using Karma with headless Chrome via Puppeteer, the current Karma code fails far more often than not at the very end of the test run (with the ping timeout error). As a result, the Istanbul tests are not written to disk. For such a use case, the pull request completely resolves the problem (based on our testing) and is essential to generating coverage reports.

trevorgithub avatar Mar 10 '21 17:03 trevorgithub

I just want to be sure that all karma 6.0 issues are resolved before we go ahead. This reconnect code is only triggered when things go bad so I don't trust our test coverage. I appreciate your efforts and I will try to pick up the pace.

johnjbarton avatar Mar 10 '21 23:03 johnjbarton

@johnjbarton, just wanted to check in on this one. I appreciate it's only one issue of many the team is focused on, so I understand it can take some time.

I also wanted to update you that today I pulled the disconnect branch from your forked repo, and tested again. Unfortunately, the original problem has resurfaced. When I originally tested the fix you made in February, it was rock solid. Today when looking at the git log on that disconnect branch, I can see your fix at the top of the disconnect branch, but a number of other commits with a more recent date beneath it the git log. I guess maybe you rebased with the base branch, and perhaps something was incompatible with your changes. Perhaps this is something you already know, but I wanted to bring it to your attention in case it's news. Thanks!

trevorgithub avatar May 06 '21 23:05 trevorgithub

Hi @trevorgithub , John has left the team and probably won't be active on Karma project. @jginsburgn and I (@gjyalpha ) will keep working on the Karma open source project. Well, due to other priorities, I'm not sure when we will get another chance to look on this yet. Will keep this thread posted if we have any update. Thanks!

gjyalpha avatar May 07 '21 17:05 gjyalpha

@gjyalpha , thank you very much for the prompt feedback, it is much appreciated. I'm saddened to hear that work may be lost (or at least postponed) but I am very much grateful for the information.

trevorgithub avatar May 07 '21 19:05 trevorgithub

@trevorgithub can you please share a log from the failing run using the latest PR build with the --log-level debug flag? The code in the PR looks correct, so I wonder if we can get some insights into why it didn't work for you?

devoto13 avatar May 12 '21 07:05 devoto13

Actually, please wait a bit. I see that the last build in PR failed, I'll fix it in the coming days and then ask you to re-test. Sorry for the noise!

devoto13 avatar May 12 '21 07:05 devoto13

Two or three e2e (cucumber) tests fail. (I was debugging these on my last day when the power on our block went out for 4 hours so I was unable to finish).

As far as I can tell, these fails indicate that the PR is not fully correct. The failing cases seem to involve 1) launching the server 2) running additional command line commands. These additional commands fail because the server has exited. Since part of the change involves the logic to exit when the number of browsers goes to zero, I think that the related (no)single run logic is broken. Since I've never used the related command-line features I was uncertain about that logic anyway.

I planned to take another look but I'm unsure when I will have a chance. So far my retirement hasn't left me with any free time ;-)

jjb

On Wed, May 12, 2021 at 12:52 AM Yaroslav Admin @.***> wrote:

Actually, please wait a bit. I see that the last build in PR failed, I'll fix it in the coming days and then ask you to re-test. Sorry for the noise!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/karma-runner/karma/issues/3652#issuecomment-839548823, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABSGAW5W5JURJLPN5IUDYDTNIXTZANCNFSM4XJ6XLHQ .

johnjbarton avatar May 12 '21 13:05 johnjbarton

Thanks so much for taking a look @johnjbarton. Enjoy your retirement!

@devoto13: Do you want me to take a look at this?

jginsburgn avatar May 15 '21 00:05 jginsburgn

@jginsburgn Yes, once I have some time to spare :)

devoto13 avatar May 16 '21 17:05 devoto13

Could this issue be why sometimes onBrowserStart is not fired, breaking reporters? I made a comment about junit and progress reporter here: https://github.com/karma-runner/karma-junit-reporter/issues/188#issuecomment-989624164

Perhaps the following occurs:

  • Karma client emits start and sets startEmitted=true
  • socket gets closed
  • start message is not delivered to the Karma server and lost
  • The socket reconnects and client now emits register with isSocketReconnect=true
  • Server browser.reconnect is executed and does not trigger execute...
  • Client will never send start because startEmitted === true

jr01 avatar Dec 11 '21 14:12 jr01