localstorage-retry

Event duplication on stressed clients

Open Dahaden opened this issue 4 years ago • 4 comments

Setup:

  • A couple of unsent events in localstorage,
  • Multiple tabs of the same domain open, running the segment client,
  • Client machine struggling to meet the time intervals for updating ack.

Something we have been seeing recently is that when the client's machine is resource constrained, the reclaim method between tabs of the same domain can cause duplication of events, up to tens of thousands.

We believe this issue would have been made much worse by the previous issues where the reclaim mechanism can keep hold of the localstorage reference (rather than switching to in memory when full, like everything else).

This seems not only to keep the clients slow for longer, but can also cause very slow requests to the servers.

We believe this issue comes about when, for some reason, some tabs fail to update ack consistently, but are able to run the reclaim process. This can lead to queues copying each other at the same time, with cyclic reclaims occurring, or even nested reclaims that are 3 or 4 queues deep.
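To make the failure mode concrete, here is a minimal in-memory sketch of how I understand it. This is not the library's code: `TabQueue`, `isStale`, `reclaim`, `countDuplicates` and the 10 second threshold are all made up for illustration, and the real queue works against localstorage rather than plain objects. It only shows that two tabs which are alive but late on their ack can copy each other's events.

```typescript
// Simplified, in-memory model of ack-based queue reclaim between tabs.
// Every name and constant here is hypothetical, not the library's API.

interface TabQueue {
  id: string;
  lastAck: number;  // timestamp (ms) of the last heartbeat this tab wrote
  events: string[]; // ids of events waiting to be sent
}

const RECLAIM_TIMEOUT_MS = 10000; // assumed staleness threshold

// A tab treats another queue as abandoned when its ack is older than the timeout.
function isStale(queue: TabQueue, now: number): boolean {
  return now - queue.lastAck > RECLAIM_TIMEOUT_MS;
}

// Copy every event from stale queues into the reclaiming tab's queue.
// The owner is never told, so a slow-but-alive owner keeps its own copy too.
function reclaim(self: TabQueue, others: TabQueue[], now: number): void {
  for (const other of others) {
    if (other.id !== self.id && isStale(other, now)) {
      self.events.push(...other.events);
    }
  }
}

// Count how many sends would be duplicates if every tab flushed its queue.
function countDuplicates(tabs: TabQueue[]): number {
  const seen: Record<string, boolean> = {};
  let duplicates = 0;
  for (const tab of tabs) {
    for (const eventId of tab.events) {
      if (seen[eventId]) {
        duplicates += 1;
      } else {
        seen[eventId] = true;
      }
    }
  }
  return duplicates;
}

// Two stressed tabs: both are alive, but neither refreshed its ack in time.
const now = 20000;
const tabA: TabQueue = { id: "A", lastAck: 5000, events: ["e1", "e2"] };
const tabB: TabQueue = { id: "B", lastAck: 6000, events: ["e3"] };

reclaim(tabA, [tabA, tabB], now); // A copies e3 from B
reclaim(tabB, [tabA, tabB], now); // B copies e1, e2 and the already-copied e3 from A

const duplicateCount = countDuplicates([tabA, tabB]);
console.log(duplicateCount); // 3 original events, 7 queued sends
```

With more tabs and repeated reclaim passes the copies compound, which is the cyclic and nested behaviour described above.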

Dahaden, Feb 16 '21 23:02

Recently, this problem has been getting far worse, and our customers are now seeing impacts on their side. The reclaim mechanism causes duplication, which puts more pressure on the CPU and memory of the host machine, which in turn makes the issue even worse.

We have created an isolated environment to test this at https://localstorage-retry-tester.bitbucket.io/ where you can spin up many tabs to create multiple queues on the same domain, and send events in an attempt to reproduce the issue.

This is not a reliable way of reproducing the duplication, but we have had a couple of moments where we have seen the duplicate counter go into the hundreds or even thousands.

There are a couple of different strategies we have tried to reproduce the issue, but they do not work every time.

  1. Open up lots of tabs and try spamming lots of events,
  2. Open up lots of tabs, let your computer go to sleep and wake it back up, and
  3. Open up lots of tabs, close the window, then restore.

You may also notice that there is an option to choose between the "Original" queue and the "fixed" queue. The original queue is @segment/localstorage-retry@latest, which is the state of master in this repo. The fixed queue is all four of my PRs merged together, which I have pushed to this branch of my fork of this repo.

I would love to show you the code for the site above, but there are still a couple of internal blockers preventing me from sharing it. You can always inspect the code running in your browser.

Please let me know when you can get around to checking out my PRs. Our customers are feeling pain from this and we want to work with you to make this repo better :)

Thanks

@bryanmikaelian

Dahaden, Mar 11 '21 03:03

I got my team to help me with a blitz to try and break both the original and "fixed" localstorage-retry reclaim mechanisms. We still got a couple of duplicated events, but it was a lot harder, and performance issues generally recovered faster in the "fixed" queue than in the original.

The initial state for most breakages was:

  1. Lots of tabs (more than 12 is enough, but I believe the more you have, the better the chances of reproduction),
  2. Send lots of events until the UI starts to stutter and the 500ms timer is reporting close to or above 10 seconds.
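For anyone who wants to watch for that stall on their own page, here is a rough sketch of how the gap between ticks of a repeating timer could be measured. The function names are made up, and the 500 ms interval and 10 second threshold are just the numbers from the step above; I am not claiming this is how the tester site measures it.

```typescript
// Hypothetical helpers for spotting a starved repeating timer.

// Time that actually passed between two consecutive ticks, in milliseconds.
function elapsedBetweenTicks(previousTick: number, currentTick: number): number {
  return currentTick - previousTick;
}

// How much later than scheduled the tick fired (never negative).
function timerDelayMs(intervalMs: number, previousTick: number, currentTick: number): number {
  return Math.max(0, elapsedBetweenTicks(previousTick, currentTick) - intervalMs);
}

// True when the gap between ticks has reached the given stall threshold.
function isStalled(previousTick: number, currentTick: number, thresholdMs: number): boolean {
  return elapsedBetweenTicks(previousTick, currentTick) >= thresholdMs;
}

// Example: a 500 ms timer whose ticks landed at 1000 ms and 11000 ms.
const delay = timerDelayMs(500, 1000, 11000);
const stalled = isStalled(1000, 11000, 10000);
console.log(delay, stalled);

// In a browser this could be wired up roughly like so (not executed here):
//   let last = performance.now();
//   setInterval(() => {
//     const t = performance.now();
//     console.log(elapsedBetweenTicks(last, t));
//     last = t;
//   }, 500);
```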

Then you can try:

  1. Duplicating the tab by right clicking the tab at the top of the browser window and clicking "Duplicate",
  2. Switch focus to lots of other apps,
  3. Switch between tabs.

We also noticed that Firefox made this bug easier to reproduce than Chrome (sorry, I haven't updated the static app yet to add these findings, will do this tomorrow).

Dahaden, Mar 11 '21 07:03

Thanks for reporting this @Dahaden! Unfortunately, I am no longer involved with this project, but these steps to reproduce are very helpful. A couple of our data engineers at my current gig have observed some similar behavior, but I am unsure if it is related.

@juliofarah / pooya: Can you confirm this issue?

bryanmikaelian, Mar 11 '21 20:03

@bryanmikaelian @Dahaden I have just published version 1.3.0 of this library with @Dahaden's contributions. Feel free to resolve this issue once you have validated that this is no longer a problem.

Thanks for the coordination here!

juliofarah, Mar 11 '21 23:03