[HOLD #53826] Not receiving realtime updates to desktop/web session

Open m-natarajan opened this issue 1 year ago • 23 comments

If you haven’t already, check out our contributing guidelines for onboarding and email [email protected] to request to join our Slack channel!


Version Number:
Reproducible in staging?: Needs Reproduction
Reproducible in production?: Needs Reproduction
If this was caught on HybridApp, is this reproducible on New Expensify Standalone?:
If this was caught during regression testing, add the test name, ID and link from TestRail:
Email or phone of affected tester (no customers):
Logs: https://stackoverflow.com/c/expensify/questions/4856
Expensify/Expensify Issue URL:
Issue reported by: @quinthar
Slack conversation (hyperlinked to channel name): ts_external_expensify_quality

Action Performed:

  1. Login to staging.new.expensify.com as user A
  2. As user B send messages to user A

Expected Result:

User A receives message in real time

Actual Result:

User A sees the typing indicator but does not receive realtime updates in the desktop/web session, even though push notifications for the same messages arrive on mobile.

Workaround:

Can the user still use Expensify without this being fixed? Have you informed them of the workaround?

Platforms:

Which of our officially supported platforms is this issue occurring on?

  • [ ] Android: Standalone
  • [ ] Android: HybridApp
  • [ ] Android: mWeb Chrome
  • [ ] iOS: Standalone
  • [ ] iOS: HybridApp
  • [ ] iOS: mWeb Safari
  • [x] MacOS: Chrome / Safari
  • [ ] MacOS: Desktop

Screenshots/Videos

Add any screenshot/video evidence

image (18)

image (19)

image (20)

image (21)

https://github.com/user-attachments/assets/7b474301-da0e-40c3-b26a-188569db9537

Current Issue Owner: @deetergp

m-natarajan avatar Nov 12 '24 23:11 m-natarajan

Triggered auto assignment to @deetergp (AutoAssignerNewDotQuality)

melvin-bot[bot] avatar Nov 12 '24 23:11 melvin-bot[bot]

Triggered auto assignment to @trjExpensify (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details. Please add this bug to a GH project, as outlined in the SO.

melvin-bot[bot] avatar Nov 12 '24 23:11 melvin-bot[bot]

This has been labelled "Needs Reproduction". Follow the steps here: https://stackoverflowteams.com/c/expensify/questions/16989

MelvinBot avatar Nov 12 '24 23:11 MelvinBot

@deetergp I'm assuming notification issues like this need to remain internal, but let me know if you don't think so and we can ask a C+ to get involved as a next step to try and reproduce.

I can't seem to repro this myself. The question from the thread is: "Why isn't the ping/pong detecting and fixing this?"

trjExpensify avatar Nov 13 '24 02:11 trjExpensify

This happened again; I can't figure out how to reproduce reliably though.

quinthar avatar Nov 16 '24 01:11 quinthar

@deetergp, @trjExpensify Huh... This is 4 days overdue. Who can take care of this?

melvin-bot[bot] avatar Nov 18 '24 09:11 melvin-bot[bot]

@deetergp thoughts on the above, will you be able to look at this today?

CC: @muttmuure I think this one is in the CRITICAL category for #quality, so I've moved it there.

trjExpensify avatar Nov 18 '24 11:11 trjExpensify

Great!

muttmuure avatar Nov 18 '24 12:11 muttmuure

@trjExpensify I've spent a bit of time with this today and I also cannot seem to reproduce it. I've been having a protracted conversation between the ExpensiScotts (-fy.com & -fail.com) in splitscreen browser windows and they both come through fine. I'm looking at DM chat between DB & Kadie to see if there's anything "off" about what's in Auth and in the logs.

deetergp avatar Nov 18 '24 22:11 deetergp

Gotcha. I'm sure DB would be happy to live debug or something, if you want to take it to the thread: https://expensify.slack.com/archives/C05LX9D6E07/p1731449676200089?thread_ts=1731299637.345689&cid=C05LX9D6E07

trjExpensify avatar Nov 18 '24 23:11 trjExpensify

@deetergp, @trjExpensify Whoops! This issue is 2 days overdue. Let's get this updated quick!

melvin-bot[bot] avatar Nov 22 '24 09:11 melvin-bot[bot]

Spent a bit of time looking into this today and it is interesting. A log search for blob:"PusherError" returns tens of thousands of results for just the last 24 hours. They all have the 1006 error code, which Pusher's documentation describes as follows:

When a WebSocket connection is closed without a "close frame", the pusher-js library emits an error with code 1006. Usually this is caused by WebSocket-incompatible proxies, which can't close the connection in the correct way.

Looking specifically into @quinthar's logs, I see an interesting 1006 log line that pops up: Software caused connection abort. Between my own searching and ChatGPT, it sounds like poor network connectivity can be a culprit, as can "Version or Library Mismatch". I found some GH issues from 2021 that talk about needing to be on the latest (for the time) version of 9.x. Looking in our package.json file, it looks like we are on v 8.3.0. Maybe we need to update the version of the pusher client we are using?

I'm not sure how involved updating to a newer version might be, maybe @mountiny or @AndrewGable might have some insight?
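
To make the 1006 check concrete, here is a minimal TypeScript sketch of pulling the close code out of a connection error. The payload shape (`error.data.code`, optional `error.data.message`) is an assumption based on how pusher-js describes its connection `error` event, and `getCloseCode` and `isAbnormalClosure` are hypothetical helper names, not part of pusher-js or the App codebase.

```typescript
// Assumed shape of the payload handed to a pusher-js connection "error" handler.
// Every field is optional because not all errors carry a close code.
interface PusherErrorEvent {
    type?: string;
    error?: {
        type?: string;
        data?: {code?: number; message?: string};
    };
}

// WebSocket close code for a connection that ended without a close frame.
const ABNORMAL_CLOSURE = 1006;

// Hypothetical helper: returns the close code if the event carries one.
function getCloseCode(event: PusherErrorEvent): number | undefined {
    return event.error?.data?.code;
}

// Hypothetical helper: true only for the 1006 case discussed above.
function isAbnormalClosure(event: PusherErrorEvent): boolean {
    return getCloseCode(event) === ABNORMAL_CLOSURE;
}
```

With a real client, a check like this would typically run inside a handler registered with `pusher.connection.bind('error', handler)`.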

deetergp avatar Nov 25 '24 23:11 deetergp

@deetergp @trjExpensify this issue was created 2 weeks ago. Are we close to a solution? Let's make sure we're treating this as a top priority. Don't hesitate to create a thread in #expensify-open-source to align faster in real time. Thanks!

melvin-bot[bot] avatar Nov 26 '24 09:11 melvin-bot[bot]

@deetergp I don't know the specifics of what updating Pusher would involve, but here is a PR from the last time we did it, and it seems to have gone fine without any specific testing. So I would check whether there are any breaking changes that should worry us and then try to update it. However, we are already on the latest officially stable version, 8.3.0 (https://www.npmjs.com/package/pusher-js?activeTab=versions); the next version, 8.4.0, is still a release candidate.

mountiny avatar Nov 26 '24 11:11 mountiny

Hmm… Maybe I'm confusing versions of other things. @quinthar Does this happen when you're using a poor connectivity setting in Dev Tools? Just trying to narrow down possible causes…

deetergp avatar Nov 27 '24 07:11 deetergp

@deetergp, @trjExpensify Eep! 4 days overdue now. Issues have feelings too...

melvin-bot[bot] avatar Dec 02 '24 09:12 melvin-bot[bot]

No I never use that setting. However, I do often work on poor networks.

quinthar avatar Dec 02 '24 16:12 quinthar

More discussion happening here and it sounds like we are leaning toward carving out a mini project to add our own Pusher ping to ensure the channel is open.

deetergp avatar Dec 04 '24 07:12 deetergp

@deetergp, @trjExpensify Huh... This is 4 days overdue. Who can take care of this?

melvin-bot[bot] avatar Dec 09 '24 09:12 melvin-bot[bot]

More discussion happening here and it sounds like we are leaning toward carving out a mini project to add our own Pusher ping to ensure the channel is open.

That sounds like a good idea! Will you create a new tracking issue for that, or use this one?

trjExpensify avatar Dec 09 '24 10:12 trjExpensify

That sounds like a good idea! Will you create a new tracking issue for that, or use this one?

I think maybe a new tracking issue might be the right call. I'll make a new one and close this one in favor of that.

deetergp avatar Dec 09 '24 17:12 deetergp

Going on hold until we have an application layer PING dedicated to Pusher. When we do, we can either fix actionable connection failures we find, or close this because a consistent application layer PING keeps pusher working well.
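
For illustration only, an application-layer PING could be tracked with a small state machine like the TypeScript sketch below. Nothing here comes from the App codebase or from pusher-js: `PingWatchdog`, its method names, and the interval and timeout values are all hypothetical, and the sketch deliberately owns no timers or sockets so the caller decides how the ping is actually sent.

```typescript
type ChannelState = 'healthy' | 'awaiting-pong' | 'stale';

// Hypothetical application-layer ping tracker. The caller sends the real ping
// and reports pongs; this class only decides whether the channel looks alive.
class PingWatchdog {
    private readonly pingIntervalMs: number;
    private readonly pongTimeoutMs: number;
    private lastPingSentAt: number | null = null;
    private lastPongAt: number | null = null;

    constructor(pingIntervalMs: number, pongTimeoutMs: number) {
        this.pingIntervalMs = pingIntervalMs;
        this.pongTimeoutMs = pongTimeoutMs;
    }

    // True when no ping has been sent yet or the interval has elapsed.
    shouldPing(now: number): boolean {
        return this.lastPingSentAt === null || now - this.lastPingSentAt >= this.pingIntervalMs;
    }

    recordPingSent(now: number): void {
        this.lastPingSentAt = now;
    }

    recordPong(now: number): void {
        this.lastPongAt = now;
    }

    // 'stale' means the latest ping went unanswered for longer than the timeout.
    state(now: number): ChannelState {
        if (this.lastPingSentAt === null) {
            return 'healthy';
        }
        const answered = this.lastPongAt !== null && this.lastPongAt >= this.lastPingSentAt;
        if (answered) {
            return 'healthy';
        }
        return now - this.lastPingSentAt > this.pongTimeoutMs ? 'stale' : 'awaiting-pong';
    }
}
```

A caller would check `state(Date.now())` periodically and force a reconnect when it reports `stale`; how the ping itself travels over Pusher is left open here.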

muttmuure avatar Dec 10 '24 13:12 muttmuure

@deetergp before I add AutoAssignerNewDotQuality to the issue this is held on, do you want to be assigned to that one too since you've already done research?

mallenexpensify avatar Dec 10 '24 21:12 mallenexpensify

@mallenexpensify @muttmuure Let's put this one out there and see if someone else wants it. I've got Vacation Delegates and Improving System Messages in my immediate future, but if this is still out there when I get those in a happy place, I'll snag it.

deetergp avatar Dec 12 '24 01:12 deetergp

I'm going to take this and begin looking into it.

tgolen avatar Dec 18 '24 15:12 tgolen

Daily Update

I looked into some of these logs, and I found that the logs for [email protected] are actually quite abnormal when it comes to the 1006 error. I compared them with some other accounts ([email protected] logs and also [email protected] logs).

David's logs contain error messages from Pusher like:

  • Read error: ssl=0xb400007bf6708a18: I/O error during system call, Software caused connection abort
  • Software caused connection abort
  • Connection interrupted (undefined)

These errors are virtually non-existent from the other two accounts.
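
A comparison like this can be sketched as a per-account tally of 1006 errors, split by whether an error message was attached. The TypeScript below is only an illustration: the `PusherErrorLog` record is a hypothetical, simplified stand-in, since the real log format is not shown in this thread, and `tally1006ByAccount` is a made-up helper name.

```typescript
// Hypothetical, simplified log record; not the real log schema.
interface PusherErrorLog {
    account: string;
    code: number;
    message?: string;
}

interface Tally {
    withMessage: number;
    withoutMessage: number;
}

// Counts 1006 errors per account, split by whether a message was attached.
function tally1006ByAccount(logs: PusherErrorLog[]): Map<string, Tally> {
    const counts = new Map<string, Tally>();
    for (const log of logs) {
        if (log.code !== 1006) {
            continue; // Other error codes are out of scope for this comparison.
        }
        const entry = counts.get(log.account) ?? {withMessage: 0, withoutMessage: 0};
        if (log.message) {
            entry.withMessage += 1;
        } else {
            entry.withoutMessage += 1;
        }
        counts.set(log.account, entry);
    }
    return counts;
}
```

An account whose `withMessage` count dominates would stand out the same way David's logs do above.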

I have opened up a support case with Pusher to see if they can help diagnose and understand these errors better and see if we can find a fix.

Next Steps

  • @tgolen Work with Pusher support to understand the errors more
  • @tgolen begin investigating the abnormal errors to see what other users are experiencing them

ETA

  • TBD

tgolen avatar Dec 18 '24 19:12 tgolen

Daily Update

  • I haven't gotten any response from Pusher
  • I submitted another support request to them, this time from infra@expensify, which is the email attached to the account (yesterday I had used my own email)

Contents of Email

Hello,

Our Expensify application has users experiencing many connection errors that return with the error code 1006. I have been trying to debug these errors more and I am reaching out for your help and guidance.

I looked at our logs for three users:

The first two accounts have normal looking logs. When the 1006 error happens, there is no "message" property in the error and their clients are able to reconnect right away.

The last account has abnormal looking logs. The 1006 error has a "message" property with things like:

  • Read error: ssl=0xb400007bf66e7358: I/O error during system call, Software caused connection abort
  • Connection interrupted (undefined)
  • Software caused connection abort

These messages are virtually nonexistent in the first two accounts, but the last account is not the only one where they occur. When these errors happen, the client is NOT able to reconnect to Pusher; it stays disconnected and cannot receive any more Pusher messages.

I have gathered the logs for the last couple of days and put them into this Google doc which you can see for further analysis: https://docs.google.com/document/d/1TYUsq2IYaKiNTAWE89LH5Cmq2WLbBeRQ-3sCp--goZY/edit?tab=t.0

Can you please help me diagnose these messages further? Thank you, Tim

Next Steps

  • @tgolen Wait for a reply from Pusher and try to work with them to understand the errors

ETA

  • TBD

tgolen avatar Dec 19 '24 16:12 tgolen

Daily Update

  • I still haven't gotten any response from Pusher
  • I found a contact form on their site for "Premium Support" and I filled out the contact form to see if that will wake someone up

Next Steps

  • @tgolen Keep trying to engage with Pusher and contact them

ETA

  • TBD

tgolen avatar Dec 20 '24 16:12 tgolen

Reckon we won't hear back soon cuz of the holidays. I don't see an email in the cell for Relationship Owner (External) on the VML. Cole's the relationship owner, if we're still getting 👻 from Pusher, he can likely help light a 🔥 in the new year.

mallenexpensify avatar Dec 23 '24 16:12 mallenexpensify

@tgolen, @trjExpensify Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

melvin-bot[bot] avatar Dec 24 '24 09:12 melvin-bot[bot]