autopush
autopush copied to clipboard
Re-subscribing to a chid that was recently unsubscribed gives code 410 errno 103
In Android Components, we recently landed a change that attempts to fix an expired FxA push subscription when the FxA server sets the flag subscriptionExpired
.
The fix is to perform an "unsubscribe" and then a "subscribe" with the same chid. However, we're seeing logs where the subscribe fails with a code 410 errno 103 ("Expired URL endpoint") after a successful unsubscribe:
11-19 13:50:00.254 18810 18837 I ConstellationObserver: onDevicesUpdate triggered.
11-19 13:50:00.254 18810 18837 I ConstellationObserver: Proceeding to renew registration
11-19 13:50:00.254 18810 18837 I ConstellationObserver: We have been notified that our push subscription has expired; re-subscribing.
11-19 13:50:00.826 18810 18871 I ConstellationObserver: Un-subscribing successful: true
11-19 13:50:00.826 18810 18871 I ConstellationObserver: Subscribe call should give you a new endpoint.
11-19 13:50:00.827 18810 18837 I ConstellationObserver: Incrementing verifier
11-19 13:50:00.827 18810 18837 I ConstellationObserver: Verifier state before: timestamp=1605814852629, count=3
11-19 13:50:00.828 18810 18837 I ConstellationObserver: Verifier state after: timestamp=1605814852629, count=4
11-19 13:50:00.828 18810 18837 I FxaDeviceConstellation: Notifying observer about constellation updates.
11-19 13:50:00.828 18810 18837 I FxaDeviceConstellation: Executing: polling for device commands
11-19 13:50:00.947 18810 18871 W ConstellationObserver: Re-subscribing failed; FxA push events will not be received.
11-19 13:50:00.947 18810 18871 W ConstellationObserver: mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : \"{\\\"code\\\": 410, \\\"errno\\\": 103, \\\"error\\\": \\\"\\\", \\\"more_info\\\": \\\"http://autopush.readthedocs.io/en/latest/http.html#error-codes\\\", \\\"message\\\": \\\"Request did not validate UAID not found\\\"}\"")
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.appservices.push.RustError.intoException(RustError.kt:13)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.appservices.push.PushManager.subscribe(PushManager.kt:13)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.components.feature.push.RustPushConnection.subscribe(Connection.kt:4)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invokeSuspend(AutoPushFeature.kt:5)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invoke(Unknown Source:10)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at mozilla.components.feature.push.ext.CoroutineScopeKt$launchAndTry$2.invokeSuspend(CoroutineScope.kt:5)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:3)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:15)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
11-19 13:50:00.947 18810 18871 W ConstellationObserver: at java.lang.Thread.run(Thread.java:923)
11-19 13:50:00.994 18810 18837 I FxaDeviceConstellation: Successfully executed: polling for device commands
This is the code that is executed to show the logs above.
A breakdown of the logs:
-
Un-subscribing successful: true
is when the Android code executes the native Rust component'sunsubscribe
. -
Subscribe call should give you a new endpoint.
is the native result coming back as true (as seen in the previous line). -
Re-subscribing failed; FxA push events will not be received.
is when receive aPushError
from the native component's wrapper.
Since we have received a result from the unsubscribe that it was successful, the subscribe call to the same chid should work as well provide us with a new endpoint.
Sentry (Nightly): https://sentry.prod.mozaws.net/operations/firefox-nightly/issues/10175063/
From a user, it seems like we're hitting this in non-FxA applications as well: https://github.com/mozilla-mobile/fenix/issues/15028#issuecomment-736834347
I wonder if this is connected a little bit to the behaviour in https://github.com/mozilla-services/autopush/issues/1444, in which I wrote:
I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.
Perhaps this is now evidence of "devices that seem to be in such a state in the wild", since "UAID not found" is exactly the sort of error message I'd expect for a client whose uaid record had been discarded by the server.
FWIW, https://sql.telemetry.mozilla.org/queries/73067#183005 is showing send tab success nose-diving on Android - which prompted me to see how it works for me, and logcat shows what Ryan was getting at:
12-02 14:24:57.224 9884 9926 I ConstellationObserver: onDevicesUpdate triggered. 12-02 14:24:57.226 9884 9926 I ConstellationObserver: Proceeding to renew registration 12-02 14:24:57.226 9884 9926 I ConstellationObserver: We have been notified that our push subscription has expired; re-subscribing. ... 12-02 14:24:58.843 9884 9942 W ConstellationObserver: Un-subscribing failed; subscribe may return the same endpoint 12-02 14:24:58.843 9884 9942 W ConstellationObserver: java.lang.IllegalStateException: Un-subscribing with the native client failed. ... 12-02 14:24:59.057 9884 9942 W ConstellationObserver: Re-subscribing failed; FxA push events will not be received. 12-02 14:24:59.057 9884 9942 W ConstellationObserver: mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : "{\"code\": 410, \"errno\": 103, \"error\": \"\", \"more_info\": \"http://autopush.readthedocs.io/en/latest/http.html#error-codes\", \"message\": \"Request did not validate UAID not found\"}"")
https://sql.telemetry.mozilla.org/queries/73067#183005 is showing send tab success nose-diving on Android
If this started happening in the last week or so, it's probably because on Android we see the subscriptionExpired
flag on the FxA account and are trying to recover by unsubscribing first and then subscribing again to the same scope.
I'm guessing the send tab metrics may have been reporting it as a success but the messages never reached the device and silently dropped.
I'm guessing the send tab metrics may have been reporting it as a success
In theory, if the subscriptionExpired
flag is set on the FxA server when you try to send a tab to that device, the FxA metrics will report that as a failure rather than a success.
"UAID not found" is exactly the sort of error message I'd expect for a client whose uaid record had been discarded by the server.
On the other hand, if the uaid record had been discarded by the server, I would expect the call to unsubscribe
to fail; it's weird that calling unsubscribe
succeeds but then calling subscribe
immediately after it fails.
Yes, it's why I gave the bug the weight I did. There's going to be a bit of investigation to see exactly what's going on here and I'm not sure how easily I can replicate what's happened. The database appears to have gotten into a strange state.
If it's helpful, we're tracking the exceptions from the unsubscribe here in nightly here: https://sentry.prod.mozaws.net/operations/firefox-nightly/issues/10175063/
On the other hand, if the uaid record had been discarded by the server, I would expect the call to unsubscribe to fail; it's weird that calling unsubscribe succeeds but then calling subscribe immediately after it fails.
The sentry breadcrumbs tell us that it is in fact the unsubscribe call that's failing, which makes more sense.
I believe it's the responsibility of the rust code to handle the "missing uaid" case better, and have filed https://github.com/mozilla/application-services/issues/3751 to follow up.
Doing some research on this, I am not able to reproduce it locally. One kind of dumb thought is that if the CHID isn't included and the client sends a DELETE https://.../v1/fcm/.../registration/{uaid}
then the user's entire record is wiped and a 200 returned. Sending a subsequent, additional channel subscription request would fail because the record is no longer present. Not sure how that could happen, unless there's a builder function, since DELETE https://.../v1/fcm/.../registration/{uaid}/subscription/
would return a 404, but it's a thought.
In any case, I see that the number of subscription errors seems a lot lower recently.
In any case, I see that the number of subscription errors seems a lot lower recently.
@jrconlin I left a comment on that Sentry exception that I'll copy here:
This exception will go back down to zero since we've backed out the change that invokes the bad state on the service side. We still need to find an appropriate fix to reintroduce the change or fix the FxA push subscription another way.
We reverted our changes that found the error during the holiday break since we didn't have a solution.
We do still have devices that are stuck in this state as well.
Yeah, I get that. FWIW, I was working with someone outside trying to resolve an issue and was seeing something kind of odd. I cataloged it here: https://github.com/mozilla-mobile/android-components/issues/9426 but I wonder if it might be related? Basically, i wonder if firefox might be getting an old FCM id for some weird reason, which would get rejected by FCM on the server side, leading to the 410. Kinda gets back to the whole "FCM is a weird black box" and I'd love to get away from it, but that's just not happening anytime soon.
I also have a device that is stuck with a 410
for every subscription. Let me know if you want to run any tests.
I am contributing here, as many related issues are closed with reference to this issue.
Samsung Galaxy S9 / Android 10 / OneUI 2.5
-
Testcase 1 https://webpushdemo.azurewebsites.net/ -- Notifications are received on the device (status bar and notification drawer) -- Nothing happens when tapping the notification. Expected: The originating URL is opened in Firefox
-
Testcase 2 Send message through the Discourse messaging system -- Notification shows up after opening Firefox. Expected: No need to open Firefox to receive a Notification -- When tapping the notification I am taken to the main discourse website. Expected: The tap brings me to the message in discourse which generated the notification -- A second message did not fire a notification, even after opening Firefox. Expected: Each message should fire a notification -- Note: In Google Chrome Testcase 2 works as expected.
Please let me know if anything else is needed from my side or if I should file a new issue.
@wurstsemmel Thanks for this, but I think it would be better posted on the fenix repo.
This repo is the back-end service that sends the notifications to desktop and mobile devices. It would also be useful for the folk there if you could include the version of Firefox Android that you're using. ⋮ > Settings > About Firefox
Thanks!
Is this the same bug https://github.com/mozilla-mobile/fenix/issues/19152 My device logs (which I have just attached to that issue) have the same message response has no uaid
.
I am using Firebase Messaging Web library and had been repeatedly subscribing and toggling notification settings for testing, if that's relevant.
Happy to participate in any logging/debugging
I don't think this is the same issue - if I'm reading the client code correctly, the check for a uaid
happens after we've checked for a 401. A recent Fenix Nightly has some improvements here though (look for the about page listing an "AS" version of 86.2.0 or later)
I'd have to dig in to the code a bit. I can't think of a reason off hand that you should get a non-error registration response that doesn't have a UAID in it, unless there might be something that's causing the UAID value to be unreadable by the client (e.g. dashes can sometimes be a problem, if I recall. The code tries to return the same value format that it got, but something could be tripping it up.)