issues with apns_async
Hi,
We send notifications to thousands of users, and there are a lot of TimeoutErrors being thrown by the library. I have looked through your code quite a bit and have noticed 2 strange things.
- apns_async.py:383 the timeout is hardcoded to 1 second. I think this should be a property that can be set in the settings.
- apns_async.py:336 i dont see the point of raising an error, because the errors are already available in the "results" variable. Raising an error just interrupts the whole flow, and does not even return the results variable which would allow us to mark users that have already received the message.
Furthermore, sending to thousands of users means that there will always be at least one registration_id that will be Unregistered, so the code will throw an error every single time. I dont think that having to run this code in a perpetual try:except should be the default way to implement it. It would work much smoother if you do not raise the error on line 336, or if you return the entire list of results with the exception so at least we can mark the successful users.
About (1) (hardcoded timeout) - would it make sense for you to have option of overriding _create_client() or is that overkill?
About (2) (raising errors) - if I remember correctly raising errors is how apns.py (non-async) version works because of thats how apns2 handles things. I agree with you that it's cumbersome when handling bulk requests. So you think not raising error when we get unsuccessful result is the best way?
I would like to +1 for the mentioned issues, especially the second one. What we are seeing is every time there is at least one unregistered user we get an exception from send_message and no way of knowing which users got the message.
Essentially if everything was successful we get results dict with "Success" state for all and in any other case we don't have access to the results dict but instead a single error message with concatenated errors which we can't even match to the registration/user. What is the purpose of results if we can access it only when EVERYTHING is successful?
The way I see it we need to:
- Remove
raise APNSErrorbecause it breaks the flow and completely disrupts usefulness of returning separate result for each user. - Update conditions because sync version of bulk message checks only for
Unregisteredmessage, while async version checks forUnregistered,BadDeviceToken, andDeviceTokenNotForTopic.
I read the comments on the original issue #744 and what @pomali said about apns2 handling things that way, but I just don't get the reasoning for raising an error and what exactly apns2 does that way?
+1 on @valhallen13 remarks, exceptions are "exceptions", in this specific case of having 1 unregistered device or one Timeout in a big batch is fairly common. I'd err on the side of returning a "report" object without raising an exception.
hi all, thanks for the insights. i've been following this issue for a while and actually had a branch with a fix for this earlier this year.
Given this is a production issue affecting anyone using apns_async at scale, i'll be making some changes and sending out small successive patches to resolve these issues in order:
- The timeout configuration issue
- The error handling/flow disruption issue
+1 to @igor-wl's remarks about the error handling. I believe we should adopt the approach of not raising exceptions for partial failures and instead return a comprehensive result object.
Currently, the code has an errors list that just takes descriptions. My proposal is to return a tuple of (results, errors):
results: The existing dict with success/failure status for each deviceerrors: A list of error objects, where each object contains:registration_id: The device token that failederror_type: The specific error (e.g., "Unregistered", "BadDeviceToken", "Timeout")error_message: Descriptive error message from the exceptiontimestamp: When the error occurred (optional but useful for debugging)
If there are no errors, errors becomes an empty list. This way, a single device failure doesn't kill the entire batch - which is the critical flaw in the current design.
I'll be working on the fixes and will submit PRs shortly this week. Looking forward to your feedback!
PS: let me know your thoughts on this @jamaalscarlett
cc @s4cha @valhallen13