django-push-notifications icon indicating copy to clipboard operation
django-push-notifications copied to clipboard

issues with apns_async

Open valhallen13 opened this issue 11 months ago • 3 comments

Hi,

We send notifications to thousands of users, and there are a lot of TimeoutErrors being thrown by the library. I have looked through your code quite a bit and have noticed 2 strange things.

  • apns_async.py:383 the timeout is hardcoded to 1 second. I think this should be a property that can be set in the settings.
  • apns_async.py:336 i dont see the point of raising an error, because the errors are already available in the "results" variable. Raising an error just interrupts the whole flow, and does not even return the results variable which would allow us to mark users that have already received the message.

Furthermore, sending to thousands of users means that there will always be at least one registration_id that will be Unregistered, so the code will throw an error every single time. I dont think that having to run this code in a perpetual try:except should be the default way to implement it. It would work much smoother if you do not raise the error on line 336, or if you return the entire list of results with the exception so at least we can mark the successful users.

valhallen13 avatar Jan 08 '25 11:01 valhallen13

About (1) (hardcoded timeout) - would it make sense for you to have option of overriding _create_client() or is that overkill?

About (2) (raising errors) - if I remember correctly raising errors is how apns.py (non-async) version works because of thats how apns2 handles things. I agree with you that it's cumbersome when handling bulk requests. So you think not raising error when we get unsuccessful result is the best way?

pomali avatar Feb 13 '25 13:02 pomali

I would like to +1 for the mentioned issues, especially the second one. What we are seeing is every time there is at least one unregistered user we get an exception from send_message and no way of knowing which users got the message.

Essentially if everything was successful we get results dict with "Success" state for all and in any other case we don't have access to the results dict but instead a single error message with concatenated errors which we can't even match to the registration/user. What is the purpose of results if we can access it only when EVERYTHING is successful?

The way I see it we need to:

  1. Remove raise APNSError because it breaks the flow and completely disrupts usefulness of returning separate result for each user.
  2. Update conditions because sync version of bulk message checks only for Unregistered message, while async version checks for Unregistered, BadDeviceToken, and DeviceTokenNotForTopic.

I read the comments on the original issue #744 and what @pomali said about apns2 handling things that way, but I just don't get the reasoning for raising an error and what exactly apns2 does that way?

igor-wl avatar Apr 11 '25 13:04 igor-wl

+1 on @valhallen13 remarks, exceptions are "exceptions", in this specific case of having 1 unregistered device or one Timeout in a big batch is fairly common. I'd err on the side of returning a "report" object without raising an exception.

s4cha avatar Oct 13 '25 07:10 s4cha

hi all, thanks for the insights. i've been following this issue for a while and actually had a branch with a fix for this earlier this year.

Given this is a production issue affecting anyone using apns_async at scale, i'll be making some changes and sending out small successive patches to resolve these issues in order:

  1. The timeout configuration issue
  2. The error handling/flow disruption issue

+1 to @igor-wl's remarks about the error handling. I believe we should adopt the approach of not raising exceptions for partial failures and instead return a comprehensive result object.

Currently, the code has an errors list that just takes descriptions. My proposal is to return a tuple of (results, errors):

  • results: The existing dict with success/failure status for each device
  • errors: A list of error objects, where each object contains:
    • registration_id: The device token that failed
    • error_type: The specific error (e.g., "Unregistered", "BadDeviceToken", "Timeout")
    • error_message: Descriptive error message from the exception
    • timestamp: When the error occurred (optional but useful for debugging)

If there are no errors, errors becomes an empty list. This way, a single device failure doesn't kill the entire batch - which is the critical flaw in the current design.

I'll be working on the fixes and will submit PRs shortly this week. Looking forward to your feedback!

PS: let me know your thoughts on this @jamaalscarlett

cc @s4cha @valhallen13

50-Course avatar Dec 14 '25 20:12 50-Course