node-gcm Support for sending to > 1,000 registration tokens

As discussed in #42 with @hypesystem, we'd like to be able to support more than 1,000 registration tokens in the sender.send and sender.sendNoRetry calls -- today, sending more than 1,000 registration tokens will invoke an error, as GCM will reject the request.

Modules

parallel-batch, developed by @hypesystem
async, a popular parallelization module
some other parallel batching solution

Things to consider

What happens when one of the batch requests fails for some reason?
- I think we should retry that request if we've been invoked via sender.send.
- Does GCM send out notifications to some devices in the batch request even though it failed?
- In case the request still fails X number of times (amount specified or the default for sender.send), I think we should forget about that batch and keep going -- the other ones might not error out.
- After completing all other requests, we should pass an error to the provided callback, with info on how many batch requests failed, for example, to let the application know.
Are there legitimate cases where the batch request may fail and we could do something to make it succeed?
- We should review the GCM error response codes to see if there are errors that could be prevented, such as sending a malformed registration token, that could cause the whole batch request to fail.

Oct 17 '15 09:10 eladnava

@hypesystem did you get a chance to take a look at this, mate?

Oct 29 '15 20:10 eladnava

Not yet. The biggest challenge, as I see it, is if some batches succeed and other fails: this needs to be returned in a meaningful way to the user.

I'll give an example with some smaller sizes: we want to send 6 notifications and have a batch size of 2. This happens:

Batch 1 succeeds, all notifications being sent
Batch 2 has one succeeding and one failing notification. One of the notifications failed to send, and got an error back from GCM.
Batch 3 failed altogether: after three retries it still could not get a non-500 code from the GCM server.

How do we represent the result here?

We cannot just return an error, because some of the notifications were sent. The user needs to see this, and not resend them.

The best suggestion I have is to return success and a result object looking something like this:

[
  { success: true },
  { success: true },
  { success: true },
  { error: "original error that happened" },
  { error: "some error we made up for 500-codes" },
  { error: "some error we made up for 500-codes" }
]

For failing batches we return a result-object for each notification that explains the error.

Again, this is my best suggestion at the moment, and there might be a better solution. No matter what it is a tricky situation to handle.

Oct 30 '15 11:10 hypesystem

@hypesystem what do you mean by notification?

For failing batches we return a result-object for each notification that explains the error.

Are you referring to each individual notification sent to every device in the current batch?

Oct 30 '15 14:10 eladnava

By notification I mean the individual message sent to a device. I should have probably just called it "message" to lessen confusion.

Are you referring to each individual notification sent to every device in the current batch?

Yes. And also the final response when merging all the results before returning the result to the user.

Oct 31 '15 09:10 hypesystem

@hypesystem OK, that makes sense.

This is the format of the current result object returned in the sender.send callback:

{
    "failure" : 1, // Number of messages that could not be processed
    "success" : 5, // Number of messages that were sent successfully
    "canonical_ids": [...],
    "results": [...],
    ...
}

Given that apps already interface with it, I think that we should not modify its format. Batch requests should return the same object, incrementing the failure count or erroring out completely if necessary. And they could concat the results and canonical_ids from all of the batch requests into that single array which is returned to the app, in the same order that the registration tokens were supplied.

GCM Error Response Handling for Batch Requests

I think that we need to define what "batch request fails" means. There are 4 types of possible responses from the GCM downstream HTTP API.

This is how I think we should handle the HTTP response codes for each batch request:

400 Bad Request - bad JSON sent. No retry mechanism for the current request, as we'd be generating the same bad JSON again. Skip the current batch and send the next one, incrementing the failure count with the number of registration tokens in the current batch.
401 Unauthorized - bad API key. Error out completely, don't try any other batches, as this error is not specific to the current batch.
5xx Internal Server Error - GCM error, possibly a timeout on their end. We should retry the current batch while honoring the exponential back-off requirement. If we run out of tries and still can't get the message to be sent, this is where things get tricky. I think we should skip the current batch and execute the next one, incrementing the failure count with the number of registration tokens in the current batch that failed. It beats erroring out and not sending any other batches, in my opinion.
200 OK + results[] - similar behavior to how things work today. Inspect the results[] array and make a list of failed registration tokens from all the batch requests. When we're done with all the batches, try to re-send to those failed registration tokens (splitting those up into batches as well if necessary). Again, increment the failure count as necessary to reflect the number of failed messages for all batch requests.

As per your example, the result returned should be:

{
    "failure" : 1, // Number of messages that could not be processed
    "success" : 5, // Number of messages that were sent successfully
    "canonical_ids": [...],
    "results": [...],
    ....
}

What do you think?

Oct 31 '15 14:10 eladnava

Checking the error responses is nice but what if the request times out? Should it be considered like a failure, even though the messages may have gone through? I suppose the same issue exists for submissions <= 1000 requests as well though.

Aug 05 '16 06:08 rraziel

I think a timeout should just result in a retry; if it times out all the tries, it will fail altogether for all the affected registration ids.

Aug 08 '16 20:08 hypesystem

Hi, it's not available ?

Jul 15 '17 15:07 BackPackerDz

@BackPackerDz No, it is not. Check out a workaround here: https://github.com/ToothlessGear/node-gcm/issues/42#issuecomment-145379922

Jul 15 '17 18:07 eladnava

node-gcm node-gcm copied to clipboard

Support for sending to > 1,000 registration tokens

Modules

Things to consider

GCM Error Response Handling for Batch Requests

node-gcm
node-gcm copied to clipboard