
Socket connection drops next notification if invalid notification is sent

Open kyledrake opened this issue 12 years ago • 59 comments

So, this is the problem I ran into on Lead Zeppelin (https://github.com/geoloqi/lead_zeppelin). It's kind of a big problem right now, and I haven't figured out a graceful way to fix it yet.

If you send an invalid message (such as one with a bad device token), the connection will drop without throwing an exception for the next message, and that next message will effectively be dropped.

Example:


pusher = Grocer.pusher(certificate: "/path/to/validcert.pem")
notification = Grocer::Notification.new(device_token: "valid device token", alert: 'this will work')
pusher.push(notification)
notification = Grocer::Notification.new(device_token: "bad device token", alert: 'this will fail, as expected')
pusher.push(notification)
notification = Grocer::Notification.new(device_token: "valid device token", alert: 'THIS WILL UNEXPECTEDLY FAIL WITH NO EXCEPTIONS OR WARNINGS')
pusher.push(notification)
notification = Grocer::Notification.new(device_token: "valid device token", alert: 'this will work because an exception will be thrown, reconnecting the socket')
pusher.push(notification)

I don't know what the source of this problem is: whether it's something in Ruby, something on Apple's APNS servers, some sort of buffering issue, or what.

I worked around this problem by putting in an IO.select that waits a specific period of time to see if the socket becomes readable. With the enhanced protocol, being readable indicates that there was an error, and because the failure is silent, the only way to catch it is to wait for a response before using the connection again. But this solution sucks, because it slows everything down.

You can see the implementation of that IO.select here: https://github.com/geoloqi/lead_zeppelin/blob/master/lib/lead_zeppelin/apns/gateway.rb#L96
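The core of that workaround can be sketched like this (hypothetical helper name, with a plain socketpair standing in for the SSL connection to Apple):

```ruby
require 'socket'

# Minimal sketch of the IO.select workaround described above. With the
# enhanced protocol, the socket becoming readable means Apple has written
# an error response; silence until the timeout means the notification was
# (probably) accepted. `error_pending?` is an illustrative name.
def error_pending?(sock, timeout = 0.5)
  readable, = IO.select([sock], nil, nil, timeout)
  !readable.nil?
end

# Simulated with a socketpair standing in for the APNS connection:
apns, apple = UNIXSocket.pair
error_pending?(apns, 0.1)               # no error yet: times out, false
apple.write("\x08\x08\x00\x00\x00\x01") # fake enhanced-protocol error frame
error_pending?(apns, 0.1)               # error waiting: true
```

The slowdown the comment describes comes from paying that timeout on every send, even when nothing is wrong.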

The solution I was pondering to deal with this problem was to switch back to the version 0 protocol, but then of course you don't get the enhanced error response, which also kind of sucks.

The real solution would be for APNS to send a confirmation bit like they should, or drop the connection correctly... but that's not going to happen anytime soon. Really at a loss for how to solve this.

kyledrake avatar Jun 22 '12 23:06 kyledrake

My guess is that this issue is being caused by the fact that you're using typical non-blocking sockets, which are implemented a bit weirdly in Ruby by default. I ran into a similar experience, and switching to kgio (http://bogomips.org/kgio/) resolved it. Voodoo, I know -- but what works, works.

kenkeiter avatar Jun 23 '12 03:06 kenkeiter

I was thinking of trying kgio to see if it helped, I will try this. Thank you!


kyledrake avatar Jun 24 '12 04:06 kyledrake

KGIO does not appear to resolve this situation (in Lead Zeppelin, at least). I have tried this on JRuby and got the same results. After testing 3 different SSL socket implementations, I'm starting to arrive at the conclusion that there is no way to fix this except to introduce an IO.select and wait an arbitrary amount of time for an error before using the connection again. If people more adept at socket programming see an obvious problem here, feel free to chime in; it's entirely possible that we're just not doing some cryptic thing that we're supposed to do.

Implementing version 0 of the protocol is another option on the table that might resolve this problem (at the expense of better error reporting).

kyledrake avatar Jun 24 '12 18:06 kyledrake

I had this kind of error using apn_on_rails, and I didn't find a good solution either. I used to rescue the error with the device id of the bad token, reopen a connection, and start with the next device.

I'm curious about what you'll find out.

netbe avatar Jul 26 '12 20:07 netbe

The only implementation that does this correctly, AFAICT, is Lead Zeppelin, which uses IO.select to wait an arbitrary time to see if an error comes back. This means it can send fewer messages, though.

My one and only idea for solving this would be to try to implement the first version of the protocol (0) and see if the problem goes away.
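For reference, this is roughly what a version 0 ("simple") frame looks like; the field layout follows Apple's legacy binary interface, and the method name is illustrative:

```ruby
# Sketch of the version 0 ("simple") binary frame: 1-byte command (0),
# 2-byte token length, the 32-byte token, 2-byte payload length, payload.
# Note there is no identifier field and Apple sends no error response for
# this format, which is the trade-off discussed above.
def simple_frame(token_hex, payload_json)
  token = [token_hex].pack('H*') # 64 hex chars -> 32 raw bytes
  [0, token.bytesize].pack('Cn') + token +
    [payload_json.bytesize].pack('n') + payload_json
end

frame = simple_frame('ab' * 32, '{"aps":{"alert":"hi"}}')
frame.bytesize # 1 + 2 + 32 + 2 + 22
```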

I'm really surprised that I'm the only person who has brought this up before; I found this problem immediately with very basic testing.

kyledrake avatar Jul 27 '12 21:07 kyledrake

I'm running into this as well. Any solutions in the past three months?

kevboh avatar Oct 12 '12 12:10 kevboh

Nothing has been attempted yet. My only idea so far is to try to implement the original version of the protocol and see if it fixes it. I have not had the time to work on this, unfortunately.

I implemented an APNS solution around the same time called lead_zeppelin that uses an arbitrary IO.select wait time, which you can find in my repository. Until a better solution is found, that small wait solves the problem, and you could probably implement something like it in grocer with a sleep of the same length. Not ideal, of course, but probably manageable if you're not sending a ton of messages.

FWIW, every APNS implementation I've tried has had the exact same problem. It's not a problem specific to Grocer. I would describe it more as an unfortunate consequence of the implementation on Apple's end.

I want to try to implement the original protocol with grocer to see if it fixes it. I think that's the next action here.


kyledrake avatar Oct 12 '12 12:10 kyledrake

Sounds good. I'll check out lead_zeppelin.

kevboh avatar Oct 12 '12 12:10 kevboh

For #43 I introduced 2 different IO.selects:

  1. Select on read and write after every call. This doesn't block, but surfaces errors as soon as they arrive.
  2. Select on read only once you are done. This ensures any errors raised will be found. (You will still need to go back and resend any notifications sent after the error occurred.)

I tried sleep(1) before reading, but that sometimes waited too long, so the Apple connection was closed and I couldn't get the error information.

kbrock avatar Mar 13 '13 19:03 kbrock

@vanstee @kbrock @kyledrake @aleph7 I want to try to clear some things up, get us all on the same page, and then propose a possible, not-waiting-after-every-write, solution.

TL;DR: avoid waiting after every write and move to a producer/consumer setup.

The problem(s)

  1. Simple notifications: Apple sends no errors back when/if a notification fails. They just silently close the connection. So that stinks.
  2. Enhanced notifications: Apple sends errors back only for notifications that fail. They send nothing for successful notifications.
  3. This applies to both of the above: The connection to APNS is asynchronous, so we can't assume that the notification sent right before the connection closes is the one that was bad. For Simple notifications, we're out of luck. For Enhanced, we could leverage the optional :identifier bytes to know which notification failed.
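A sketch of how those :identifier bytes are carried in an Enhanced (command 1) frame, assuming the legacy binary layout (method name illustrative):

```ruby
require 'json'

# Sketch of the Enhanced (command 1) frame: 1-byte command, 4-byte
# identifier, 4-byte expiry, 2-byte token length, the token, 2-byte
# payload length, payload. The identifier is what Apple echoes back in
# its error response, letting us pinpoint the failed notification.
def enhanced_frame(identifier, expiry, token_hex, payload_hash)
  token   = [token_hex].pack('H*')
  payload = JSON.generate(payload_hash)
  [1, identifier, expiry, token.bytesize].pack('CNNn') + token +
    [payload.bytesize].pack('n') + payload
end

frame = enhanced_frame(42, Time.now.to_i + 3600, 'ab' * 32,
                       { aps: { alert: 'hi' } })
frame[1, 4].unpack1('N') # the identifier (42), echoed in error responses
```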

Current approaches

The approaches so far all use an IO.select to wait an amount of time after every write, just in case it failed. This is a huge performance hit, and because the connection is async, it doesn't actually solve the problem of knowing which notification failed: we could have exceeded our wait timeout and sent another message before receiving the error from the previous one.

An alternative approach

What if we used a couple of thread-safe queues and the producer/consumer pattern? There would be a queue of to_be_sent notifications and a queue of sent notifications. We'd mark each notification with an :identifier before sending it, then immediately push it onto the sent queue. Then we'd use an async read to listen for any errors.

When an error is found, the :identifier in the error message tells us which notification was bad. Anything ahead of it in the sent queue can be assumed to be "good." Everything behind it needs to be re-enqueued in the to_be_sent queue, and the bad notification can be discarded (or collected so the client code can act accordingly).

For more background with better descriptions of the problems, current approaches, and even this kind of queue-based alternative, check out this post: http://redth.info/the-problem-with-apples-push-notification-ser/
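The re-enqueue bookkeeping described above can be sketched like this (class and method names are illustrative, not Grocer's API):

```ruby
# Sketch of the "sent queue" bookkeeping: each sent notification is
# remembered with its identifier. When Apple reports identifier N as bad,
# everything before N is assumed delivered, N itself is discarded, and
# everything after N must be resent.
class ResendBuffer
  def initialize
    @sent = [] # ordered [identifier, notification] pairs
  end

  def record(identifier, notification)
    @sent << [identifier, notification]
  end

  # Returns the notifications that must be re-enqueued.
  def handle_error(failed_identifier)
    idx = @sent.index { |id, _| id == failed_identifier }
    return [] unless idx
    to_resend = @sent.drop(idx + 1).map { |_, n| n }
    @sent.clear
    to_resend
  end
end

buf = ResendBuffer.new
buf.record(1, 'a')
buf.record(2, 'b')
buf.record(3, 'c')
buf.handle_error(2) # ['c'] must be resent; 'a' delivered, 'b' discarded
```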

stevenharman avatar Mar 13 '13 23:03 stevenharman

Hi @stevenharman, @vanstee, @kyledrake, and @aleph7

We are all on the same page: It is not a viable option to block after every push.

But please keep in mind that IO.select does not necessarily block.

One implementation

You can find the actual implementation in our copy of pusher.rb.

notifications.each do |notification|
  push notification
  if check_for_issues_non_blocking
    rewind_to notification_after_error
  end
end
if check_for_issues_blocking(1.second)
  rewind_to notification_after_error
end

Having the 1 second block at the very end allows us to get the response code from Apple. We can confidently clear out all state once this method is finished.

Also, if we were not blocking at the end and Apple did send a response code, we might not come back into this method for another few seconds. By then Apple would have already closed the socket, and the response code would have been lost.

We can use IO.select for both blocking and non-blocking requests:

IO.select takes 4 parameters: read, write, error, and timeout.

For our purposes, there are 2 ways we can call it:

  1. IO.select [@ssl], [], [@ssl], 1
  2. IO.select [@ssl], [@ssl], [@ssl], 1

First example

You have 1 second to see if we can read from the socket (or if there are socket errors). This is the blocking case: check_for_issues_blocking(1.second)

Second example

You have 1 second to see if we can read from OR write to the socket (or if there are socket errors). Since we can basically always write to the socket, it will not block; or if it did, it would have blocked when we wrote to the socket anyway, so no loss. When there is an issue from Apple, IO.select will report that the socket has an error packet on it and that we can read from it. As an aside, IO.select always seems to say we can write to the socket, even after Apple has written to the socket and is about to close it. This is the non-blocking case: check_for_errors_non_blocking
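The two call patterns can be sketched like this (hypothetical helper names, with a plain socketpair standing in for the APNS socket):

```ruby
require 'socket'

# Blocking check: wait up to `timeout` seconds for an error response.
def check_for_issues_blocking(sock, timeout)
  readable, = IO.select([sock], [], [sock], timeout)
  !(readable.nil? || readable.empty?)
end

# Non-blocking check: also wait on writability. Since the socket is almost
# always writable, select returns right away; we only care whether the
# read set fired too, which would mean an error frame is waiting.
def check_for_issues_non_blocking(sock)
  readable, = IO.select([sock], [sock], [sock], 1)
  !(readable.nil? || readable.empty?)
end

apns, apple = UNIXSocket.pair
check_for_issues_non_blocking(apns) # false, and returns immediately
apple.write("\x08")                 # Apple writes an error
check_for_issues_non_blocking(apns) # true
```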

You can see both implementations in our copy of ssl_connection.rb.

Did I miss something?

Most of the implementations I've seen use the first example. And since this blocks, most developers have assumed that the code has to block.

Anyone know if there is a technical reason why developers have only been using the first example instead of leveraging both?

kbrock avatar Mar 14 '13 03:03 kbrock

@stevenharman Totally agree with your "alternative approach". Seems like our best option. Although, we'll need a separate queue for the simple notifications since we wouldn't be able to figure out where in the queue that we failed.

Any thoughts on also providing a blocking API as well?

@kbrock I think blocking code is probably less complex, which might be a good argument for it. For that "second example", are you saying that we can call select on the connection for both read and write, knowing that the write side will immediately be available, and then just check for an error response from Apple? That's pretty clever, although I'm still leaning towards doing a blocking read in another thread so far.

vanstee avatar Mar 14 '13 05:03 vanstee

  1. @vanstee - The second example does not block, and it is what I am currently using after every write. That is the code I am now running in production.
  2. The "alternative approach" could be nice and simple.

I think the main issue is that the secondary thread needs to communicate with the primary thread to say: you need to close the connection, and you need to rewind back and send those alerts again.

Also, after leaving the producer, the secondary thread will not have a primary thread to tell that we need to rewind and send the alerts again.

Maybe it is a three-threaded model: a secondary Apple error thread, a third push-notifications-to-Apple thread, and the primary thread, which just hands notifications to the third thread.

I guess I have tended to avoid threaded producer/consumer code when possible. It just seems to have so many synchronization issues and possible race conditions.

kbrock avatar Mar 14 '13 06:03 kbrock

I think introducing threads overcomplicates things. Something simple like this should suffice

notifications.each do |notification|
  write_blocking notification
  reply = read_nonblocking
  if reply
    handle_reply(reply) # use identifier to know where to resume
    return
  end
end
reply = read_blocking(timeout: 1)
handle_reply(reply) if reply

We read without blocking after each write, and at the end of the loop we do a blocking read so that we don't miss any late replies. If there is a reply, we go back to the identifier that failed.
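For the enhanced case, the reply in question is Apple's 6-byte error response; a parsing sketch (hypothetical helper name):

```ruby
# Sketch of parsing Apple's 6-byte enhanced-protocol error response:
# 1-byte command (always 8), 1-byte status code, and the 4-byte
# identifier of the notification that failed.
def parse_error_reply(bytes)
  command, status, identifier = bytes.unpack('CCN')
  raise "unexpected command #{command}" unless command == 8
  { status: status, identifier: identifier }
end

# Status 8 is "invalid token" in Apple's table of status codes.
parse_error_reply([8, 8, 42].pack('CCN'))
# {status: 8, identifier: 42}: resume sending from the notification
# after identifier 42
```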

In the case of "simple notifications" we would just resume where we left off, which will potentially miss some notifications. Nothing to do about that.

Am I missing something?

alejandro-isaza avatar Mar 14 '13 06:03 alejandro-isaza

In the case of "simple notifications" we would just resume where we left, which will potentially miss some notifications.

Wasn't exactly sure what this meant.

The non-threaded example still has the potential for missing messages, so we would have to do the same history-queue thing for that example as well.

vanstee avatar Mar 14 '13 22:03 vanstee

I don't see why it has the potential for missing messages. By "messages" do you mean notifications or replies? And are you referring to the "simple" or the "enhanced" case?

alejandro-isaza avatar Mar 14 '13 22:03 alejandro-isaza

Both would still have the potential for sending notifications to Apple after Apple has decided to no longer receive data on the socket due to an invalid notification. Since we're using Nagle's algorithm to combine packets sent over the wire (as recommended by Apple), there's the potential for Apple to receive a few notifications from us before they have checked the validity of the first one.

Here's a quick example:

Let's say we send 3 notifications, but the first one is invalid. We'll send all three down the pipe. Apple will start inspecting them and forwarding them to the correct devices. However the first notification has an invalid token so Apple sends us an error response, closes the connection, and throws away the other messages we've already sent. So we would still have to check the identifier in the error response and rewind to that position to make sure we didn't lose any messages.

I'm totally not an expert on networking stuff, so I might be misunderstanding how this actually works. Let me know if I missed anything!

vanstee avatar Mar 15 '13 14:03 vanstee

That is why you use the identifier to know which notification failed and restart from there. See the actual implementation in my comment on https://github.com/grocer/grocer/pull/21#issuecomment-14913962

alejandro-isaza avatar Mar 15 '13 16:03 alejandro-isaza

I'm running an algorithm similar to the one in pull request #21, 10-20 times, playing with some different variables. We introduce an error (by sending a bad token) and end up sending 2 to 4 messages after the bad request before Apple responds.

We have yet to lose a packet or send a duplicate.

Do note: when I introduced a sleep instead of a blocking read at the very end, I did get errors without a response code. So I feel strongly about that.

kbrock avatar Mar 15 '13 18:03 kbrock

One thing to keep in mind - the producer/consumer model has a very nice, and desirable, side benefit of also being thread safe. The read-after-each-send-and-once-more-at-the-end is not, in and of itself, thread safe.

I'm not exactly sure what the implementation would look like, but I know that it really helps to keep threaded code simple and concise - so that would be a guiding principle.

stevenharman avatar Mar 15 '13 22:03 stevenharman

@stevenharman

There is a notification sender, and a notification error receiver. Both of which need to act upon a single socket.

In the read-after-write case, read, write, and close all happen on the primary thread, in the pusher. In the threaded case, reads happen on a secondary thread, and write/close happen on the primary thread.

In both cases, the pusher can only be used by a single thread.

Personally, I'm running this in EventMachine and, as expected, blocking on the write to Apple is not working for me.

kbrock avatar Mar 18 '13 14:03 kbrock

I should have been more clear - I meant that from the consumer side, this would make Grocer::Pusher thread-safe.

I am making the assumption that the send queue is thread-safe (which Ruby's Queue is), so you could get a Grocer.pusher instance and share that across threads. Then each client thread can safely push new Grocer::Notification objects via the Grocer::Pusher instance, which will just push them onto the queue.

As you've said, there is a single sender thread which only writes and closes the socket. So yes, we are only sending via one thread, but those writes are small and fast, and by using the single connection for all writes we save a huge amount on the overhead of opening/closing the sockets.

Does that make sense, or am I misunderstanding, confused, or don't-know-what-I'm-talking-about? (I would not be offended if you said the latter. :smile: )

stevenharman avatar Mar 18 '13 15:03 stevenharman

Regardless of how it comes out, a non-threaded option would be appreciated. Our use case for Grocer is via Sidekiq, so using its threads alongside my threads feels like a fragile house of cards. We instantiate a separate Pusher for each thread, so the overall thread-safety of the class isn't a concern.

jmoses avatar Mar 22 '13 15:03 jmoses

We instantiate a separate Pusher for each thread, so the overall thread-safety of the class isn't a concern.

Depending on how you're sending messages, you need to be careful with this, as Apple can consider many short-lived connections to be a DDoS and shut you down.

stevenharman avatar Mar 22 '13 15:03 stevenharman

Yeah, I saw that in some of the documentation and forums. The threads are long-lived, and the sockets don't look like they'll time out on their own (the pusher never closes them unexpectedly, from what I can see). We also limit concurrency on that particular Sidekiq instance to 15.

Edit: not meaning to hijack the issue.

jmoses avatar Mar 22 '13 15:03 jmoses

Completely reworked #43; needed higher throughput.

No longer using:

def push_batch notifications
  while notification = notifications.pop
    push(notification)
    if non_blocking_read
      notifications << notifications_to_resend
    end
  end
  blocking_read
end

Now using:

def push_and_check(notification)
  remember_notification(notification)
  push(notification)
  if non_blocking_read
    push(remembered_notifications) #possibly from previous calls to push_and_check
  end
end

kbrock avatar Mar 27 '13 04:03 kbrock

The problem is still here. Is there any workaround?

nusov avatar Sep 03 '13 10:09 nusov

Ran into this problem as well. For example, I used one connection to send a notification to a batch of 10 device tokens, and found that the message never reached most of them; but if I initialize a new connection for each send, it arrives. I think it's the same cause here. Any workarounds for this issue right now?

pzgz avatar Oct 28 '13 03:10 pzgz

@stevenharman is this issue fixed in Grocer or is it up to the client code to implement the producer/consumer solution?

rpc4mv avatar Jan 07 '14 02:01 rpc4mv

There hasn't been any change to Grocer - I've become quite busy with other projects, as has @vanstee. I still believe an internal producer-consumer is the way to go, but I won't have time to work on such a thing until I (one day) get back to using Push Notifications. However, I'd be happy to review, provide feedback, and merge such a change if you (or anyone else) wants to work on it.

Or I'd even be OK with going with one of the other approaches, perhaps the one @kbrock is using as he's been running with it, successfully, for some time.

stevenharman avatar Jan 07 '14 03:01 stevenharman