
Protocol error: Max outbound streams is 500, 500 open

Open binrush opened this issue 9 years ago • 67 comments

Hello. We are using hyper to send push messages via the Apple Push Notification Service. Sometimes we get TooManyStreamsError: Max outbound streams is 500, 500 open when sending an HTTP/2 request.

The code we use is quite simple:

import json

from hyper import HTTP20Connection
from hyper.common.headers import HTTPHeaderMap
from hyper.tls import init_context


class APNS(object):
    def __init__(self, cert, key, dev=False, alt_port=False):
        self.server = 'api.development.push.apple.com' if dev\
            else 'api.push.apple.com'
        self.port = 2197 if alt_port else 443

        self._ssl_context = init_context(cert=(cert, key))
        self._conn = HTTP20Connection(
            host=self.server,
            port=self.port,
            force_proto='h2',
            ssl_context=self._ssl_context)

    def send(self, push_token, payload,
             id_=None, expiration=None, priority=None, topic=None):
        headers = HTTPHeaderMap()
        if id_:
            headers['apns-id'] = id_
        if expiration:
            headers['apns-expiration'] = expiration
        if priority:
            headers['apns-priority'] = priority
        if topic:
            headers['apns-topic'] = topic

        stream = self._conn.request(
            'POST',
            '/3/device/' + push_token,
            headers=headers,
            body=json.dumps(payload, cls=PayloadEncoder)
        )
        return stream

    def feedback(self, stream_id):
        with self._conn.get_response(stream_id) as resp:
            if resp.status != 200:
                body = json.loads(resp.read())
            else:
                body = None
            return (
                resp.status,
                resp.headers.get('apns-id', [None])[0],
                body)

We use a gevent celery worker with a maximum of 50 greenlets to send pushes, so as I understand it, there should never be more than 50 open streams at any moment, because we close the stream after getting the response.

Could you please explain what this error means and how we can work around it?

binrush avatar Jul 31 '16 15:07 binrush

So there are two possibilities. One is that, while you might only allow 50 greenlets, the send greenlets might be starving out the feedback greenlets. This admittedly only happens if the size of the responses is > 64kB.

The other risk is that the non-200 errors aren't having their bodies consumed. Generally you should either read() the stream or reset it if you don't want it: otherwise, there's a risk of leaving the stream open.
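(For illustration only, here is a minimal sketch of a feedback() variant that always consumes the body before closing the stream; it assumes hyper's HTTP20Response read() and close() methods and reuses the names from the code above:)

    def feedback(self, stream_id):
        resp = self._conn.get_response(stream_id)
        try:
            # Reading the body, even when we discard it, lets hyper retire the stream.
            raw = resp.read()
            body = json.loads(raw) if resp.status != 200 else None
        finally:
            # close() is safe to call even if the stream is already closed.
            resp.close()
        return (
            resp.status,
            resp.headers.get('apns-id', [None])[0],
            body)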

However, at this point I think turning on logging is likely to be the most useful first pass. Do you mind enabling debug logging for me and seeing what the results produce?

Lukasa avatar Jul 31 '16 17:07 Lukasa

(Side note: this is some extremely well-factored code, I approve highly!)

Lukasa avatar Jul 31 '16 17:07 Lukasa

Thank you for the reply, and for hyper :)

So there are two possibilities. One is that, while you might only allow 50 greenlets, the send greenlets might be starving out the feedback greenlets. This admittedly only happens if the size of the responses is > 64kB.

Currently we execute send() and feedback() sequentially within a single celery task, so the number of sends and feedbacks should be equal.

However, at this point I think turning on logging is likely to be the most initially useful first pass. Do you mind enabling debug logging for me and seeing what the results produce?

I have enabled debug logging for hyper and will post the logs here once the error is reproduced.

(Side note: this is some extremely well-factored code, I approve highly!)

Thank you!

binrush avatar Aug 01 '16 05:08 binrush

TooManyStreamsError has not been reproduced yet, but I have some debug logs. There are a lot of messages like this:

2016-08-03 13:05:00,770:DEBUG:hyper.http20.connection:recv for stream 0 with set([1, <many values from 1 to 16293> ])  already present

The last number (16293) and the "recv for stream" number (0) do not change across the last several messages in the log. Is this a normal situation?

I uploaded the whole log to Dropbox; it's quite big (500 KB).

https://www.dropbox.com/s/a9rtr511dthfxbc/pcelery_async-stdout---supervisor-eUkFtI.log.gz?dl=0

binrush avatar Aug 04 '16 03:08 binrush

That's somewhat interesting, but not that important in this case I think.

Those messages appear when a frame has been received for a given stream but that stream has not yet attempted to read its frames. The set exists as a guard against the possibility of having a stream stuck behind a read lock.

However, if a stream never has to issue a read itself, that is, if all the data it was expecting got pushed to it by other streams reading, then it will never get evicted from that set. Its mere presence in that set is therefore not normally a very useful piece of debugging information. It looks like we have a bug where we don't evict members from that set when a stream is cleanly closed, which would be good to fix (see #265). However, it's unrelated to this problem.

The fact that this is not reproduced regularly suggests, however, that it's a somewhat intermittent problem. Can you try regularly logging out the value of len(HTTP20Connection.streams)? That will help me get a feel for how many streams we think are open at one time.
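(For reference, a minimal sketch of that periodic logging, assuming the APNS wrapper shown earlier and that its hyper connection is reachable as _conn; the helper name is made up, and the logger name and message simply mirror the log lines quoted below:)

    import logging

    log = logging.getLogger('backend.push.push_apns2')

    def log_open_streams(apns):
        # HTTP20Connection keeps its currently live streams in the `streams` dict.
        log.debug('Have %d opened streams', len(apns._conn.streams))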

Lukasa avatar Aug 04 '16 10:08 Lukasa

We enabled logging of the stream count, and currently the maximum we have seen is 12 streams:

2016-08-04 13:39:43,959:DEBUG:backend.push.push_apns2:Have 12 opened streams

This seems strange, because it is rather far from 500 :)

Is it possible that TooManyStreamsError reports an incorrect number for max_outbound_streams?

binrush avatar Aug 05 '16 08:08 binrush

@binrush I doubt it. A quick glance at the relevant code doesn't reveal anything wrong.

I wonder if this was only encountered in some weird error condition.

Lukasa avatar Aug 05 '16 08:08 Lukasa

We decided to decrease the greenlet count to 10. Currently we have the following in the logs:

2016-08-07 00:54:07,361:ERROR:celery.worker.job.on_timeout:Hard time limit (900s) exceeded for backend.tasks.async.send_push[378a60a9-54aa-460b-be66-79c875ff6239]
2016-08-07 00:54:07,647:ERROR:backend.push.push_apns2.apns_send_msg:Too many opened streams, 168
2016-08-07 00:54:15,376:ERROR:celery.worker.job.on_timeout:Hard time limit (900s) exceeded for backend.tasks.async.send_push[0e46e1c8-65fb-4d40-83ad-7fc9cfb4f0ef]
2016-08-07 00:54:15,683:ERROR:backend.push.push_apns2.apns_send_msg:Too many opened streams, 169

So it looks like the number of streams is much greater than the number of greenlets.

Maybe execution hangs at some point, then celery kills the worker due to the timeout, and the stream is left open?

binrush avatar Aug 07 '16 04:08 binrush

@binrush It's certainly possible. Do you have your "have opened streams" log for this as well?

Lukasa avatar Aug 07 '16 08:08 Lukasa

@Lukasa here it is https://www.dropbox.com/s/ly57q0vfq8r5jtt/%D0%9D%D0%B0%D1%87%D0%B0%D0%BB%D0%BE%20%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B%20%D1%81%20Dropbox.pdf?dl=0

But it seems there are no messages from hyper except the one I mentioned above.

binrush avatar Aug 07 '16 08:08 binrush

Uh, I don't think that link is right.

Lukasa avatar Aug 07 '16 09:08 Lukasa

@Lukasa sorry, here is right one https://www.dropbox.com/s/4ibew32vllgqmsv/lastlog.tar.gz?dl=0

binrush avatar Aug 07 '16 09:08 binrush

Yup, timeouts do seem to be the problem here: it does seem like celery is timing those requests out and they never end up closed. It would be interesting to see why some of these requests are taking 900s.

Lukasa avatar Aug 07 '16 09:08 Lukasa

BTW, in the celery task we have something like this:

            stream_id = connection.send(token, pl, topic=topic)
            status, apns_id, body = connection.feedback(stream_id)

So it seems like execution gets stuck on getting the feedback.

Do you have any suggestions on how we can further debug it?

binrush avatar Aug 08 '16 03:08 binrush

@binrush Ideally it would be good to see a traceback from the point where celery killed the worker. Is that going to be available?

Lukasa avatar Aug 08 '16 08:08 Lukasa

@Lukasa currently, no. I'm trying to figure out how to get it and will write here if I have any success.

binrush avatar Aug 08 '16 08:08 binrush

Unfortunately, I still have not figured out how to get a stack trace for the Hard time limit (900s) exceeded error, so I made a trace dump of all greenlets at the moment the stream count became greater than the greenlet count. Most greenlets have the following trace:

  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent
/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/concurrency/gevent.py", line 33, in apply_timeout
    propagate=(Timeout, ), **rest)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/concurrency/base.py", line 36, in apply_target
    ret = target(*args, **kwargs)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 349, in _fast_trace_task
    return _tasks[task].__trace__(uuid, args, kwargs, request)[0]
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/prince/proj/python/prince-backend/backend/api/api.py", line 256, in __call__
    return TaskBase.__call__(self, *args, **kwargs)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/prince/proj/python/prince-backend/backend/tasks/async.py", line 36, in send_push
    extra={'prince': msg.body})
  File "/home/prince/proj/python/prince-backend/backend/push/push_apns2.py", line 86, in apns_send_msg
    stream_id = connection.send(token, pl, topic=topic)
  File "/home/prince/proj/python/prince-backend/backend/push/apyns.py", line 38, in send
    body=json.dumps(payload, cls=PayloadEncoder)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 250, in request
    with self._write_lock:
  File "/usr/lib/python2.7/threading.py", line 174, in acquire
    rc = self.__block.acquire(blocking)
  File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 606, in switch
    return greenlet.switch(self)

Full log attached pcelery_async-stdout---supervisor-fWMgHq.log.gz
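(Not part of the original report: one way to produce a dump like this is to walk the live greenlet objects; a minimal sketch, assuming the greenlet package is importable:)

    import gc
    import traceback

    import greenlet

    def dump_greenlet_stacks():
        # Print the current stack of every live, running greenlet.
        for obj in gc.get_objects():
            if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
                print('--- %r ---' % obj)
                traceback.print_stack(obj.gr_frame)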

binrush avatar Aug 09 '16 14:08 binrush

@Lukasa do you have any suggestions on how we can deal with it?

binrush avatar Aug 12 '16 07:08 binrush

So right now it seems that we're getting stuck on the write lock. This raises an important question: is it possible that a thread has died holding the lock, or, worse, timed out? Can you try setting socket.setdefaulttimeout(30) and see if that changes anything?
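(For reference, a minimal sketch of applying that suggestion; the assumption is that the default timeout is set before the HTTP20Connection is created, so the socket hyper opens inherits it:)

    import socket

    # Every socket created after this call gets a 30-second timeout,
    # including the one hyper opens for the APNs connection.
    socket.setdefaulttimeout(30)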

Lukasa avatar Aug 12 '16 07:08 Lukasa

@Lukasa with this setting, we are getting the following:

Traceback (most recent call last):
 File "/home/prince/proj/python/prince-backend/backend/push/push_apns2.py", line 92, in apns_send_msg
   status, apns_id, body = connection.feedback(stream_id)
 File "/home/prince/proj/python/prince-backend/backend/push/apyns.py", line 43, in feedback
   with self._conn.get_response(stream_id) as resp:
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 293, in get_response
   return HTTP20Response(stream.getheaders(), stream)
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/stream.py", line 223, in getheaders
   self._recv_cb(stream_id=self.stream_id)
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 744, in _recv_cb
   self._single_read()
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 644, in _single_read
   self._sock.fill()
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/common/bufsocket.py", line 167, in fill
   count = self._sck.recv_into(self._buffer_view[self._buffer_end:])
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_sslgte279.py", line 470, in recv_into
   return self.read(nbytes, buffer)
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_sslgte279.py", line 314, in read
   self._wait(self._read_event, timeout_exc=_SSLErrorReadTimeout)
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_socket2.py", line 179, in _wait
   self.hub.wait(watcher)
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 627, in wait
   result = waiter.get()
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 875, in get
   return self.hub.switch()
 File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 606, in switch
   return greenlet.switch(self)
SSLError: ('The read operation timed out',)

binrush avatar Aug 12 '16 09:08 binrush

Hrm. That traceback is not terribly helpful, really, though it suggests that we're expecting a HEADERS frame and not getting one within at least 30 seconds; more specifically, that the connection is dead for 30 seconds. Do you have, or can you obtain, a tcpdump that shows no receive activity on the socket for that 30-second period?

Lukasa avatar Aug 12 '16 09:08 Lukasa

@Lukasa here is a tcpdump for a single connection: dump.zip

binrush avatar Aug 15 '16 04:08 binrush

Hrm. Sadly, nothing leaps out from that packet capture. =( This is beginning to get quite frustrating: we may need to start adding some kind of substantial logging to the codebase if we can't get a stacktrace from celery. =(

Lukasa avatar Aug 15 '16 08:08 Lukasa

@Lukasa maybe we can fork hyper, add some debug messages, and use the forked version in our project?

binrush avatar Aug 15 '16 08:08 binrush

That certainly works for me. In the first instance, the most useful debug logs would be ones that dump out stream creation/destruction information (one log before this line dumping out the newly created stream ID, and one in each branch of this except block noting which stream ID is getting deleted and whether it has already been deleted), plus a few scattered around this somewhat complex method to see if there are bugs floating around in it.
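(For illustration, one non-invasive way to get the stream-creation half of that logging is a subclass like the sketch below; the class and logger names are made up, and logging stream deletion would still require patching hyper's internal cleanup code, as described above:)

    import logging

    from hyper import HTTP20Connection

    log = logging.getLogger('hyper.debug')

    class LoggingHTTP20Connection(HTTP20Connection):
        # Hypothetical wrapper: log each newly created stream and the open-stream count.
        def request(self, method, url, body=None, headers=None):
            stream_id = super(LoggingHTTP20Connection, self).request(
                method, url, body=body, headers=headers)
            log.debug('Created stream %d (%d streams open)',
                      stream_id, len(self.streams))
            return stream_id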

Lukasa avatar Aug 15 '16 08:08 Lukasa

@Lukasa is this ok https://github.com/binrush/hyper/commit/987c7200fbaa0f750051e5d0c65c77906f5c0244 ?

binrush avatar Aug 15 '16 17:08 binrush

I've left some notes inline in that commit diff.

Lukasa avatar Aug 16 '16 07:08 Lukasa

@Lukasa BTW, is it OK to install from the development branch? I'm having some problems installing:

rush@rush-laptop:~$ ~/venv/bin/pip install git+https://github.com/binrush/hyper.git@development
Collecting git+https://github.com/binrush/hyper.git@development
  Cloning https://github.com/binrush/hyper.git (to development) to /tmp/pip-McAADa-build
    Complete output from command python setup.py egg_info:
    error in hyper setup command: Invalid environment marker: python_full_version < "2.7.9"

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-McAADa-build/

binrush avatar Aug 17 '16 03:08 binrush

@binrush You probably need to update the pip in your virtualenv!

Lukasa avatar Aug 17 '16 07:08 Lukasa

@Lukasa here is the log with the debug messages we added: pcelery_async-stdout---supervisor-fWMgHq.log.gz

binrush avatar Aug 18 '16 17:08 binrush