hyper
Protocol error: Max outbound streams is 500, 500 open
Hello. We are using hyper to send push messages via the Apple Push Notification Service. Sometimes we get TooManyStreamsError: Max outbound streams is 500, 500 open when sending an HTTP/2 request.
The code we use is quite simple:
import json

from hyper import HTTP20Connection
from hyper.common.headers import HTTPHeaderMap
from hyper.tls import init_context


class APNS(object):

    def __init__(self, cert, key, dev=False, alt_port=False):
        self.server = 'api.development.push.apple.com' if dev \
            else 'api.push.apple.com'
        self.port = 2197 if alt_port else 443
        self._ssl_context = init_context(cert=(cert, key))
        self._conn = HTTP20Connection(
            host=self.server,
            port=self.port,
            force_proto='h2',
            ssl_context=self._ssl_context)

    def send(self, push_token, payload,
             id_=None, expiration=None, priority=None, topic=None):
        headers = HTTPHeaderMap()
        if id_:
            headers['apns-id'] = id_
        if expiration:
            headers['apns-expiration'] = expiration
        if priority:
            headers['apns-priority'] = priority
        if topic:
            headers['apns-topic'] = topic
        stream = self._conn.request(
            'POST',
            '/3/device/' + push_token,
            headers=headers,
            # PayloadEncoder is our custom JSON encoder, defined elsewhere
            body=json.dumps(payload, cls=PayloadEncoder)
        )
        return stream

    def feedback(self, stream_id):
        with self._conn.get_response(stream_id) as resp:
            if resp.status != 200:
                body = json.loads(resp.read())
            else:
                body = None
            return (
                resp.status,
                resp.headers.get('apns-id', [None])[0],
                body)
We use a gevent celery worker to send pushes with a maximum of 50 greenlets, so as I understand it there should never be more than 50 open streams at any moment, because we close the stream after getting the response.
Please explain what this error means and how we can work around it.
So there are two possibilities. One is that, while you might only allow 50 greenlets, the send greenlets might be starving out the feedback greenlets. This admittedly only happens if the size of the responses is > 64kB.
The other risk is that the non-200 errors aren't having their bodies consumed. Generally you should either read() the stream or reset it if you don't want it: otherwise, there's a risk of leaving the stream open.
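For illustration, a minimal sketch of that pattern, assuming hyper's HTTP20Response exposes read() and close() the same way your with-block usage relies on:

def consume_or_release(conn, stream_id):
    # Either consume the body or explicitly close the response, so the
    # underlying stream is never left open.
    resp = conn.get_response(stream_id)
    try:
        body = resp.read() if resp.status != 200 else None
    finally:
        resp.close()  # harmless after a full read(); releases the stream otherwise
    return resp.status, body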
However, at this point I think turning on logging is likely to be the most initially useful first pass. Do you mind enabling debug logging for me and seeing what the results produce?
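For reference, one way to turn that on using only the standard library (the handler and format details here are just an example):

import logging

# Send hyper's debug messages to the root handler.
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s:%(levelname)s:%(name)s:%(message)s')
logging.getLogger('hyper').setLevel(logging.DEBUG)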
(Side note: this is some extremely well-factored code, I approve highly!)
Thank you for the reply, and for hyper :)
So there are two possibilities. One is that, while you might only allow 50 greenlets, the send greenlets might be starving out the feedback greenlets. This admittedly only happens if the size of the responses is > 64kB.
Currently we execute send() and feedback() sequentially in a single celery task, so the number of sends and feedbacks should be equal.
However, at this point I think turning on logging is likely to be the most initially useful first pass. Do you mind enabling debug logging for me and seeing what the results produce?
I enabled debug logging for hyper and will post logs here once the error is reproduced.
(Side note: this is some extremely well-factored code, I approve highly!)
Thank you!
TooManyStreamsError has not been reproduced yet, but I have some debug logs. There are a lot of messages like this:
2016-08-03 13:05:00,770:DEBUG:hyper.http20.connection:recv for stream 0 with set([1, <many values from 1 to 16293> ]) already present
The last number (16293) and the "recv for stream" number (0) do not change over a large number of the most recent messages in the logs. Is this a normal situation?
I uploaded the whole log to Dropbox; it's quite big (500 KB).
https://www.dropbox.com/s/a9rtr511dthfxbc/pcelery_async-stdout---supervisor-eUkFtI.log.gz?dl=0
That's somewhat interesting, but not that important in this case I think.
Those messages appear if a frame has been received on a given stream, but that stream has not yet attempted to read its frames. It exists as a guard against the possibility of having a stream stuck behind a read lock.
However, if a stream never has to emit a read: that is, if all the data that it was expecting got pushed to it by other streams reading, then it will never get evicted from that set. Its mere presence in that set is not normally a very useful piece of debugging information. It looks like we have a bug where we don't evict members from that set if a stream is cleanly closed, which would be good to fix (see #265). However, it's unrelated to this problem.
The fact that this is not reproduced regularly suggests, however, that it's a somewhat intermittent problem. Can you try regularly logging out the value of len(HTTP20Connection.streams)? That will help me get a feel for how many streams we think are open at one time.
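For example, something along these lines, called periodically or after each push, would do (the logger name and message are just placeholders):

import logging

log = logging.getLogger('apns.debug')  # placeholder name

def log_open_streams(conn):
    # conn is the HTTP20Connection; .streams holds the streams the
    # connection currently considers open.
    log.debug('%d open streams', len(conn.streams))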
We enabled logging of the stream count, and currently the maximum we have seen is 12 streams:
2016-08-04 13:39:43,959:DEBUG:backend.push.push_apns2:Have 12 opened streams
It seems strange, because that is rather far from 500 :)
Is it possible that TooManyStreamsError shows an incorrect number for max_outbound_streams?
@binrush I doubt it. A quick glance at the relevant code doesn't turn up anything wrong.
I wonder if this was only encountered in a weird error condition scenario.
We decided to decrease the greenlet count to 10. Currently we have the following in the logs:
2016-08-07 00:54:07,361:ERROR:celery.worker.job.on_timeout:Hard time limit (900s) exceeded for backend.tasks.async.send_push[378a60a9-54aa-460b-be66-79c875ff6239]
2016-08-07 00:54:07,647:ERROR:backend.push.push_apns2.apns_send_msg:Too many opened streams, 168
2016-08-07 00:54:15,376:ERROR:celery.worker.job.on_timeout:Hard time limit (900s) exceeded for backend.tasks.async.send_push[0e46e1c8-65fb-4d40-83ad-7fc9cfb4f0ef]
2016-08-07 00:54:15,683:ERROR:backend.push.push_apns2.apns_send_msg:Too many opened streams, 169
So it looks like the number of streams is much greater than the number of greenlets.
Maybe execution hangs at some point, then celery kills the worker due to the timeout, and the stream is left open?
@binrush It's certainly possible. Do you have your "have opened streams" log for this as well?
@Lukasa here it is https://www.dropbox.com/s/ly57q0vfq8r5jtt/%D0%9D%D0%B0%D1%87%D0%B0%D0%BB%D0%BE%20%D1%80%D0%B0%D0%B1%D0%BE%D1%82%D1%8B%20%D1%81%20Dropbox.pdf?dl=0
But it seems there are no messages from hyper except for the one I mentioned above.
Uh, I don't think that link is right.
@Lukasa sorry, here is right one https://www.dropbox.com/s/4ibew32vllgqmsv/lastlog.tar.gz?dl=0
Yup, timeouts do seem to be the problem here: it does seem like celery is timing those requests out and the streams never end up closed. It would be interesting to see why some of these requests are taking 900s.
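To help narrow that down, a hedged sketch that times the two phases separately (the logger name and function are placeholders for your own task code; apns is the APNS wrapper from the original snippet):

import time
import logging

log = logging.getLogger('apns.timing')  # placeholder name

def timed_push(apns, token, payload, topic):
    t0 = time.time()
    stream_id = apns.send(token, payload, topic=topic)
    t1 = time.time()
    status, apns_id, body = apns.feedback(stream_id)
    t2 = time.time()
    log.debug('stream %s: send took %.1fs, feedback took %.1fs',
              stream_id, t1 - t0, t2 - t1)
    return status, apns_id, body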
BTW, in the celery task we have something like this:
stream_id = connection.send(token, pl, topic=topic)
status, apns_id, body = connection.feedback(stream_id)
So it seems like execution gets stuck on getting the feedback.
Do you have any suggestions on how we can debug it further?
@binrush Ideally it would be good to see a traceback from the point where celery killed the worker. Is that going to be available?
@Lukasa Currently, no. I'm trying to figure out how to get it and will write here if I have any success with it.
Unfortunately, I still have not figured out how to get a stack trace for the Hard time limit (900s) exceeded error, so I made a trace dump of all greenlets at the moment when the stream count became greater than the greenlet count. Most greenlets have the following trace:
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent
/greenlet.py", line 534, in run
result = self._run(*self.args, **self.kwargs)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/concurrency/gevent.py", line 33, in apply_timeout
propagate=(Timeout, ), **rest)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/concurrency/base.py", line 36, in apply_target
ret = target(*args, **kwargs)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 349, in _fast_trace_task
return _tasks[task].__trace__(uuid, args, kwargs, request)[0]
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/prince/proj/python/prince-backend/backend/api/api.py", line 256, in __call__
return TaskBase.__call__(self, *args, **kwargs)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/home/prince/proj/python/prince-backend/backend/tasks/async.py", line 36, in send_push
extra={'prince': msg.body})
File "/home/prince/proj/python/prince-backend/backend/push/push_apns2.py", line 86, in apns_send_msg
stream_id = connection.send(token, pl, topic=topic)
File "/home/prince/proj/python/prince-backend/backend/push/apyns.py", line 38, in send
body=json.dumps(payload, cls=PayloadEncoder)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 250, in request
with self._write_lock:
File "/usr/lib/python2.7/threading.py", line 174, in acquire
rc = self.__block.acquire(blocking)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 606, in switch
return greenlet.switch(self)
Full log attached pcelery_async-stdout---supervisor-fWMgHq.log.gz
@Lukasa do you have any suggestions on how we can deal with it?
So right now it seems that we're getting stuck on the write lock. This raises an important question: is it possible that a thread has died holding the lock? Or, worse, timed out? Can you try setting socket.setdefaulttimeout(30) and see if that changes anything?
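For clarity, that only needs to be done once, before the HTTP20Connection is created; it applies to sockets created afterwards:

import socket

# Sockets created after this call inherit a 30-second timeout, so a dead
# connection raises an error instead of blocking a greenlet indefinitely.
socket.setdefaulttimeout(30)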
@Lukasa with this setting, we are getting the following:
Traceback (most recent call last):
File "/home/prince/proj/python/prince-backend/backend/push/push_apns2.py", line 92, in apns_send_msg
status, apns_id, body = connection.feedback(stream_id)
File "/home/prince/proj/python/prince-backend/backend/push/apyns.py", line 43, in feedback
with self._conn.get_response(stream_id) as resp:
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 293, in get_response
return HTTP20Response(stream.getheaders(), stream)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/stream.py", line 223, in getheaders
self._recv_cb(stream_id=self.stream_id)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 744, in _recv_cb
self._single_read()
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/http20/connection.py", line 644, in _single_read
self._sock.fill()
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/hyper/common/bufsocket.py", line 167, in fill
count = self._sck.recv_into(self._buffer_view[self._buffer_end:])
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_sslgte279.py", line 470, in recv_into
return self.read(nbytes, buffer)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_sslgte279.py", line 314, in read
self._wait(self._read_event, timeout_exc=_SSLErrorReadTimeout)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/_socket2.py", line 179, in _wait
self.hub.wait(watcher)
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 627, in wait
result = waiter.get()
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 875, in get
return self.hub.switch()
File "/home/prince/.virtualenvs/prince/local/lib/python2.7/site-packages/gevent/hub.py", line 606, in switch
return greenlet.switch(self)
SSLError: ('The read operation timed out',)
Hrm. That traceback is not terribly helpful, really, though it suggests that we're expecting a HEADERS frame and not getting one within at least 30 seconds. Specifically, in fact, that the connection is dead for 30 seconds. Do you have or can you obtain a tcpdump that shows no receive action on the socket for that 30 second period?
@Lukasa here is a tcpdump for a single connection: dump.zip
Hrm. Sadly, nothing leaps out from that packet capture. =( This is beginning to get quite frustrating: we may need to start adding some kind of substantial logging to the codebase if we can't get a stacktrace from celery. =(
@Lukasa maybe we can fork hyper, add some debug messages, and use the forked version in our project?
That certainly works for me. In the first instance the most useful debug logs would be ones that dump out stream creation/destruction information (one log before this line dumping out the newly created stream ID, one in each branch of this except block noting which stream ID is being deleted and whether it has already been deleted), and then a few scattered around this somewhat complex method to see if there are bugs floating around in it.
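(If forking turns out to be awkward, a rough application-side alternative, sketched here with placeholder names and using only the public API, is to subclass the connection and log at the request/response boundary. It won't capture hyper's internal deletion branches, but it does show which stream IDs get created and how many streams the connection still tracks.)

import logging

from hyper import HTTP20Connection

log = logging.getLogger('apns.streams')  # placeholder name

class LoggingHTTP20Connection(HTTP20Connection):
    # Log stream IDs and the live stream count at the public API boundary.

    def request(self, method, url, body=None, headers=None):
        stream_id = super(LoggingHTTP20Connection, self).request(
            method, url, body=body, headers=headers)
        log.debug('opened stream %s, now tracking %d streams',
                  stream_id, len(self.streams))
        return stream_id

    def get_response(self, stream_id=None):
        try:
            return super(LoggingHTTP20Connection, self).get_response(stream_id)
        finally:
            log.debug('stream %s response returned, now tracking %d streams',
                      stream_id, len(self.streams))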
@Lukasa is this ok https://github.com/binrush/hyper/commit/987c7200fbaa0f750051e5d0c65c77906f5c0244 ?
Some notes left inline in that commit diff.
@Lukasa BTW, is it OK to install from the development branch? I have some problems installing:
rush@rush-laptop:~$ ~/venv/bin/pip install git+https://github.com/binrush/hyper.git@development
Collecting git+https://github.com/binrush/hyper.git@development
Cloning https://github.com/binrush/hyper.git (to development) to /tmp/pip-McAADa-build
Complete output from command python setup.py egg_info:
error in hyper setup command: Invalid environment marker: python_full_version < "2.7.9"
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-McAADa-build/
@binrush You probably need to update the pip in your virtualenv!
@Lukasa here is the log with the debug messages we added: pcelery_async-stdout---supervisor-fWMgHq.log.gz