requests with pyOpenSSL is drastically slower than with stdlib's ssl module
Hello pyOpenSSL gurus!
We have a file transfer client (https://github.com/Azure/azure-data-lake-store-python) that relies on: cffi, oauthlib, requests, and requests-oauthlib.
When we run performance benchmarks on this client with just those packages installed, we see upload throughput of around 7-8 Gbps. When pyOpenSSL is installed (by a different package that depends on it), our upload throughput drops to no higher than 2 Gbps.
Based on the above info, I have a couple questions:
- How is pyOpenSSL impacting my package, which does not take any dependency on it, and how can I prevent it from having an impact, if I can (that would be ideal!)?
- Is this a known issue in pyOpenSSL when submitting a lot of PUT requests to a server in parallel? Is there something I can do in my package to have it work better with pyOpenSSL in this scenario?
Please let me know if you need any other information from me or if you have any quick fixes that I can try out and thank you all so much for your time! Ben
requests will use pyOpenSSL by default when it is present in an environment, so that's why you're seeing this. You can tell requests not to do that with requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3.
However, a drop in performance like that is very unexpected anyway. What version of pyOpenSSL is this occurring under?
Thanks for the quick reply @reaperhulk! We tried with versions 17.0.0, 16.2.0 and 16.1.0. Is there anything specific that I should be looking for that could be the culprit? For now, I can update our code to not rely on pyOpenSSL, but I assume that, overall, it would be beneficial over the stdlib ssl module, so we would like to be able to use it without the negative impact :).
The only custom thing we do with requests is this:
@property
def session(self):
    try:
        s = self.local.session
    except AttributeError:
        s = None
    if not s:
        adapter = requests.adapters.HTTPAdapter(
            pool_connections=MAX_POOL_CONNECTIONS,
            pool_maxsize=MAX_POOL_CONNECTIONS)
        s = requests.Session()
        s.mount(self.url, adapter)
        self.local.session = s
    return s
Well, those are all the latest versions (which is really all I was curious about). It would definitely be good to improve pyOpenSSL's performance here, but I'm not sure if anyone has the time to take on a project like this right now. @Lukasa do you have an opinion about this?
@reaperhulk one other question, since I am still fairly new to python: How/where would I use requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3 to ensure that it doesn't replace the ssl socket? Do I just import it and call it directly, or do I need to do something else?
If you import that and call it after importing requests it should be fine.
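For anyone else following along, here is a minimal sketch of that workaround (assuming a requests release that still vendors urllib3 under requests.packages):
import requests
from requests.packages.urllib3.contrib import pyopenssl

# Undo the monkey-patch so urllib3 goes back to the stdlib ssl module
# instead of the pyOpenSSL-backed socket implementation.
pyopenssl.extract_from_urllib3()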
Is this a CFFI thing?
Yeah, my question is the same as @hynek's: is this a CFFI thing?
PyOpenSSL is likely, in general, to do more data copying than the standard library does. It's possible that this gets a bit worse if the client is sending extremely large chunks of data, as a bunch of time might be spent spinning around in the PyOpenSSL send implementation, copying data into and out of buffers. I noticed this when I was adding bytearray support earlier: because of the compatibility with older cffi versions PyOpenSSL can be quite copy-happy.
@begoldsm It'd be really interesting to see if you can get a cProfile output for the two situations. I wonder where in PyOpenSSL/cffi we're spending most of our time.
@Lukasa sounds good! I have an initial profile run through the new Visual Studio profiler, but that didn't yield much obvious information (just that the code spent roughly the same amount of exclusive time in the two different SSL stacks, which is expected). I will get a cProfile run done and attach the output once I have it.
@Lukasa I have uploaded my two cProfile logs. I took a look and didn't see anything really obvious that popped out at me, but you can see the time difference between the two (same script run): 190 seconds to upload 100GB when pyOpenSSL is not installed and over 500 seconds when it is installed.
Cool, I'll chase this tomorrow morning. If anyone wants to take a swing before me, go for it.
@Lukasa and @reaperhulk one more question to try and make the workaround as complete as I can. On Ubuntu/Debian systems, if someone chooses to install our package outside of a venv (I know this is bad practice, but just in case :) ), do you have the equivalent "undo monkey-patch" code for that requests module? I believe it is called python-requests, and it separates urllib3 out from requests.
Thanks again for all the support guys!
@begoldsm urllib3.contrib.pyopenssl.extract_from_urllib3 should do it in that case. We can all shake a fist at the Debian project for their vendoring policy there...
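On those systems, the corresponding sketch (assuming the de-vendored urllib3 package is importable directly) would be:
import requests  # still triggers the injection if pyOpenSSL is installed
from urllib3.contrib import pyopenssl

# Same idea as before, but against the standalone urllib3 package.
pyopenssl.extract_from_urllib3()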
So the first thing to note is that, in both cases, the bulk of the time is spent in time.sleep. That's...surprising. The top of the "withPyOpenSSL" stack, sorted by "time spent in the function", looks like this:
ncalls tottime percall cumtime percall filename:lineno(function)
448/1 0.017 0.000 562.058 562.058 {built-in method builtins.exec}
1 0.000 0.000 562.058 562.058 begoldsmUpload100GB.py:1(<module>)
1 0.000 0.000 560.360 560.360 multithread.py:343(__init__)
1 0.003 0.003 558.875 558.875 multithread.py:455(run)
1 0.000 0.000 558.872 558.872 transfer.py:442(run)
1 0.000 0.000 556.658 556.658 transfer.py:510(monitor)
1 0.046 0.046 556.653 556.653 transfer.py:478(_wait)
5390 556.490 0.103 556.490 0.103 {built-in method time.sleep}
6 0.000 0.000 2.645 0.441 sessions.py:440(request)
Note that the vast, vast majority of your runtime is spent in time.sleep. This is true in the non-PyOpenSSL case, too:
ncalls tottime percall cumtime percall filename:lineno(function)
392/1 0.014 0.000 190.496 190.496 {built-in method builtins.exec}
1 0.000 0.000 190.496 190.496 begoldsmUpload100GB.py:1(<module>)
1 0.000 0.000 188.146 188.146 multithread.py:343(__init__)
1 0.000 0.000 186.574 186.574 transfer.py:442(run)
1 0.000 0.000 186.574 186.574 multithread.py:455(run)
1 0.000 0.000 180.715 180.715 transfer.py:510(monitor)
1 0.017 0.017 180.712 180.712 transfer.py:478(_wait)
1664 180.657 0.109 180.657 0.109 {built-in method time.sleep}
100 0.009 0.000 5.752 0.058 transfer.py:310(_start)
So I'd be interested to know what is sleeping: when we enter built-in method time.sleep, what does the call stack look like? I also don't know what the hell transfer.py or multithread.py are, which raises some interesting questions, as they're the only bits of code that can completely encompass the sleep calls.
@Lukasa good questions. The time.sleep calls are mostly happening in a polling function that is checking on the current state of the operation. The code in multithread.py is "parallelizing" requests to the web server, and transfer.py keeps track of the status of those requests in a polling function. The basic algorithm (roughly sketched in code after the list below) is this: given a large local file (say 100 GB):
- Partition it into N chunks, where N is the number of "threads" to use
- Upload those chunks to separate files on the server, sending data 4MB at a time for each chunk
- Once all chunks are uploaded, tell the server to concatenate them together into a single file
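Here is a rough, illustrative sketch of that flow; upload_chunk and concatenate are hypothetical stand-ins, not the actual azure-data-lake-store-python API:
import concurrent.futures
import os

CHUNK_THREADS = 16       # N "threads"/chunks (illustrative value)
BLOCK_SIZE = 4 * 2**20   # data is sent 4 MB at a time within each chunk

def upload_file(path, upload_chunk, concatenate):
    # Partition the file into N byte ranges, one per worker thread.
    size = os.path.getsize(path)
    step = max(1, -(-size // CHUNK_THREADS))  # ceiling division
    ranges = [(off, min(step, size - off)) for off in range(0, size, step)]

    # Upload each chunk to its own temporary remote file, 4 MB per PUT.
    with concurrent.futures.ThreadPoolExecutor(CHUNK_THREADS) as pool:
        futures = [pool.submit(upload_chunk, path, off, length, BLOCK_SIZE)
                   for off, length in ranges]
        remote_parts = [f.result() for f in futures]

    # Once all chunks are uploaded, ask the server to concatenate them.
    concatenate(remote_parts, target=os.path.basename(path))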
For reference, here is where the time.sleep is being called the most (we also have back-off retry logic if a request fails, but we would see more time spent in core.py if that were the case):
https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/transfer.py#L483
Ultimately multithread.py just defines the logic that transfer.py will use to execute the transfer (whether it is upload or download and any specific customization of how the transfer will take place).
The core request logic that sends data is here: https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/multithread.py#L486
That ultimately calls the HTTP PUT method for each small 4 MB piece of data in the chunk. There are a series of helper functions along the way, but they ultimately call this one with a PUT operation and a 4 MB buffer of data to send: https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/lib.py#L297
@Lukasa I am kind of bombarding you guys with a lot of solution-specific information. Is there any additional debug information (perhaps from cProfile or another tool) that I can gather for you? I am admittedly fairly new to Python, so I am not sure what information is best for debugging this, or how to go about obtaining it for you. I am definitely interested in helping get to the bottom of this, though, so anything you need, just let me know!
Thanks again!
@begoldsm So one real possibility about where this slowdown is coming from is contention for the GIL. Substantially more of the stdlib ssl module is written in C, where it does not hold the GIL, but that route is not taken by PyOpenSSL. Given that you're almost busy-looping in the monitoring thread, that may be causing us some problems: I wouldn't be at all surprised to find dramatically more contention on the GIL.
There isn't much else going on. The PyOpenSSL code spends about half as much time as the C code acquiring Python locks, so I don't think your queues are a concern. Both spend about the same amount of time dealing with SSL_read as each other, which suggests that the rest of the output is pretty unlikely to be helpful.
Want to try removing the progress monitor to see if the problem persists?
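If you want to keep some form of monitoring while testing, one option is to block on the worker futures instead of polling with time.sleep. A small sketch of that idea (wait_for_transfers, futures, and progress_cb are hypothetical names, not the library's actual API):
import concurrent.futures

def wait_for_transfers(futures, progress_cb=None):
    # Block on completion rather than spinning in a sleep-based poll loop;
    # the waiting thread stays idle (and off the GIL) until a future finishes.
    for future in concurrent.futures.as_completed(futures):
        future.result()  # surface any chunk failure immediately
        if progress_cb is not None:
            progress_cb(future)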
@Lukasa I just tried commenting out all the time.sleep and all of the _wait() method with no luck. I am generating a new set of cProfiles now to see if there is a difference.
@Lukasa I have data from the runs without the sleep, which is very strange. In both cases it claims it only executed for < 10 seconds, even though the actual run time (real time) was about the same as it was previously (190 without pySSL and 560 with pySSL). I have attached the raw data that can be manipulated with pstats, but I am confused as to why it would show such a small run number. Here is some sample output when using time to measure the command, as well as the output of pstats:
command: (begoldsmenv07) adlsperf@adlsperfd14x2:~/begoldsmenv07/azure-data-lake-store-python$ time python -m cProfile -o ~/profileNoSleepNoPySSL.txt -s Time ~/begoldsmUpload100GB.py
output (of time and pstats)
real 2m53.481s
user 11m30.976s
sys 2m44.064s
(begoldsmenv07) adlsperf@adlsperfd14x2:~/begoldsmenv07/azure-data-lake-store-python$ python
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pstats
>>> p = pstats.Stats('/home/adlsperf/profileNoSleepNoPySSL.txt')
>>> p.print_stats(1)
Thu May 11 21:10:08 2017 /home/adlsperf/profileNoSleepNoPySSL.txt
384171 function calls (375878 primitive calls) in 7.924 seconds
Random listing order was used
List reduced from 3114 to 1 due to restriction <1>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.002 0.002 /home/adlsperf/begoldsmenv07/lib/python3.5/site-packages/pkg_resources/__init__.py:950(subscribe)
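As an aside, "Random listing order was used" just means the stats weren't sorted when printed; a quick sketch of sorting the same stats file by internal time within pstats:
import pstats

p = pstats.Stats('/home/adlsperf/profileNoSleepNoPySSL.txt')
# Sort by internal time and show the ten most expensive functions.
p.sort_stats('tottime').print_stats(10)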
In both cases it claims it only executed for < 10 seconds, even though the actual run time (real time) was about the same as it was previously (190 without pySSL and 560 with pySSL). I have attached the raw data that can be manipulated with pstats, but I am confused as to why it would show such a small run number.
This is usually an indication that the time is being spent inside C functions that don't have useful Python antecedents in their stack. It might be useful to get a profile at the C level. I'm not 100% up-to-speed on how you'd do this with Linux: probably using perf, right?
Thanks @Lukasa! I am actually not too familiar with profiling on Linux myself (or profiling in general; this is all brand new to me :) ). For now, I am going to move forward with having this specific module ensure pyOpenSSL isn't used for the rapid-request data transfer logic while we continue to dive into what is going on at the deeper levels.
Thank you all again for helping me to understand what is going on here and giving me a quick work around!
I'm having this problem as well... Profiling indicates a huge amount of time spent in the SSL libs, specifically in SSL_read. This is on Ubuntu 14.04 with both Python 2.7.6 and 2.7.14. Using requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3 as a workaround fixes the speed problem, but then instead I get errors like
SSLError: [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
The original slowness looks like this:
15 0.000 0.000 0.000 0.000 {built-in method SSL_new}
24 44.685 1.862 44.685 1.862 {built-in method SSL_read}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_connect_state}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_fd}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_tlsext_host_name}
EDIT: Okay, so it looks like I was able to fix whatever the underlying problem was on Ubuntu 14.04 by (a) upgrading to Python 2.7.13 (not sure if this had any effect), but specifically (b) installing the 0.13.1 version of pyopenssl, as described here: https://stackoverflow.com/questions/31649390/python-requests-ssl-handshake-failure
(b)
[us-east-1]:~$ python --version
Python 2.7.14
[us-east-1]:~$ sudo pip2.7 install -U pyopenssl==0.13.1 pyasn1 ndg-httpsclient
Is this slowdown still present on the latest versions of pyOpenSSL?