requests with pyOpenSSL is drastically slower than with stdlib's ssl module
Hello pyOpenSSL gurus!
We have a file transfer client (https://github.com/Azure/azure-data-lake-store-python) that relies on: cffi, oauthlib, requests, and requests-oauthlib.
When we run performance benchmarks on this client with just those packages installed, we see upload throughput of around 7-8 Gbps. When pyOpenSSL is installed (by a different package that depends on it), our upload throughput drops to no higher than 2 Gbps.
Based on the above info, I have a couple questions:
- How is pyOpenSSL impacting my package, which does not take any dependency on it, and how can I prevent it from having an impact, if I can (that would be ideal!)?
- Is this a known issue in pyOpenSSL when submitting a lot of PUT requests to a server in parallel? Is there something I can do in my package to have it work better with pyOpenSSL in this scenario?
Please let me know if you need any other information from me or if you have any quick fixes that I can try out and thank you all so much for your time! Ben
requests will use pyOpenSSL by default when it is present in an environment, so that's why you're seeing this. You can tell requests not to do that with requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3.
However, a drop in performance like that is very unexpected anyway. What version of pyOpenSSL is this occurring under?
Thanks for the quick reply @reaperhulk! We tried with versions 17.0.0, 16.2.0 and 16.1.0. Is there anything specific that I should be looking for that could be the culprit? For now, I can update our code to not rely on pyOpenSSL, but I assume that, overall, it would be beneficial over the stdlib ssl module, so we would like to be able to use it without the negative impact :).
The only custom thing we do with requests is this:
@property
def session(self):
    try:
        s = self.local.session
    except AttributeError:
        s = None
    if not s:
        adapter = requests.adapters.HTTPAdapter(
            pool_connections=MAX_POOL_CONNECTIONS,
            pool_maxsize=MAX_POOL_CONNECTIONS)
        s = requests.Session()
        s.mount(self.url, adapter)
        self.local.session = s
    return s
Well, those are all the latest versions (which is really all I was curious about). It would definitely be good to improve pyOpenSSL's performance here, but I'm not sure if anyone has the time to take on a project like this right now. @Lukasa do you have an opinion about this?
@reaperhulk one other question, since I am still fairly new to python: How/where would I use requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3 to ensure that it doesn't replace the ssl socket? Do I just import it and call it directly, or do I need to do something else?
If you import that and call it after importing requests it should be fine.
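For anyone else following along, here is a minimal sketch of that workaround (assuming a requests release that still vendors urllib3 under requests.packages):
import requests
from requests.packages.urllib3.contrib import pyopenssl

# Undo the monkey-patch so urllib3 goes back to the stdlib ssl module
# instead of the pyOpenSSL-backed socket implementation.
pyopenssl.extract_from_urllib3()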
Is this a CFFI thing?
Yeah, my question is the same as @hynek's: is this a CFFI thing?
PyOpenSSL is likely, in general, to do more data copying than the standard library does. It's possible that this gets a bit worse if the client is sending extremely large chunks of data, as a bunch of time might be spent spinning around in the PyOpenSSL send implementation, copying data into and out of buffers. I noticed this when I was adding bytearray support earlier: because of the compatibility with older cffi versions PyOpenSSL can be quite copy-happy.
@begoldsm It'd be really interesting to see if you can get a cProfile output for the two situations. I wonder where in PyOpenSSL/cffi we're spending most of our time.
@Lukasa sounds good! I have an initial profile run through the new Visual Studio profiler, but that didn't yield much obvious information (just that the code spent roughly the same amount of exclusive time in the two different SSL stacks, which is expected). I will get a cProfile run done and attach the output once I have it.
@Lukasa I have uploaded my two cProfile logs. I took a look and didn't see anything really obvious that popped out at me, but you can see the time difference between the two (same script run): 190 seconds to upload 100GB when pyOpenSSL is not installed and over 500 seconds when it is installed.
Cool, I'll chase this tomorrow morning. If anyone wants to take a swing before me, go for it.
@Lukasa and @reaperhulk one more question to try and make the workaround as complete as I can. On Ubuntu/Debian systems, if someone chooses to install our package outside of a venv (I know this is bad practice, but just in case :) ), do you have the equivalent "undo monkey-patch" code for that requests module? I believe it is called python-requests, and it separates urllib3 out from requests.
Thanks again for all the support guys!
@begoldsm urllib3.contrib.pyopenssl.extract_from_urllib3 should do it in that case. We can all shake a fist at the Debian project for their vendoring policy there...
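On those systems, the corresponding sketch (assuming the de-vendored urllib3 package is importable directly) would be:
import requests  # still triggers the injection if pyOpenSSL is installed
from urllib3.contrib import pyopenssl

# Same idea as before, but against the standalone urllib3 package.
pyopenssl.extract_from_urllib3()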
So the first thing to note is that, in both cases, the bulk of the time is spent in time.sleep. That's...surprising. The top of the "withPyOpenSSL" stack, sorted by "time spent in the function", looks like this:
ncalls tottime percall cumtime percall filename:lineno(function)
448/1 0.017 0.000 562.058 562.058 {built-in method builtins.exec}
1 0.000 0.000 562.058 562.058 begoldsmUpload100GB.py:1(<module>)
1 0.000 0.000 560.360 560.360 multithread.py:343(__init__)
1 0.003 0.003 558.875 558.875 multithread.py:455(run)
1 0.000 0.000 558.872 558.872 transfer.py:442(run)
1 0.000 0.000 556.658 556.658 transfer.py:510(monitor)
1 0.046 0.046 556.653 556.653 transfer.py:478(_wait)
5390 556.490 0.103 556.490 0.103 {built-in method time.sleep}
6 0.000 0.000 2.645 0.441 sessions.py:440(request)
Note that the vast, vast majority of your runtime is spent in time.sleep. This is true in the non-PyOpenSSL case, too:
ncalls tottime percall cumtime percall filename:lineno(function)
392/1 0.014 0.000 190.496 190.496 {built-in method builtins.exec}
1 0.000 0.000 190.496 190.496 begoldsmUpload100GB.py:1(<module>)
1 0.000 0.000 188.146 188.146 multithread.py:343(__init__)
1 0.000 0.000 186.574 186.574 transfer.py:442(run)
1 0.000 0.000 186.574 186.574 multithread.py:455(run)
1 0.000 0.000 180.715 180.715 transfer.py:510(monitor)
1 0.017 0.017 180.712 180.712 transfer.py:478(_wait)
1664 180.657 0.109 180.657 0.109 {built-in method time.sleep}
100 0.009 0.000 5.752 0.058 transfer.py:310(_start)
So I'd be interested to know what is sleeping: when we enter built-in method time.sleep, what does the call stack look like? I also don't know what the hell transfer.py or multithread.py are, which raises some interesting questions, as they're the only bits of code that can completely encompass the sleep calls.
@Lukasa good questions. The time.sleep calls are mostly happening in a polling function that is checking on the current state of the operation. The code in multithread.py is "parallelizing" requests to the web server, and transfer.py keeps track of the status of those requests in a polling function. The basic algorithm (roughly sketched in code after the list below) is this: given a large local file (say 100 GB):
- Partition it into N chunks, where N is the number of "threads" to use
- Upload those chunks to separate files on the server, sending data 4MB at a time for each chunk
- Once all chunks are uploaded, tell the server to concatenate them together into a single file
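Here is a rough, illustrative sketch of that flow; upload_chunk and concatenate are hypothetical stand-ins, not the actual azure-data-lake-store-python API:
import concurrent.futures
import os

CHUNK_THREADS = 16       # N "threads"/chunks (illustrative value)
BLOCK_SIZE = 4 * 2**20   # data is sent 4 MB at a time within each chunk

def upload_file(path, upload_chunk, concatenate):
    # Partition the file into N byte ranges, one per worker thread.
    size = os.path.getsize(path)
    step = max(1, -(-size // CHUNK_THREADS))  # ceiling division
    ranges = [(off, min(step, size - off)) for off in range(0, size, step)]

    # Upload each chunk to its own temporary remote file, 4 MB per PUT.
    with concurrent.futures.ThreadPoolExecutor(CHUNK_THREADS) as pool:
        futures = [pool.submit(upload_chunk, path, off, length, BLOCK_SIZE)
                   for off, length in ranges]
        remote_parts = [f.result() for f in futures]

    # Once all chunks are uploaded, ask the server to concatenate them.
    concatenate(remote_parts, target=os.path.basename(path))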
For reference, here is where the time.sleep is being called the most (we also have back-off retry logic if a request fails, but we would see more time spent in core.py if that were the case):
https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/transfer.py#L483
Ultimately multithread.py just defines the logic that transfer.py will use to execute the transfer (whether it is upload or download and any specific customization of how the transfer will take place).
The core request logic that sends data is here: https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/multithread.py#L486
That ultimately calls the HTTP PUT method for each small 4 MB piece of data in the chunk. There are a series of helper functions along the way, but they ultimately call this one with a PUT operation and a 4 MB buffer of data to send: https://github.com/Azure/azure-data-lake-store-python/blob/master/azure/datalake/store/lib.py#L297
@Lukasa I am kind of bombarding you guys with a lot of solution-specific information. Is there any additional debug information (perhaps from cProfile or another tool) that I can gather for you? I am admittedly fairly new to Python, so I am not sure what information is best for debugging this, or how to go about obtaining it for you. I am definitely interested in helping get to the bottom of this, though, so anything you need, just let me know!
Thanks again!
@begoldsm So one real possibility about where this slowdown is coming from is contention for the GIL. Substantially more of the stdlib ssl module is written in C, where it does not hold the GIL, but that route is not taken by PyOpenSSL. Given that you're almost busy-looping in the monitoring thread, that may be causing us some problems: I wouldn't be at all surprised to find dramatically more contention on the GIL.
There isn't much else going on. The PyOpenSSL code spends about half as much time as the C code acquiring Python locks, so I don't think your queues are a concern. Both spend about the same amount of time dealing with SSL_read as each other, which suggests that the rest of the output is pretty unlikely to be helpful.
Want to try removing the progress monitor to see if the problem persists?
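If you want to keep some form of monitoring while testing, one option is to block on the worker futures instead of polling with time.sleep. A small sketch of that idea (wait_for_transfers, futures, and progress_cb are hypothetical names, not the library's actual API):
import concurrent.futures

def wait_for_transfers(futures, progress_cb=None):
    # Block on completion rather than spinning in a sleep-based poll loop;
    # the waiting thread stays idle (and off the GIL) until a future finishes.
    for future in concurrent.futures.as_completed(futures):
        future.result()  # surface any chunk failure immediately
        if progress_cb is not None:
            progress_cb(future)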
@Lukasa I just tried commenting out all the time.sleep and all of the _wait() method with no luck. I am generating a new set of cProfiles now to see if there is a difference.
@Lukasa I have data from the runs without the sleep, which is very strange. In both cases it claims it only executed for < 10 seconds, even though the actual run time (real time) was about the same as it was previously (190 without pySSL and 560 with pySSL). I have attached the raw data that can be manipulated with pstats, but I am confused as to why it would show such a small run number. Here is some sample output when using time to measure the command, as well as the output of pstats:
command: (begoldsmenv07) adlsperf@adlsperfd14x2:~/begoldsmenv07/azure-data-lake-store-python$ time python -m cProfile -o ~/profileNoSleepNoPySSL.txt -s Time ~/begoldsmUpload100GB.py
output (of time and pstats)
real 2m53.481s
user 11m30.976s
sys 2m44.064s
(begoldsmenv07) adlsperf@adlsperfd14x2:~/begoldsmenv07/azure-data-lake-store-python$ python
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pstats
>>> p = pstats.Stats('/home/adlsperf/profileNoSleepNoPySSL.txt')
>>> p.print_stats(1)
Thu May 11 21:10:08 2017 /home/adlsperf/profileNoSleepNoPySSL.txt
384171 function calls (375878 primitive calls) in 7.924 seconds
Random listing order was used
List reduced from 3114 to 1 due to restriction <1>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.002 0.002 /home/adlsperf/begoldsmenv07/lib/python3.5/site-packages/pkg_resources/__init__.py:950(subscribe)
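As an aside, "Random listing order was used" just means the stats weren't sorted when printed; a quick sketch of sorting the same stats file by internal time within pstats:
import pstats

p = pstats.Stats('/home/adlsperf/profileNoSleepNoPySSL.txt')
# Sort by internal time and show the ten most expensive functions.
p.sort_stats('tottime').print_stats(10)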
In both cases it claims it only executed for < 10 seconds, even though the actual run time (real time) was about the same as it was previously (190 without pySSL and 560 with pySSL). I have attached the raw data that can be manipulated with pstats, but I am confused as to why it would show such a small run number.
This is usually an indication that the time is being spent inside C functions that don't have useful Python antecedents in their stack. It might be useful to get a profile at the C level. I'm not 100% up-to-speed on how you'd do this with Linux: probably using perf, right?
Thanks @Lukasa! I am actually not too familiar with profiling on Linux myself (or profiling in general; this is all brand new to me :) ). For now, I am going to move forward with having this specific module ensure pyOpenSSL isn't used for the rapid-request data transfer logic while we continue to dive into what is going on at the deeper levels.
Thank you all again for helping me to understand what is going on here and giving me a quick work around!
I'm having this problem as well... Profiling indicates a huge amount of time spent in the SSL libs, specifically in SSL_read. This is on Ubuntu 14.04 with both Python 2.7.6 and 2.7.14. Using requests.packages.urllib3.contrib.pyopenssl.extract_from_urllib3 as a workaround fixes the speed problem, but then instead I get errors like
SSLError: [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure
The original slowness looks like this:
15 0.000 0.000 0.000 0.000 {built-in method SSL_new}
24 44.685 1.862 44.685 1.862 {built-in method SSL_read}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_connect_state}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_fd}
15 0.000 0.000 0.000 0.000 {built-in method SSL_set_tlsext_host_name}
EDIT: Okay, so it looks like I was able to fix whatever the underlying problem was on Ubuntu 14.04 by (a) upgrading to Python 2.7.13 (not sure if this had any effect), but specifically (b) installing the 0.13.1 version of pyopenssl, as described here: https://stackoverflow.com/questions/31649390/python-requests-ssl-handshake-failure
(b)
[us-east-1]:~$ python --version
Python 2.7.14
[us-east-1]:~$ sudo pip2.7 install -U pyopenssl==0.13.1 pyasn1 ndg-httpsclient
Is this slowdown still present on the latest versions of pyOpenSSL?