
Make chardet/charset_normalizer optional?

Open akx opened this issue 2 years ago • 23 comments

With a routine version bump of requirements, I noticed chardet had been switched out for charset_normalizer (which I had never heard of before) in #5797, apparently due to LGPL license concerns.

I agree with @sigmavirus24's comment https://github.com/psf/requests/pull/5797#issuecomment-875158955 that it's strange for something as central in the Python ecosystem as requests (45k stars, 8k forks, many contributors at the time of writing) to take a hard dependency on such a relatively unknown and unproven library (132 stars, 5 forks, 2 contributors).

The release notes say you can use pip install "requests[use_chardet_on_py3]" to use chardet instead of charset_normalizer, but with that extra, both libraries get installed.

I would imagine many users don't actually need the charset detection features in Requests; could we open a discussion on making both chardet/charset_normalizer optional, à la requests[chardet] or requests[charset_normalizer]?

AFAICS, the only place where chardet is actually used in requests is Response.apparent_encoding, which is used by Response.text when there is no determined encoding.

Maybe apparent_encoding could try to

  1. as a built-in first attempt, try decoding the content as UTF-8 (which would likely succeed in many cases)
  2. if neither chardet nor charset_normalizer is installed, warn the user ("No encoding detection library is installed. Falling back to XXXX. Please see YYYY for instructions" or some such) and return e.g. ascii
  3. otherwise, use whichever detection library is installed, as usual
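The proposed fallback above could be sketched roughly as follows. This is only an illustration of the idea, not requests' actual implementation; the function name and the ascii fallback mirror the steps listed in this comment.

```python
import warnings

def apparent_encoding(content: bytes) -> str:
    """Sketch of the proposed fallback; not requests' actual code."""
    # Step 1: built-in first attempt -- most web content today is UTF-8.
    try:
        content.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: use whichever detection library is installed, as usual.
    for name in ("chardet", "charset_normalizer"):
        try:
            detector = __import__(name)
        except ImportError:
            continue
        guess = detector.detect(content).get("encoding")
        if guess:
            return guess

    # Step 2: neither library is installed -- warn and fall back.
    warnings.warn(
        "No encoding detection library is installed. Falling back to "
        "ascii. Install chardet or charset_normalizer for detection."
    )
    return "ascii"
```

Note that the UTF-8 attempt also covers plain ASCII bodies, since ASCII is a subset of UTF-8.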

akx avatar Jul 14 '21 12:07 akx

apparent_encoding genuinely just needs to go away. That can't be done until a major release. Once that happens, we won't need a dependency on either library.

sigmavirus24 avatar Jul 14 '21 12:07 sigmavirus24

Hi @akx

Hopefully, this change will benefit the people who actually depend on charset detection.

Ousret avatar Jul 14 '21 15:07 Ousret

charset_normalizer cannot correctly handle some very normal entries, e.g., an empty JSON response:

>>> import charset_normalizer
>>> charset_normalizer.detect(b'{}')
/home/tseaver/projects/agendaless/Google/src/python-cloud-core/.nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
  warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> b'{}'.decode('utf_16_be')
'筽'

Note that charset_normalizer.detect is actually deprecated:

>>> print(charset_normalizer.detect.__doc__)

    chardet legacy method
    Detect the encoding of the given byte string. It should be mostly backward-compatible.
    Encoding name will match Chardet own writing whenever possible. (Not on encoding name unsupported by it)
    This function is deprecated and should be used to migrate your project easily, consult the documentation for
    further information. Not planned for removal.

    :param byte_str:     The byte sequence to examine.
    

tseaver avatar Jul 14 '21 17:07 tseaver

Thanks @tseaver, I believe this is something we'd called out in initial testing. The same issue happened for large strings containing only numbers, which were seemingly randomly categorized as utf-16. I was under the impression this had been resolved, though.

@Ousret can you take a look at this when you have a moment?

nateprewitt avatar Jul 14 '21 17:07 nateprewitt

Hi @tseaver

Thanks for the report, I have seen that you opened an issue in charset_normalizer. I will pursue this in the issue. @nateprewitt of course.

Ousret avatar Jul 14 '21 18:07 Ousret

Hopefully, this change will benefit the people who actually depend on charset detection.

As I said, most people end up far more frustrated by this "feature" than helped by it.

sigmavirus24 avatar Jul 15 '21 00:07 sigmavirus24

The change causes issues on my side as well. It seems charset_normalizer does not detect the ASCII string below correctly (even when using the newest 2.0.2 version):

>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'

>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>

For the moment I am using the workaround described in your release note. Will open an issue in charset_normalizer.

grindsa avatar Jul 15 '21 05:07 grindsa

I took the liberty of implementing a version of what I drafted out in the original post in PR #5875.

Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.

akx avatar Jul 15 '21 12:07 akx

Although it's solved, I just wanted to mention that this is indeed a crucial mechanism, since charset_normalizer is still "young". For example, it doesn't work properly on Debian in some cases, while it is quite consistent on Windows.

a-maliarov avatar Sep 06 '21 13:09 a-maliarov

@a-maliarov can you elaborate on "works consistently on Windows but not on Debian"? I need a concrete case.

Ousret avatar Sep 06 '21 15:09 Ousret

@Ousret hi, I think it would be better to post my specific issue within charset_normalizer's repository. Will do later.

a-maliarov avatar Sep 06 '21 15:09 a-maliarov

I am still not sure how it impacted me, but I had an encoding issue with my PDF generation when upgrading requests to 2.26. As far as I can tell, the lib I am using (xhtml2pdf) does not use requests or chardet/charset_normalizer directly.

The code is simply:

import io
from tempfile import NamedTemporaryFile
import xhtml2pdf.pisa

html = 'html unicode string'
input_file = io.BytesIO(html.encode('utf-8'))
temp_file = NamedTemporaryFile('wb', delete=False)
xhtml2pdf.pisa.CreatePDF(input_file, temp_file, encoding='utf8')

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.

Gagaro avatar Sep 22 '21 09:09 Gagaro

I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.

You can simply install chardet, @Gagaro, and it will be fully backwards-compatible (or use the right extra when installing requests). In your requirements, add requests as requests[use_chardet_on_py3]==2.26.0, or install it with pip install "requests[use_chardet_on_py3]==2.26.0"; this should solve the problem entirely.

If not, then it means that a different change in 2.26.0 impacted you.

potiuk avatar Sep 22 '21 10:09 potiuk

Thanks, it works with [use_chardet_on_py3]. I still don't see how this sub-dependency can have this impact.

Gagaro avatar Sep 22 '21 11:09 Gagaro

Thanks, it works with [use_chardet_on_py3]. I still don't see how this sub-dependency can have this impact.

The switch to charset-normalizer was done in a backwards-compatible way, to give people who implicitly depend on chardet's results an easy way to switch back to using chardet.

Simply - if chardet is installed, it will be used instead of charset-normalizer.
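The preference described here can be sketched as a try/except import fallback. This is a hedged sketch of the idea, not the exact code inside requests; the extra guard for the case where neither library is present is added here only so the snippet runs anywhere.

```python
# Sketch of the "prefer chardet if present" mechanism; the actual code
# inside requests may differ in detail.
try:
    import chardet  # preferred, for backwards-compatible results
except ImportError:
    try:
        import charset_normalizer as chardet  # drop-in fallback
    except ImportError:
        chardet = None  # neither installed; detection unavailable

if chardet is not None:
    # Both libraries expose the same chardet-style detect() API,
    # returning a dict with 'encoding', 'confidence' and 'language'.
    result = chardet.detect(b"hello world")
    print(result.get("encoding"))
```

Because both libraries expose a chardet-style detect() call, the rest of the code can be written against a single name regardless of which one was imported.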

It's a known "property" of charset-normalizer that it sometimes produces different results than chardet when the encoding is guessed from content. Both chardet and charset-normalizer use some kind of heuristics to determine it, and they each use different optimisations and shortcuts to make the guess fast.

So when @ashb implemented the change, he thought about users like you who might somehow depend on the way chardet detects encodings: if chardet is installed, it will be used. The gist of the change was that chardet should not be a "required" dependency because of the license it uses, so it's no longer mandatory for requests. But if you install it as an optional extra or manually, that's fine (and to keep backwards compatibility it will then be used). Thanks to that, the LGPL license (being optional) does not limit users of requests from redistributing their code, or their users from redistributing it further.

potiuk avatar Sep 22 '21 11:09 potiuk

Yes, I understand all that; what I don't understand is why it impacts xhtml2pdf, which does not depend on requests / chardet / charset-normalizer:

xhtml2pdf==0.2.4
  - html5lib [required: >=1.0, installed: 1.1]
    - six [required: >=1.9, installed: 1.15.0]
    - webencodings [required: Any, installed: 0.5.1]
  - Pillow [required: Any, installed: 7.2.0]
  - pyPdf2 [required: Any, installed: 1.26.0]
  - reportlab [required: >=3.0, installed: 3.5.46]
    - pillow [required: >=4.0.0, installed: 7.2.0]
  - six [required: Any, installed: 1.15.0]

Do chardet or charset-normalizer modify some things globally which could explain the side effects?

Thanks for taking the time to explain all of that.

Gagaro avatar Sep 22 '21 12:09 Gagaro

@Gagaro html5lib optionally requires chardet and likely behaves differently if it's not installed.

https://github.com/html5lib/html5lib-python/blob/f7cab6f019ce94a1ec0192b6ff29aaebaf10b50d/requirements-optional.txt#L7-L9

akx avatar Sep 22 '21 12:09 akx

@Gagaro html5lib optionally requires chardet and likely behaves differently if it's not installed.

Ah. So requests is not the only one with optional chardet dependency :)

potiuk avatar Sep 22 '21 12:09 potiuk

Indeed, nice catch! So we actually depend on chardet without even knowing it :sweat_smile: .

Gagaro avatar Sep 22 '21 13:09 Gagaro

I just upgraded from 2.25.1 to 2.26.0 and my logs now fill up with charset_normalizer lines, tons of them. I can fix it by adding explicit headers like 'Content-Type': 'application/scim+json; charset=utf-8' to REST APIs and requests, but what about all the REST APIs out there without a charset in the Content-Type header; what is the workaround there? I really do not see any reason for chardet or charset_normalizer in requests at all: just crash if we fail to decode, so we can fix it. It is just bloating the library. My initial tests also show that it is slower than the previous version when used with JWT tokens and base64 encoding.

Below is an excerpt of the log output while handling a JWT token response from Keycloak (Red Hat's OIDC/OAuth2 server):

20211126154014.190|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.203|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.245|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (316 byte(s) given) parameters.
20211126154015.964|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (3 byte(s) given) parameters.

fenchu avatar Nov 26 '21 15:11 fenchu

Just install chardet.

And yeah, I think the requests maintainers want to remove both in the future.

potiuk avatar Nov 26 '21 15:11 potiuk

I also see performance degradation after moving from 2.25.1 to newer versions; this just bloats requests.

With requests==2.27.1 it is now even worse: these messages are logged at INFO, but they are really DEBUG-level chatter.

The call that produces the output below uses the following headers:

    headers = {
        'Accept': 'application/scim+json',
        'Content-Type': 'application/scim+json; charset=utf-8',
        'Authorization': f"Bearer {access_token}"
    }
20220113145635.412|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.415|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.416|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.417|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.538|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.541|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.303|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.304|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.307|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.308|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.

So the first thing I do in my logging module is:

logging.getLogger('charset_normalizer').disabled = True
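A less drastic alternative to disabling the logger outright is to raise its threshold, which is standard behavior of Python's logging module: lower-severity chatter is dropped while higher-severity messages still get through.

```python
import logging

# Instead of disabling the charset_normalizer logger entirely, raise its
# threshold so INFO/WARNING chatter is dropped but ERRORs still surface.
logging.getLogger("charset_normalizer").setLevel(logging.ERROR)
```

This keeps the logger alive for genuinely serious messages, at the cost of also hiding the WARNING-level lines shown earlier in this thread; pick the level that matches how much noise you can tolerate.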

fenchu avatar Jan 13 '22 14:01 fenchu

@fenchu this isn't really relevant to this issue and alternatives to not use charset_normalizer have already been provided. There are also multiple ways to disable the use of character detection entirely. You've explicitly chosen the one API that provides this feature. Reposting your grievance repeatedly isn't furthering the conversation here.

So to recap the thread for future readers:

  1. Yes, it should be optional. We want to remove character detection entirely. It doesn't belong in Requests. This won't happen until we major version bump.
  2. Installing chardet manually or using requests[use_chardet_on_py3] will have the exact same functionality in 2.27.1 as it did in 2.25.1.
  3. content and iter_content both expose the data returned by the response without any attempted character decoding. If you don't think the detected encoding is correct and don't want to use an auto-detector, you can either manually decode the content bytes or set the encoding attribute on the Response before calling .text.
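The escape hatches in point 3 look roughly like this. The Response object is constructed by hand (via the private _content attribute) purely so the example runs without a network call; in real code it would come back from requests.get(...).

```python
import requests

# Build a Response by hand so this snippet runs offline; in practice
# this object would be the return value of requests.get(...).
resp = requests.models.Response()
resp.status_code = 200
resp._content = "h\u00e9llo w\u00f6rld".encode("utf-8")  # raw body bytes

# Option A: take the raw bytes from .content and decode them yourself.
decoded = resp.content.decode("utf-8")

# Option B: set .encoding explicitly before reading .text, which skips
# apparent_encoding (and therefore chardet/charset_normalizer entirely).
resp.encoding = "utf-8"
text = resp.text

print(decoded == text)
```

Either option means the detection libraries are never consulted, which also silences their log output.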

nateprewitt avatar Jan 13 '22 15:01 nateprewitt