requests
Make chardet/charset_normalizer optional?
With a routine version bump of requirements, I noticed chardet had been switched out for charset_normalizer (which I had never heard of before) in #5797, apparently due to LGPL license concerns.
I agree with @sigmavirus24's comment https://github.com/psf/requests/pull/5797#issuecomment-875158955 that it's strange for something as central in the Python ecosystem as requests (45k stars, 8k forks, many contributors at the time of writing) to take such a relatively unknown and unproven library (132 stars, 5 forks, 2 contributors) as a hard dependency.
The release notes say you can use pip install "requests[use_chardet_on_py3]" to use chardet instead of charset_normalizer, but with that extra set, both libraries get installed.
I would imagine many users don't really need the charset detection features in Requests; could we open a discussion on making both chardet and charset_normalizer optional, à la requests[chardet] or requests[charset_normalizer]?
AFAICS, the only place where chardet is actually used in requests is Response.apparent_encoding, which is used by Response.text when there is no determined encoding.
Maybe apparent_encoding could:
- as a built-in first attempt, try decoding the content as UTF-8 (which would likely succeed in many cases)
- if neither chardet nor charset_normalizer is installed, warn the user ("No encoding detection library is installed. Falling back to XXXX. Please see YYYY for instructions" or some such) and return e.g. ascii
- otherwise, use either detection library as usual
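A minimal sketch of that proposed fallback order (the function name, the ascii default, and the warning text are illustrative; this is not requests' actual implementation):

```python
import warnings


def apparent_encoding(content: bytes) -> str:
    """Hypothetical sketch of the fallback order proposed above."""
    # 1. Built-in first attempt: UTF-8 covers the vast majority of web content.
    try:
        content.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. Use whichever detection library happens to be installed.
    for name in ("chardet", "charset_normalizer"):
        try:
            detector = __import__(name)
        except ImportError:
            continue
        return detector.detect(content)["encoding"]

    # 3. Neither library is available: warn and fall back to a safe default.
    warnings.warn(
        "No encoding detection library is installed. Falling back to ascii."
    )
    return "ascii"
```

Note that the empty-JSON case from the comments below never reaches a detection library under this scheme, since b'{}' decodes cleanly as UTF-8.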
apparent_encoding genuinely just needs to go away. That can't be done until a major release. Once that happens, we don't need dependencies on either library.
Hi @akx
Hopefully, this change will benefit the people who actually depend on charset detection.
charset_normalizer cannot correctly handle some very normal inputs, e.g., an empty JSON response:
>>> import charset_normalizer
>>> charset_normalizer.detect(b'{}')
/home/tseaver/projects/agendaless/Google/src/python-cloud-core/.nox/unit-3-6/lib/python3.6/site-packages/charset_normalizer/api.py:95: UserWarning: Trying to detect encoding from a tiny portion of (2) byte(s).
warn('Trying to detect encoding from a tiny portion of ({}) byte(s).'.format(length))
{'encoding': 'utf_16_be', 'language': '', 'confidence': 1.0}
>>> b'{}'.decode('utf_16_be')
'筽'
Note that charset_normalizer.detect is actually deprecated:
>>> print(charset_normalizer.detect.__doc__)
chardet legacy method
Detect the encoding of the given byte string. It should be mostly backward-compatible.
Encoding name will match Chardet own writing whenever possible. (Not on encoding name unsupported by it)
This function is deprecated and should be used to migrate your project easily, consult the documentation for
further information. Not planned for removal.
:param byte_str: The byte sequence to examine.
Thanks @tseaver, I believe this is something we called out in initial testing. The same issue happened for large strings containing only numbers, which were seemingly randomly categorized as UTF-16. I was under the impression this had been resolved, though.
@Ousret can you take a look at this when you have a moment?
Hi @tseaver
Thanks for the report. I have seen that you opened an issue in charset_normalizer; I will pursue this there.
@nateprewitt of course.
Hopefully, this change will benefit the people who actually depend on charset detection.
As I said, most people end up far more frustrated by this "feature" than helped by it.
The change causes issues on my side as well. It seems charset_normalizer does not correctly detect the ASCII string below (even with the newest 2.0.2 version):
>>> rawdata = b'g4UsPJdfzNkGW2jwmKDGDilKGKYtpF2X.mx3MaTWL1tL7CNn5U7DeCcodKX7S3lwwJPKNjBT8etY'
>>> import charset_normalizer
>>> detected_cn = charset_normalizer.detect(rawdata)
>>> detected_cn
{'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}
>>> import chardet
>>> detected_cd = chardet.detect(rawdata)
>>> print(detected_cd)
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
>>>
For the moment I am using the workaround described in your release notes. Will open an issue in charset_normalizer.
I took the liberty of implementing a version of what I drafted out in the original post in PR #5875.
Decoding ASCII and UTF-8 ("UTF-8 is used by 97.0% of all the websites whose character encoding we know.") will continue to work without those libraries, and a helpful error is raised in other cases.
Although it's solved, I just wanted to mention that this is indeed a crucial mechanic, since charset_normalizer is still "young". For example, it doesn't work properly on Debian in some cases, while it is quite consistent on Windows.
@a-maliarov can you explain the "works consistently on Windows but not Debian" claim? I need a concrete case.
@Ousret hi, I think it would be better to post my specific issue within charset_normalizer's repository. Will do later.
I am still not sure how it impacted me, but I had an encoding issue when upgrading requests to 2.26 with my pdf generation. As far as I can tell, the lib I am using (xhtml2pdf) does not use requests or chardet/charset_normalizer directly.
The code is simply:
html = 'html unicode string'
input_file = io.BytesIO(html.encode('utf-8'))
temp_file = NamedTemporaryFile('wb', delete=False)
xhtml2pdf.pisa.CreatePDF(input_file, temp_file, encoding='utf8')
I'm not even sure how to find where or how the change to charset_normalizer impacted this. My solution for now has been to pin requests to 2.25.1.
You can simply install chardet, @Gagaro, and it will be fully backwards-compatible (or use the right extra when installing requests).
In your requirements, add requests as requests[use_chardet_on_py3]==2.26.0 or install it with pip install "requests[use_chardet_on_py3]==2.26.0" - this should solve the problem entirely.
If not, then it means that a different change in 2.26.0 impacted you.
Thanks, it works with [use_chardet_on_py3]. I still don't see how this sub-dependency can have this impact.
Switching to charset-normalizer was done in a backwards-compatible way, to give people who implicitly depend on the results of chardet an easy way to switch back to it.
Simply: if chardet is installed, it will be used instead of charset-normalizer.
It's a known "property" of charset-normalizer that it sometimes produces different results than chardet when the encoding is guessed from content. Both chardet and charset-normalizer use some kind of heuristics to determine the encoding, and they use different optimisations and shortcuts to make that guess "fast".
So when @ashb implemented the change, he thought about users like you who might somehow depend on the way chardet detects encodings: if chardet is installed, it will be used. The gist of the change was that chardet should not be a "required" dependency because of the license it uses, so it's not mandatory for requests. But if you install it as an optional extra or manually, that's fine (and to keep backwards compatibility it will be used as an optional component). Thanks to that, the LGPL license (being optional) does not limit users of requests in redistributing their code, or their users in redistributing it further.
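The "prefer chardet if installed" behaviour described above can be sketched as a small import shim (a simplified illustration; requests' actual compatibility code differs in its details):

```python
import importlib


def _load_detector():
    """Prefer chardet when installed, otherwise fall back to
    charset_normalizer; return None if neither is available."""
    for name in ("chardet", "charset_normalizer"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Either library exposes a chardet-style detect() that returns a dict
# with 'encoding', 'confidence', and 'language' keys.
detector = _load_detector()
```

Because the lookup order tries chardet first, merely installing it (manually or via the extra) restores the pre-2.26 detection results.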
Yes, I understand all that; what I don't understand is why it impacts xhtml2pdf, which does not depend on requests / chardet / charset-normalizer:
xhtml2pdf==0.2.4
  - html5lib [required: >=1.0, installed: 1.1]
    - six [required: >=1.9, installed: 1.15.0]
    - webencodings [required: Any, installed: 0.5.1]
  - Pillow [required: Any, installed: 7.2.0]
  - pyPdf2 [required: Any, installed: 1.26.0]
  - reportlab [required: >=3.0, installed: 3.5.46]
    - pillow [required: >=4.0.0, installed: 7.2.0]
    - six [required: Any, installed: 1.15.0]
Do chardet or charset-normalizer modify some things globally which could explain the side effects?
Thanks for taking the time to explain all of that.
@Gagaro html5lib optionally requires chardet and likely behaves differently if it's not installed.
https://github.com/html5lib/html5lib-python/blob/f7cab6f019ce94a1ec0192b6ff29aaebaf10b50d/requirements-optional.txt#L7-L9
Ah. So requests is not the only one with optional chardet dependency :)
Indeed, nice catch! So we actually depend on chardet without even knowing it :sweat_smile: .
I just upgraded from 2.25.1 to 2.26.0 and my logs now fill up with charset_normalizer lines, tons of them. I can fix it if I add explicit headers like 'Content-Type': 'application/scim+json; charset=utf-8' to REST APIs and requests, but what about all the REST APIs out there without a charset in their Content-Type; what is the workaround there?
I really do not see any reason for chardet or charset_normalizer in requests at all; just crash if we fail to decode so we can fix it. It is just bloating the library. My initial tests also show that it is slower than the previous version when used with JWT tokens and base64 encoding.
Below is an excerpt from a response from a Keycloak (Red Hat's OIDC/OAuth2 server) JWT token:
20211126154014.190|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.203|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (192 byte(s) given) parameters.
20211126154014.245|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (316 byte(s) given) parameters.
20211126154015.964|WARNING|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:104|override steps (5) and chunk_size (512) as content does not fit (3 byte(s) given) parameters.
Just install chardet.
And yeah, I think the requests maintainers want to remove both in the future.
I also see performance degradation after moving from 2.25.1 to newer versions; this just bloats requests.
With requests==2.27.1 it is now even worse; this module logs at INFO what should be DEBUG.
The call producing the output below has the following headers:
headers = {
'Accept': 'application/scim+json',
'Content-Type': 'application/scim+json; charset=utf-8',
'Authorization': f"Bearer {access_token}"
}
20220113145635.412|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.415|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.416|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.417|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145635.538|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145635.541|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.303|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.304|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.307|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.308|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:376|ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
20220113145636.373|INFO|C:\dist\venvs\trk-fullstack-test\lib\site-packages\charset_normalizer\api.py:430|ascii is most likely the one. Stopping the process.
So the first thing I do in my logging module is:
logging.getLogger('charset_normalizer').disabled = True
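As an alternative to disabling the logger outright, raising its level keeps it usable while dropping the noisy detection messages (a sketch using only the standard library):

```python
import logging

# Raising the level (instead of disabling the logger) drops the noisy
# INFO/WARNING detection messages while keeping real errors visible.
logging.getLogger("charset_normalizer").setLevel(logging.ERROR)
```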
@fenchu this isn't really relevant to this issue and alternatives to not use charset_normalizer have already been provided. There are also multiple ways to disable the use of character detection entirely. You've explicitly chosen the one API that provides this feature. Reposting your grievance repeatedly isn't furthering the conversation here.
So to recap the thread for future readers:
- Yes, it should be optional. We want to remove character detection entirely. It doesn't belong in Requests. This won't happen until we major version bump.
- Installing chardet manually or using requests[use_chardet_on_py3] will give the exact same functionality in 2.27.1 as it did in 2.25.1.
- content and iter_content both expose the data returned by the response without any attempted character decoding. If you don't think the character encoding is correct and don't want to use an auto-detector, you can either manually decode the content bytes or set the encoding attribute on the Response before calling .text.
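The last point can be exercised without any detection library being consulted. A sketch (the response body here is fabricated for illustration; in practice you'd get it from a real request):

```python
import requests

# Build a Response by hand to simulate a body received over the wire.
resp = requests.models.Response()
resp.status_code = 200
resp._content = "naïve café".encode("utf-8")

# Option 1: work with the raw bytes and decode them yourself --
# no charset detection runs at all.
text = resp.content.decode("utf-8")

# Option 2: set the encoding explicitly before touching .text, so
# apparent_encoding (and whichever detection library backs it) is
# never consulted.
resp.encoding = "utf-8"
assert resp.text == "naïve café"
```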