requests
utils.get_encodings_from_content regexps incorrect matches
get_encodings_from_content uses these regexps:
charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')
I'm finding cases where these match "random junk charset=something_weird". A real-life example is https://www.walmart.com/ip/108356879, where I get 7 matches. The first one gives the desired "utf-8". The next five are all "UTF-8\". The last one comes from a 24,711-character match and produces a 1,730-character gibberish result.
Locally I've fixed this by changing the regexp patterns, replacing the first ".*?" with "[^>]*?", so the match cannot run past the end of the opening tag.
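For illustration, here is a minimal sketch of what that change might look like, assuming the only edit is swapping the first ".*?" for "[^>]*?" in each pattern. The helper name get_encodings_tightened and the sample HTML are hypothetical, made up for this example; this is not the code that ships in requests.

```python
import re

# Original charset pattern, as quoted in the report, kept for comparison.
orig_charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)

# Tightened variants: the first ".*?" becomes "[^>]*?", so the pattern
# has to find its keyword before the opening tag closes.
charset_re = re.compile(r'<meta[^>]*?charset=["\']*(.+?)["\'>]', flags=re.I)
pragma_re = re.compile(r'<meta[^>]*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
xml_re = re.compile(r'^<\?xml[^>]*?encoding=["\']*(.+?)["\'>]')


def get_encodings_tightened(content):
    """Hypothetical helper: collect matches from the three tightened patterns."""
    return (charset_re.findall(content)
            + pragma_re.findall(content)
            + xml_re.findall(content))


# Hypothetical sample: one real charset declaration, one unrelated <meta> tag,
# and a stray "charset=" inside a script further down the page.
sample = (
    '<meta charset="utf-8">'
    '<meta name="viewport" content="width=device-width">'
    '<script>var u = "/x?charset=junk";</script>'
)

print(orig_charset_re.findall(sample))   # includes the stray "junk" match
print(get_encodings_tightened(sample))   # only the declared encoding
```

On this sample the original pattern starts at the second <meta> tag and keeps scanning until it reaches the "charset=" inside the script, while the tightened pattern stops at the tag's closing ">".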
Would a PR implementing this in requests (and/or requests-toolbelt) be welcome?
A PR fixing this behavior would be welcome!