requests utils.get_encodings_from_content regexps incorrect matches

Open jbrockmendel opened this issue 5 years ago • 1 comments

get_encodings_from_content uses these regexps:

    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

I'm finding cases where these match unrelated text of the form "random junk charset=something_weird". A real-life example is https://www.walmart.com/ip/108356879, where I get 7 matches. The first one gives the desired "utf-8". The next five are all "UTF-8\". The last one spans 24711 characters and produces a 1730-character gibberish result.

Locally I've fixed this by changing the regexp patterns, replacing the first ".*?" with "[^>]*?" so that the match cannot run past the end of the opening tag.
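To illustrate the difference, here is a minimal sketch that compares the charset pattern quoted above with a variant whose first lazy wildcard is assumed to become `[^>]*?`. The name `charset_re_fixed` and the sample HTML are made up for this example; they are not taken from requests or from the Walmart page.

```python
import re

# Pattern as quoted in this issue.
charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)

# Hypothetical variant: the first ".*?" becomes "[^>]*?", so the search for
# "charset=" has to stay inside the <meta ...> tag it started in.
charset_re_fixed = re.compile(r'<meta[^>]*?charset=["\']*(.+?)["\'>]', flags=re.I)

# Made-up sample: one real charset declaration, one unrelated <meta> tag,
# and a script that happens to contain the text "charset=".
html = (
    '<meta charset="utf-8">'
    '<meta name="viewport" content="width=device-width">'
    '<script>var cfg = "charset=something_weird";</script>'
)

original_matches = charset_re.findall(html)
fixed_matches = charset_re_fixed.findall(html)

print(original_matches)  # ['utf-8', 'something_weird']
print(fixed_matches)     # ['utf-8']
```

With the original pattern, the second `<meta` tag acts as a starting point and the lazy `.*?` keeps expanding across tag boundaries until it reaches the `charset=` inside the script. The negated character class stops at the first `>`, so only the genuine declaration is reported.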

Would a PR implementing this in requests (and/or requests-toolbelt) be welcome?

Feb 22 '19 17:02 jbrockmendel

A PR fixing this behavior would be welcome!

Nov 28 '21 03:11 sethmlarson
