
It's not a good idea to parse HTML text using regular expressions

Open starrify opened this issue 8 years ago • 5 comments

In w3lib.html regular expressions are used to parse HTML texts:

_ent_re = re.compile(r'&((?P<named>[a-z\d]+)|#(?P<dec>\d+)|#x(?P<hex>[a-f\d]+))(?P<semicolon>;?)', re.IGNORECASE)
_tag_re = re.compile(r'<[a-zA-Z\/!].*?>', re.DOTALL)
_baseurl_re = re.compile(six.u(r'<base\s[^>]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']'), re.I)
_meta_refresh_re = re.compile(six.u(r'<meta\s[^>]*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P<quote>["\'])(?P<int>(\d*\.)?\d+)\s*;\s*url=\s*(?P<url>.*?)(?P=quote)'), re.DOTALL | re.IGNORECASE)
_cdata_re = re.compile(r'((?P<cdata_s><!\[CDATA\[)(?P<cdata_d>.*?)(?P<cdata_e>\]\]>))', re.DOTALL)

However, this approach is incorrect when commented-out content is involved, e.g.

>>> from w3lib import html
>>> html.get_base_url("""<!-- <base href="http://example.com/" /> -->""")
'http://example.com/'

Introducing "heavier" utilities like lxml would solve this issue easily, but that might be a bad idea since w3lib aims to be lightweight and fast.
Or maybe we could implement a quick parser merely for eliminating the commented-out parts.
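A quick comment-eliminating pre-processing step could be sketched as follows (a hypothetical illustration, not w3lib's actual code; the `(?:-->|$)` alternative also drops an unterminated comment in a truncated document):

```python
import re

# Strip HTML comments before applying the existing regexes.
# DOTALL lets a comment span multiple lines; the (?:-->|$)
# alternative handles a comment cut off by truncation.
_comment_re = re.compile(r"<!--.*?(?:-->|$)", re.DOTALL)

def remove_comments(text):
    return _comment_re.sub("", text)

# Same pattern as w3lib's _baseurl_re, applied after comment removal.
_baseurl_re = re.compile(
    r'<base\s[^>]*href\s*=\s*["\']\s*([^"\'\s]+)\s*["\']', re.I
)

def get_base_url(text):
    match = _baseurl_re.search(remove_comments(text))
    return match.group(1) if match else None

print(get_base_url('<!-- <base href="http://example.com/" /> -->'))  # → None
```

This keeps the regex-only speed profile, but it is still a heuristic: comments appearing inside `<script>` or CDATA sections would be stripped incorrectly.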

Any ideas?

starrify avatar Aug 11 '16 19:08 starrify

@starrify I believe the goal was indeed speed; also, these regexes may be applied to e.g. only the first 4096 bytes of a page, without the rest. Ideas about a proper solution are welcome! It should

a) be almost as fast as these regexes;
b) work on arbitrarily truncated HTML files.

kmike avatar Aug 12 '16 04:08 kmike

Just to add my 2 cents and to bump this issue,

Indeed, regex parsing of the HTML misses some things; as others have said before me, commented-out base tags are one example. In some cases those commented-out base tags point to different websites altogether, so for me the question is speed vs. accuracy. One can fork w3lib, or override scrapy/utils/response.py:get_base_url() and make it call an also-overridden w3lib/html.py:get_base_url() with @starrify's suggested addition.

devspyrosv avatar May 04 '18 12:05 devspyrosv

Another issue here is that it does not ignore commented-out tags. For example, we may have a commented-out base tag like:

 <!--<base href="http://127.0.0.1" />-->
 <base href="http://www.example.com/" />

Of course, according to _baseurl_re, the commented-out one will be matched.

Any ideas on how we can solve this?
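One stdlib-only alternative, heavier than a single regex but far lighter than lxml, is Python's built-in html.parser, which skips comments by design and tolerates truncated input. A minimal sketch (BaseURLParser is a hypothetical name, not part of w3lib):

```python
from html.parser import HTMLParser

class BaseURLParser(HTMLParser):
    """Record the href of the first <base> tag outside comments."""

    def __init__(self):
        super().__init__()
        self.base_url = None

    def handle_starttag(self, tag, attrs):
        # Also fired for self-closing tags like <base ... /> via
        # HTMLParser's default handle_startendtag implementation.
        if tag == "base" and self.base_url is None:
            self.base_url = dict(attrs).get("href")

def get_base_url(text):
    parser = BaseURLParser()
    parser.feed(text)
    return parser.base_url

page = (
    '<!--<base href="http://127.0.0.1" />-->\n'
    '<base href="http://www.example.com/" />'
)
print(get_base_url(page))  # → http://www.example.com/
```

Whether this stays within kmike's "almost as fast as these regexes" requirement would need benchmarking; it does satisfy the truncated-HTML requirement, since feed() accepts partial input.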

botzill avatar Jan 26 '19 12:01 botzill

Hello, just for your reference.

I recently tested w3lib's prescan against the 500 most popular websites. I found three bugs (or behaviors that differ from the HTML5 spec).

books.google.com: <meta http-equiv="content-type"content="text/html; charset=UTF-8"> (no space between attributes)

mega.nz: <meta http-equiv="Content-Type" content="text/html, charset=UTF-8" /> (comma, not semicolon)

stuff.co.nz: doc.write('<body onload=[...] <meta charset="utf-8"/> (prescan matches the '<body' inside the JavaScript string)

The validator's, jsdom's, and html5lib-python's prescan parsers all detect the encoding successfully.

...I don't know whether it is a good idea to fix these and make the prescan regex even more complex.

openandclose avatar Oct 26 '19 15:10 openandclose

Regular expressions are insufficiently sophisticated to handle the constructs employed by HTML. HTML is not a regular language, and hence cannot be parsed by regular expressions.

fonkwe avatar Feb 28 '20 13:02 fonkwe