Don't parse HTML with RegEx

Open schlamar opened this issue 13 years ago • 2 comments

That's just wrong. :warning: There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

  • http://stackoverflow.com/a/1732454/851737
  • http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html (more technical)
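
To make the point concrete, here is a minimal sketch of link extraction with a real parser instead of a regex. It is not PyCrawler's code: `LinkExtractor` and `extract_links` are hypothetical names, and the standard-library `html.parser` is swapped in for lxml or Beautiful Soup only so the example runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser copes with mixed-case tags, either quoting style, and extra attributes, and it does not report links that sit inside HTML comments. Those are the cases a hand-written pattern tends to get wrong.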

schlamar, Nov 19 '12 20:11

I have seen both of these links before and I'm well aware of the problem; however, I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides, or the benefits of DOM parsers, but it does help when trying to stick something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup and I will gladly accept the changes. (I recommend the latter; I have used it on other things and it's wonderful.) This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.

theanti9, Nov 21 '12 04:11

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.

schlamar, Nov 21 '12 12:11