PyCrawler Don't parse HTML with RegEx

That's just wrong. :warning: There are XML/HTML parsers like lxml or Beautiful Soup.

See references:

http://stackoverflow.com/a/1732454/851737
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html (more technical)
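
To make the point concrete, here is a minimal sketch of link extraction with a real HTML parser instead of a regex. It deliberately uses Python's standard-library `html.parser` rather than lxml or Beautiful Soup, so it runs without extra dependencies; the same idea carries over to either library. The names `LinkExtractor` and `extract_links`, the sample markup, and the `example.com` URLs are made up for illustration and are not taken from PyCrawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper, not PyCrawler code)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return all link targets found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample input: mixed case, single quotes, an entity, and a commented-out link.
sample = """
<a href="/about">About</a>
<A HREF='http://example.com/x?a=1&amp;b=2' class="ext">External</A>
<!-- <a href="/commented-out">hidden</a> -->
"""
links = extract_links(sample, "http://example.com/")
print(links)
```

For this sample the parser handles the uppercase tag, the single-quoted attribute, and the `&amp;` entity, and it skips the link inside the comment. Those are the kinds of cases a hand-written regex tends to get wrong.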

Nov 19 '12 20:11 schlamar

I have seen both of these links before and I'm well aware of the problem; however, I do not have time to rewrite it. The one upside of regex is that it is much more portable. That doesn't necessarily outweigh its downsides, or the benefits of DOM parsers, but it does help when trying to stick something together in a very short amount of time just for fun (the original point of this project). If you have the time and are willing, please feel free to redo it with lxml or Beautiful Soup (I recommend the latter; I have used it on other things and it's wonderful) and I will gladly accept the changes. This repo does get a lot of attention. I wish I had more time to devote to it, but life is busy.

Nov 21 '12 04:11 theanti9

No, thanks, already did it :-) http://www.schlamar.org/blog/2010/04/10/python-search-engine-crawler-part-1/

FYI: This took me about 30 minutes of programming, so don't tell me about a short amount of time. Doing it right doesn't have to take more time than a dirty approach.

Nov 21 '12 12:11 schlamar