node-htmlparser
handle improperly escaped attributes
Pages in the wild sometimes contain malformed markup that makes this parser behave oddly, for example unescaped double quotes inside a double-quoted attribute value.
A tag such as this:
<a href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;" />
has its attributes parsed like so:
{ href: '#', onclick: 'moveAddCommentBelow(', 'div-comment-579747': 'div-comment-579747', ',': ',', '579747,': '579747,', 'true);': 'true);', return: 'return', 'false;': 'false;' }
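The result is consistent with a tokenizer that ends a quoted value at the very next matching quote and then reads everything after it as bare, valueless attribute names. The sketch below reproduces that behaviour on the attribute string from the tag. It is only an illustration: `naiveParseAttributes` and its regular expression are made up for this example and are not node-htmlparser's actual code.

```javascript
// Hypothetical illustration; NOT node-htmlparser's implementation.
// Rule 1: a quoted value ends at the first matching quote character.
// Rule 2: a name without "=value" is stored with its own name as the value.
function naiveParseAttributes(source) {
  const pattern = /([^\s="']+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"']+)))?/g;
  const attributes = {};
  let match;
  while ((match = pattern.exec(source)) !== null) {
    const name = match[1];
    // Groups 2-4 hold a double-quoted, single-quoted, or unquoted value.
    const value = [match[2], match[3], match[4]].find((v) => v !== undefined);
    attributes[name] = value !== undefined ? value : name;
  }
  return attributes;
}

// Attribute portion of the tag from the report above.
const naiveInput =
  'href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;"';
console.log(naiveParseAttributes(naiveInput));
```

The second double quote (right after `moveAddCommentBelow(`) closes the `onclick` value, so the rest of the handler is split on whitespace and quotes into separate attribute names.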
Yeah, I know =) It is one area that I am looking to make more forgiving. It has caused me problems too when scraping certain sites.
There is a rewrite (2.0) in the works that will take these format errors into account.
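One possible way to be more forgiving, sketched here as an assumption and not as what the 2.0 rewrite actually does: treat a quote as the end of a value only when it is followed by whitespace, `/`, `>`, or the end of the input. The function name `forgivingParseAttributes` and the regular expression are hypothetical.

```javascript
// Hypothetical heuristic; not taken from node-htmlparser or its 2.0 rewrite.
// A quote closes the value only if the next character is whitespace, "/",
// ">", or the end of the input. Any other quote is kept as part of the value.
function forgivingParseAttributes(source) {
  const pattern =
    /([^\s="']+)(?:\s*=\s*(?:"((?:[^"]|"(?![\s\/>]|$))*)"|'((?:[^']|'(?![\s\/>]|$))*)'|([^\s"']+)))?/g;
  const attributes = {};
  let match;
  while ((match = pattern.exec(source)) !== null) {
    const name = match[1];
    // Groups 2-4 hold a double-quoted, single-quoted, or unquoted value.
    const value = [match[2], match[3], match[4]].find((v) => v !== undefined);
    attributes[name] = value !== undefined ? value : name;
  }
  return attributes;
}

const forgivingInput =
  'href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;"';
console.log(forgivingParseAttributes(forgivingInput));
```

With this rule the sample yields only `href` and `onclick`, and `onclick` keeps the whole handler text. The trade-off is that valid but tightly packed markup such as `a="x"b="y"` (no whitespace between attributes) would be merged into a single value, so a real parser would likely need a more careful rule.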