node-htmlparser

handle improperly escaped attributes

Open: Swizec opened this issue 14 years ago • 1 comment

The wild internet sometimes contains malformed markup that makes this parser misbehave.

A tag such as this: <a href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;" />

Has attributes parsed like so: { href: '#' , onclick: 'moveAddCommentBelow(' , 'div-comment-579747': 'div-comment-579747' , ',': ',' , '579747,': '579747,' , 'true);': 'true);' , return: 'return' , 'false;': 'false;' }
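
For illustration, here is a minimal sketch of one forgiving heuristic, not node-htmlparser's actual implementation or API. It assumes a quote only closes an attribute value when the next character is whitespace, `/`, `>`, or the end of the input. The function name `parseAttributesLeniently` is hypothetical, and the input is assumed to be just the attribute portion of the tag.

```javascript
// Hypothetical helper, NOT part of node-htmlparser.
// Assumed heuristic: a quote ends a value only if it is followed by
// whitespace, '/', '>', or the end of the input.
function parseAttributesLeniently(source) {
  const attrs = {};
  const n = source.length;
  const isSpace = (ch) => /\s/.test(ch);
  let i = 0;

  while (i < n) {
    // Skip whitespace and stray slashes between attributes.
    while (i < n && (isSpace(source[i]) || source[i] === '/')) i++;
    if (i >= n) break;

    // Read the attribute name up to '=', whitespace, or '/'.
    let start = i;
    while (i < n && source[i] !== '=' && !isSpace(source[i]) && source[i] !== '/') i++;
    const name = source.slice(start, i);

    while (i < n && isSpace(source[i])) i++;
    if (source[i] !== '=') {
      // Valueless attribute: mirror the name as its value.
      if (name) attrs[name] = name;
      continue;
    }
    i++; // consume '='
    while (i < n && isSpace(source[i])) i++;

    const quote = source[i];
    if (quote === '"' || quote === "'") {
      i++; // consume opening quote
      start = i;
      while (i < n) {
        if (source[i] === quote) {
          const next = source[i + 1];
          // Only treat the quote as closing if a delimiter follows it.
          if (next === undefined || isSpace(next) || next === '/' || next === '>') break;
        }
        i++;
      }
      if (name) attrs[name] = source.slice(start, i);
      i++; // consume closing quote
    } else {
      // Unquoted value: read up to the next whitespace.
      start = i;
      while (i < n && !isSpace(source[i])) i++;
      if (name) attrs[name] = source.slice(start, i);
    }
  }
  return attrs;
}

const parsed = parseAttributesLeniently(
  'href="#" onclick="moveAddCommentBelow("div-comment-579747", 579747, true); return false;" /'
);
console.log(parsed);
```

On the tag above this yields just `href` and `onclick`, with the whole handler kept as the `onclick` value. The heuristic is still a guess: a value such as `title="say "hi" now"` would be cut at the inner quote that is followed by a space.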

Swizec, Sep 25 '10 15:09

Yeah, I know =) It is one area that I am looking to make more forgiving. It has caused me problems too when scraping certain sites.

There is a rewrite (2.0) in the works that will take these format errors into account.

tautologistics, Oct 04 '10 13:10