Text with angle brackets parsed improperly

Open • ShayGuy opened this issue 6 years ago • 1 comment

Description

I am scraping a website that includes a select dropdown whose option elements are unclosed. The inner text of one of these elements contains text enclosed in angle brackets. HtmlAgilityPack's parser interprets this text as a start tag and treats everything after it as that tag's content, up to the next closing tag for an enclosing element, which happens to be the </select> tag itself. As a result, every option element from the one with the angle brackets onward is parsed improperly. Link to minimal fiddle below.

(In fairness, Beautiful Soup seems to handle this page even worse -- without the closing tags, it doesn't even realize any of the option elements have ended. Just nests them until it hits </select>.)

Fiddle

https://dotnetfiddle.net/WBBwNx
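
For readers who cannot open the fiddle, here is a minimal sketch of the kind of repro described above. The markup is hypothetical (it is not the page from the report, and the fiddle's actual contents are not reproduced here), and what gets printed may differ between HtmlAgilityPack versions, so no expected output is shown.

```csharp
using System;
using HtmlAgilityPack;

public static class Program
{
    public static void Main()
    {
        // Hypothetical markup: unclosed <option> elements, one of which
        // has angle-bracketed text ("<pending>") in its inner text.
        const string html =
            "<select>" +
            "<option value=\"1\">First" +
            "<option value=\"2\">Second <pending>" +
            "<option value=\"3\">Third" +
            "</select>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Dump the parsed tree so any spurious <pending> element is visible.
        Console.WriteLine(doc.DocumentNode.OuterHtml);

        // SelectNodes returns null when nothing matches.
        var options = doc.DocumentNode.SelectNodes("//option");
        if (options == null)
        {
            Console.WriteLine("No option elements found.");
            return;
        }

        // List each option the parser recognized, with its value and text.
        foreach (var option in options)
        {
            var value = option.GetAttributeValue("value", "");
            Console.WriteLine($"value={value} text=[{option.InnerText}]");
        }
    }
}
```

If the parser behaves as described in this issue, the option with value 3 would show up inside a `pending` element rather than as a sibling of the other options.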

ShayGuy, Dec 10 '19 09:12

Hello @ShayGuy ,

Thank you for reporting.

We will look at this and probably apply a solution very similar to the one you suggested.

Best Regards,

Jonathan


JonathanMagnan, Dec 10 '19 15:12