gumbo-query
gumbo-query copied to clipboard
Escape method won't match correctly
So, going over the code for parsing selectors, there's the method parseEscape(...)
which appears to be intended to convert attribute values that contain escaped characters and code points to literal values. However, doing some quick tests embedding such values in HTML, parsing then matching, it appears that gumbo_parser
directly embeds these values untouched. So if I'm right, unescaping these sequences in supplied selectors will actually cause the selectors which should match, to not match.
I'm not familiar with the GO html package, but from what searching I've done, it appears that this might have been a requirement for cascadia
(since it worked against the GO html package), but is invalid for use with gumbo_parser
. On that note, gumbo_parser
does appear to convert numeric character references and named character references, but gumbo_query
doesn't of course, so matching a selector with an id containing something like " "
will fail.
Will try to post some tests to confirm, I'm just reading over all of this figuring it out.
References: http://www.w3.org/International/questions/qa-escapes https://mathiasbynens.be/notes/css-escapes http://www.w3.org/TR/html5/syntax.html#named-character-references
I've confirmed this. Gumbo Parser directly embeds escaped character sequences without modification, yet gumbo-query converts them before building a selector. Gumbo Parser has a massive portion of source code dedicated to converting character references (char_ref.c
), but gumbo-parser does not. So you have a twofold breakdown here. Selectors built to match escaped sequences will never match, and HTML elements that use character references will never match.
The character references issue can be sorted by leveraging the character reference conversion built into gumbo parser, and the parseEscaped method needs to be dropped entirely.
void ConvertCharRefStr(std::string& stringWithCharRef)
{
std::string htmlStr = u8"<html><body><div gqconverted=\"" + stringWithCharRef + u8"\"/></body></html>";
GumboOutput* output = gumbo_parse(htmlStr.c_str());
GumboNode* nodeWithConvertedValue = static_cast<GumboNode*>(static_cast<GumboNode*>(output->root->v.element.children.data[1])->v.element.children.data[0]);
GumboAttribute* a = gumbo_get_attribute(&nodeWithConvertedValue->v.element.attributes, u8"gqconverted");
stringWithCharRef.clear();
stringWithCharRef.assign(a->value);
gumbo_destroy_output(&kGumboDefaultOptions, output);
}
Hi, @TechnikEmpire You're right about the escape. I've just port the code from go to cpp, and don't do much research about gumbo_parser. I will look into this problem and make a fix. Thanks for the report.
@lazytiger I understand. I'll see if I can help. I'm doing an unofficial fork fixing these things, but also rewriting almost everything, but that's for my own purposes. I'm injecting a bunch of third party dependencies and don't expect it to be pulled from back to here. However, I'll create a separate actual fork, throw in whatever bug fixes I come up with and you can decide if you want to pull them.
@lazytiger I lost track of most of the fixes I made. I tried to do a clone and push fixes but, like I said just lost track. I've finished my rewrite of gumbo-query and renamed it to GQ so it doesn't shadow your project. I've created a ton of tests that can be run in the browser and test programs which verify correct functionality for all types of selectors (over 40 tests). I've also implemented a tree map that makes matching extremely fast. Can find it here https://github.com/TechnikEmpire/GQ if you feel like backporting any of it. Unfortunately due to the full scale rewrite and implementation of boost deps, it wasn't possible to do an official fork and submit PR's.
@TechnikEmpire Thanks for the fixes, I'll see to it later.