Proposal - Update XPath to (at least) v2.0
While the latest recommendation is v3.1, most questions related to XPath seem to be about the missing regular expressions, introduced in v2.0, which has been a recommendation for nearly 10 years.
However, all browsers support only XPath v1.0 from 1999.
Background
Widely adopted in 2007 by popular frameworks such as Dojo, Prototype, or MooTools, the XPath language is an extremely powerful tool to query and crawl the DOM along all its axes. It is hence superior to CSS, and even in version 1 it can already express proposed selectors such as :has(...).
// CSS container:has(child)
// XPath since 1999
'.//container[count(.//child) > 0]'
But this only scratches the surface of what XPath can do, as opposed to querying via CSS, checking the surrounding DOM nodes via JS, and checking that the results are valid (e.g. if (child.closest('container'))). On top of that, with CSS there's no way to target text nodes or even comments.
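As a rough illustration of the point about text nodes and comments, here is a minimal sketch built on the standard `document.evaluate` API. The helper name `collectNodes` and the two constant names are hypothetical, and the numeric `7` is the value of `XPathResult.ORDERED_NODE_SNAPSHOT_TYPE`, spelled out so the snippet does not depend on the `XPathResult` global.

```javascript
// Hypothetical helper: run an XPath 1.0 expression and return the matches as an array.
// `doc` is any object with a DOM-style evaluate() method (e.g. window.document).
const ORDERED_NODE_SNAPSHOT_TYPE = 7; // same value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE

function collectNodes(doc, expression, context = doc) {
  const result = doc.evaluate(expression, context, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Two expressions that have no CSS selector equivalent:
const ALL_COMMENTS = './/comment()';               // every comment node under the context
const NON_EMPTY_TEXT = './/text()[normalize-space()]'; // text nodes that are not just whitespace
```

In a browser, `collectNodes(document, ALL_COMMENTS)` should return every comment node in the page, and `collectNodes(document, NON_EMPTY_TEXT, someElement)` the non-blank text nodes under `someElement`. A snapshot result is used here because it stays usable even if the DOM is mutated while looping.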
Proposal
Provide at least the method matches(RegExp, flag) on top of the current XPath 1.0 (let's call it 1.1), or provide at least v2 of this old-but-gold standard for crawling any DOM tree. If it's still updated and useful for back-end crawlers, it's unclear why first-class citizen JS should not benefit from its potential, which is far superior to CSS selectors and less error-prone, as filtering and complex searches can be done directly through document.evaluate.
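For context, this is roughly what has to be done today to get matches-like behavior: query broadly with XPath 1.0, then filter in JavaScript. The sketch below is only an approximation under stated assumptions: `filterByRegExp` is a hypothetical name, it tests each node's `textContent`, and it uses JavaScript regular expression syntax, which differs from the regular expression flavor that XPath 2.0 defines for `matches`.

```javascript
// Hypothetical helper: emulate an XPath 2.0-style matches() on top of XPath 1.0
// by post-filtering in JavaScript. `nodes` is any iterable of objects with a
// textContent property (DOM nodes, or plain objects in tests).
function filterByRegExp(nodes, pattern, flags = '') {
  const re = new RegExp(pattern, flags);
  const kept = [];
  for (const node of nodes) {
    // RegExp.prototype.test is stateful with the g/y flags, so reset lastIndex.
    re.lastIndex = 0;
    if (re.test(node.textContent ?? '')) kept.push(node);
  }
  return kept;
}
```

With native support, the two steps (an XPath query for the candidates, then `filterByRegExp(candidates, '^-\\d+')`) would collapse into a single expression passed to `document.evaluate`.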
Thanks in advance for considering this improvement, as I'm sure that once RegExp is in, the usage of XPath for complex SPA/PWA pages would flourish again in libraries, web components, and the Web in general.
Per https://www.chromestatus.com/metrics/feature/popularity it does seem that about 1-2% of page views end up using XPath, so maybe it's worth considering, but I wouldn't really want to do anything here until #67 is fully settled, including tests. XPath has been a long neglected part of the platform, we should standardize what we have first before considering additions.
@annevk thanks for pointing me to #67 ... I think I searched in the HTML repo and not here. Otherwise, bumping the XPath version might be part of #67 too, imho, as once there's agreement on settling it, there might be agreement on what should run underneath, right? If you feel like that's the case, feel free to close this issue, and I'll keep watching/following the other one.
It's been like a decade so I might remember wrongly, but I don't think XPath 2.0 is backwards compatible. That doesn't mean we couldn't do compatible extensions to 1.0, but I'm not sure what the appetite is for that.
I'm really after having matches(RegExp, flag) which is currently the biggest missing feature in XPath 1.0, available since XPath 2.0 ... however, if we focused on a new XPath API, I don't see backward compatibility as an issue.
Chrome is not interested in this. The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them, or at least replace them with something that generates less security bugs. Increasing the capabilities of XML in the browser runs counter to that goal.
"or at least replace them with something that generates less security bugs"
if replaced, since work would need to be done regardless, what are the security implications of having matches in, if I might ask?
Also worth mentioning that usage has increased in recent years, so removing it does look like a breaking change ... we just started using XPath extremely successfully on many occasions, and having it fully removed would break many things, so I hope there's room for changes but no deprecation ... it's super powerful as a query language and it can provide things CSS might never have for perf or other reasons.
"I don't think XPath 2.0 is backwards compatible."
This is not true, at least in the sense I would understand it, i.e. that an XPath 2.0 (or indeed 3.1, there is no XPath 2 :) ) processor will happily run an XPath 1.0 statement, and return the same nodes as an XPath 1.0 processor.
"The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them, or at least replace them with something that generates less security bugs. Increasing the capabilities of XML in the browser runs counter to that goal."
Assuming for a moment that increasing XML capabilities "generates [...] security bugs" (I am not convinced), this is a proposal for querying the HTML DOM with XPath, not XML. (Thanks to @gimsieke for pointing this out!)
By "XML parts of our pipeline" I mean "everything implemented using libxml and libxslt".
Deprecating/replacing libxml and libxslt would be a prerequisite of updating support to XPath v >1.
So Chrome should be behind such a move, right?
I can tell this is not going to be a productive conversation, as folks are intent on playing word games to try and pretend Chrome has a different stance than we do. As such, I won't be participating in this thread further. I think I've made our position clear.
If supporting XPath >=2.0 would mean everyone needs a completely new implementation, then wouldn't it be less work overall to just continue to improve CSS selectors to support the missing features?
@domenic I don't think you are being fair to @yamahito's point. Just because XPath has "X" in the name doesn't necessarily mean it needs to have anything to do with XML.
Sorry to make you feel I am playing word games, and that my joke has made you throw your toys out of the pram, but there was a serious point here, which I don't think you've addressed.
You and the OP sort of have the same problem: libxml and libxslt have not been updated to work with updated specifications for a very long time.
If you want a productive suggestion, how about the Saxon-HE/C library as a potential alternative? https://www.saxonica.com/saxon-c/index.xml
The products Saxon-PE/C and Saxon-EE/C are commercial products, and require a license key.
I mean maybe Michael Kay would have some idea whereby this could be doable and he would find it reasonable, but I think this makes it difficult for some browsers.
I personally would love if I saw XPath getting some love, so don't take my comment as negative.
However, Saxon-HE/C is open source: you wouldn't have support for all features (e.g. schema awareness), but I don't think those would be missed for this purpose.
Of course, there may be other reasons why it's not doable (licensing issues), and I'm not qualified to comment on implementation. I certainly don't want to talk for Chrome, despite aspersions to the contrary. I just want to point out that the underlying issue is the use of a library many years out of date, but that said library does not reflect on XPath as a technology.
@domenic I am not sure that "folks" included me (but I guess so ...)
"I can tell this is not going to be a productive conversation, as folks are intent on playing word games"
I honestly had the feeling there was no room for any conversation, after your first reply:
"Chrome is not interested in this. The XML parts of our pipeline are in maintenance mode and we would love to eventually deprecate and remove them"
Although this sentence is not exactly about what I've proposed, it is also scary, 'cause SVG, as far as I know, is still part of the XML namespace/pipeline, and announcing that anything XML is going to be deprecated and removed is concerning, imo.
I also think it's clear that developers who know XPath and its potential are probably not using it daily due to its lack of improvements since 1999, so asking why, where, or what looks like a normal conversation to me, while "dropping the bomb and the mic" at the same time feels a bit "off", imho. If there's anything I've said that made you put me in the "folks that play word games" category, I apologize, 'cause even if I'm not sure where I gave you that feeling, it surely wasn't my intent.
I hope that the idea of improving XPath, to let Web developers fulfill any requirement not satisfied by what CSS currently offers, would be considered at least by other vendors, especially after reading that XPath apparently has security implications while it's still a W3C recommendation ... it took much less to deprecate SQLite, and no security issue was obvious at that time. It's weird that something known to be insecure has been kept in the platform for 20 years and never got a chance to be updated.
@annevk XPath 2 (and, more to the point these days, 3.1) are highly backward compatible with XPath 1. There are some differences. Example: in XPath 1, the string value of a sequence is the string value of the first item in the sequence; that was crazy and caused lots of bugs in people's XPath expressions.
The XPath 2 and 3 specs include notes for people implementing XPath 2 and 3 on how to handle those cases. They are very small edge cases & many are unlikely to apply to Web browser usage anyway.
Possible implementation approaches include (1) make a standard API that includes the desired XPath version; this is badly needed in any case... (2) use a JavaScript-based implementation (see e.g. frameless.io), (3) write or reuse a C/rust/C++ one, most likely starting with an XQuery implementation as that's an extension of XPath (XQuery 1 extends XPath 2, confusingly; XQuery 3.1 extends XPath 3.1).
Where XPath 1 was based on node lists, XPath 2 moved to being based on sequences; it's much more powerful for users, and a lot of things that were tricky became a lot clearer, but the underlying code is likely very different.
A CSS xpath('expr' [, version]) function would be super useful e.g. in the content property, as it can do string processing on text in the document - even if only in the "slow" profile of CSS.
@WebReflection the security issues in XPath are that there are functions (starting in XPath 1) that allow file access. The same security issues that XMLHttpRequest has apply. There are also common extensions in XPath implementations to allow extended file access, but those make no sense in a Web browser - see e.g. expath.org. In XPath 3 it's possible to write recursive functions, as with JavaScript, so you could create infinite loops, and an implementation needs to detect this. There's also the possibility - again as with JavaScript - of building up variables, e.g. with the string concatenation operator || like this:
let $a := "socks socks socks socks", $b := $a || $a || $a || $a, $c := $b || $b || $b || $b return $c || $c || $c || $c
which makes lots of socks. Or you can write
string-join( (1 to 99999999), ", ")
to make "1, 2, 3, ..."
As with JavaScript, a sensible workaround is limits on variable size & sequence length. So the security issues are known and manageable.
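To make the "limits on variable size & sequence length" idea concrete, here is a minimal JavaScript sketch of a size-capped join. It is only an illustration of the guard, not how any XPath engine actually implements it; `boundedStringJoin`, `range`, and the `maxLength` parameter are hypothetical names.

```javascript
// Hypothetical guard: join items like XPath's string-join(), but stop as soon
// as the result would exceed a configured size limit.
function boundedStringJoin(items, separator, maxLength) {
  let out = '';
  let first = true;
  for (const item of items) {
    const piece = (first ? '' : separator) + String(item);
    if (out.length + piece.length > maxLength) {
      throw new RangeError(`string-join result exceeds ${maxLength} characters`);
    }
    out += piece;
    first = false;
  }
  return out;
}

// Lazily produce the integers lo..hi, like the XPath range expression (lo to hi).
function* range(lo, hi) {
  for (let i = lo; i <= hi; i++) yield i;
}
```

Here `boundedStringJoin(range(1, 5), ', ', 100)` returns `"1, 2, 3, 4, 5"`, while `boundedStringJoin(range(1, 99999999), ', ', 1000)` throws a `RangeError` after a few hundred items instead of building a huge string. Because the range is generated lazily, the full sequence is never materialized either.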
But that's different from what @domenic meant, which is that there were security issues in the XML pipeline - that is, in the C libraries they have been using, which are large, complex, and hard to fix.
Yes, CSS could be extended to be comparable - e.g. to be able to do string matching & processing on text content, date/time arithmetic, joins, union/intersection, and so forth. It'd be a lot of work, although just adding matches() and replace() would go a long way -
td.matches("^-\d+") { color: red; }
(to invent a syntax in selectors)
although,
span.price.xpath(. gt 0 and . lt 100 and not(preceding-sibling::span[. = 'special'])) { color: green; }
would go further. I'd guess that in the next 10 or 15 years CSS will get there; in the meantime, custom CSS functions and selectors may give a way to do some of the things you can do with XPath, albeit more slowly.
@liamquin thanks a lot for the clarification, and yes, that makes sense. However, if XPath 3 is more problematic than 2, in terms of possible footguns within the parsing and features, I think having v2 available in JS would already be a killer feature compared to 1. Since nobody wants new footguns in JS, upgrading to the least problematic version that provides matches and replace (which is also in v2 IIRC) would enable a whole new world of possibilities that fit into a well-known selector, instead of spanning some CSS selector, plus JS checks, plus anything else that might result in more errors than features for the platform.
Personally, as one of the authors of a free open source XPath 3.1 implementation (https://github.com/FontoXML/fontoxpath), I do not really see the point in shipping XPath 3.x or 2.0 in the browser.
Rather, I would prefer to see a way to plug into the CSS engine to use XPath in CSS, so that we can do what @liamquin described, but in a more flexible way. There will be many performance concerns there, but those must be manageable in some way.
@DrRataplan unless you are thinking about exposing XPath through querySelector/All, I am not sure how that would cover the crawling/addressing use case, but as your solution would still mean updating XPath and hooking it into CSS, I don't think your idea would take less time than simply updating XPath.
Also worth reminding that updating XPath, as proposed in here, has nothing to do with styling, as any live styling through XPath will make pages likely very slow, otherwise we would already have :has(...) selector widely implemented.
I do not think @domenic meant removal of XML API
Deprecate, and consider removing, XSLT
The consensus last time we considered this was that xml and xslt are too important for enterprise and we cannot remove them from the platform. Closing this bug to match that reality. We'll open a new bug if we ever decide to do this. [1] (Feb 22, 2019)
— https://bugs.chromium.org/p/chromium/issues/detail?id=514995
That said, I have one example of the state of the XML APIs: did you know that DOMParser is slower at parsing text/xml than text/html?
We have querySelector/querySelectorAll, over the years it adopted many XPath selectors, yet there is no queryXPath/queryXPathAll and its polyfill is just a few lines:
// Make iterator-type XPathResult objects usable with for...of and spread.
XPathResult.prototype[Symbol.iterator] = function* () {
  let next;
  while ((next = this.iterateNext())) {
    yield next;
  }
};
// Extra arguments (resolver, type, result) are forwarded to evaluate() as-is.
Document.prototype.queryXPathAll = function (expression, ...args) {
  return [...this.evaluate(expression, this, ...args)];
};
Element.prototype.queryXPathAll = function (expression, ...args) {
  return [...this.ownerDocument.evaluate(expression, this, ...args)];
};
There was a proposal; it was closed, pending #67.
jQuery popularized CSS selectors. Somehow there is not much XPath, XPath 2.0, or XPath 3.0 activity on the web. It would be great if its proponents described how it helps them. Personally, I use XPath to query text nodes and as a :has replacement:
//text()[last()]
//a[text() = 'foo']
//a[img]
I do not think Web developers know and use count, etc. XPath 2.0 extends it, feels a lot like SQL:
//*[tokenize(@class, ' ') = 'foo']
//time[fn:year-from-date(xs:date(@datetime)) = 2020]
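For comparison, the class check can be approximated in XPath 1.0 today with the well-known `concat`/`normalize-space`/`contains` idiom. The sketch below just builds that expression string; `hasClassPredicate` is a hypothetical helper name, and the result only approximates `tokenize(@class, ' ')`, since `normalize-space` also collapses tabs and newlines.

```javascript
// Hypothetical helper: build an XPath 1.0 predicate that tests for a single
// class token, approximating XPath 2.0's tokenize(@class, ' ') = 'foo'.
function hasClassPredicate(className) {
  // Reject whitespace and quote characters so the value is safe to embed.
  if (!/^[^\s'"]+$/.test(className)) {
    throw new TypeError('class name must be a single token without quotes');
  }
  return `contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')`;
}

// Produces: //*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]
const expression = `//*[${hasClassPredicate('foo')}]`;
```

Padding both the attribute value and the class name with spaces is what prevents `foo` from matching `foobar` or `not-foo`. The date example has no comparably short XPath 1.0 form, since 1.0 has no date types.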
I would prefer an Invisible XML approach:
//*[id = 'foo']
//p[class = 'bar']
//p[lang/en/us]
//date[datetime/year = '2020']
//a[href/host = 'example.com']
//span[xstyle/color = 'blue']
emulated with
<p><id>foo</id></p>
<p><class>bar</class></p>
<p><lang><en><us></us></en></lang></p>
<date><datetime><year>2020</year></datetime></date>
<a><href><host>example.com</host></href></a>
<span><xstyle><color>blue</color></xstyle></span>
(<style> is CDATA, I use <xstyle> instead)
Each node knows its type, parses the underlying mini-language, and presents it as if it were nodes.
@sergeykish with XPath I can select even attribute nodes and/or text nodes, and this is gold for libraries based on template literals ... as an example, this single query //*/@*[.="${uid}"]|//*/text()[contains(.,"${uid}")] lets me remove a tree walker with checks all over for the attribute content or text content. XPath does that in one line, because it's a language born to query the tree, not to style it. I wouldn't mind having queryXPath and/or queryXPathAll, if that helps adding matches, but I really hope matches can be added as an amendment to XPath 1.
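One practical detail when interpolating a value such as `uid` into a query like this: XPath 1.0 string literals have no escape sequences, so a value containing quote characters needs care. The sketch below shows the common `concat()` workaround; `xpathLiteral` and `uidQuery` are hypothetical names and are not taken from any library mentioned in this thread.

```javascript
// Hypothetical helper: turn an arbitrary JS string into an XPath 1.0 string
// literal. XPath 1.0 has no escape sequences, so a value containing both
// quote characters has to be assembled with concat().
function xpathLiteral(value) {
  if (!value.includes('"')) return `"${value}"`;
  if (!value.includes("'")) return `'${value}'`;
  // Split on double quotes and re-insert each one as a single-quoted literal.
  const parts = value.split('"').map((part) => `"${part}"`);
  const separator = ", '\"', ";
  return 'concat(' + parts.join(separator) + ')';
}

// Builds the attribute-or-text query from the comment above for a given uid.
function uidQuery(uid) {
  const lit = xpathLiteral(uid);
  return `//*/@*[.=${lit}]|//*/text()[contains(.,${lit})]`;
}
```

For a plain identifier, `uidQuery('abc')` yields the same query string as the one quoted above. For a value such as `a"b'c`, the literal becomes `concat("a", '"', "b'c")`.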
The rest of the functions are also well known, and there are cheatsheets that help with it too: https://devhints.io/xpath#class-check
Why 2.0? Why not the latest version?
@sirinath apparently there's an agreement among XPath users that v1.0 is the right version to use, and that any new features should be implemented on top of v1.0. To be honest, the only feature I really miss, and so do others, is the RegExp functionality, which, together with what XPath 1.0 currently offers, would already be a huge upgrade in possibilities.
As apparently nobody wants to touch this part of the Web anyhow, we should try to understand if bringing just that would be possible, or if we could just close this proposal as not accepted and move forward.
I don't think that using Hacker News comments is good for determining that there is a consensus that XPath 1.0 is the right version to use/build on. If you wanted to do something like that you would need to do a survey of companies and hobbyists to see what stacks they are using and if they would use XPath 2.0/3.0/3.1 features if they were available on those stacks (including on web browsers, e.g. when testing via Selenium). Personally, I like the changes that XPath 2.0 made to the language, as it tidied up several things like not being able to do *:hr or processing-instruction(name) in XPath 1.0.
FWIW, I have recreated the XPath 1.0 grammar using the XPath 2.0 names and structures at https://rhdunn.github.io/xquery-intellij-plugin/specifications/XPath%201.0%20as%202.0%20EBNF%20Grammar.html. It is 47 EBNF symbols, compared to XPath 2.0's 82, 3.0's 108 and XPath 3.1's 126. That document also describes the grammar differences between XPath 1.0 and XPath 2.0.
@rhdunn
"I don't think that using Hacker News comments is good for determining that there is a consensus that XPath 1.0 is the right version to use/build on."
Absolutely, and I haven't used the word consensus. I've just found interesting comments from various people actually using, and appreciating, what XPath brings to the table. Many said 2 or even 3 are too much to implement and possibly problematic, but a few said it should be relatively easy to add RegExp on top of the current implementation only, which is 1.0.
As this issue mentions upgrading to 2.0, that's the ideal dream/goal, but since vendors have already stated that they don't think this would ever happen, that they have no interest, or that it's complicated, I'm just saying I personally miss RegExp, as I think that'd already be a huge step forward in scraping and querying possibilities.
Hey, this issue is trending on HN https://news.ycombinator.com/item?id=24959588 - probably a good idea to lock it for a bit to reduce the amount of noise.
Also, knowing some of the people involved - I think discussion here isn't too great:
- Having a lot of people show up to downvote/upvote things isn't great. It's a shame GitHub doesn't let repo owners turn that feature on/off. It creates a feeling of maintainers being attacked.
- "Chrome is not interested in this" is a perfectly fine way to respond. You may not like the fact Domenic said that nor that it's Chrome's position but Chrome is allowed to have that position. I think a productive follow up would have been "what would it take for you to reconsider?" or "how can we help with Chrome's concerns?" or something similar.
- Telling Chrome about libraries ( like saxon-c ) doesn't help. It's asking them to allocate a significant amount of work to refactor a feature they are not interested in maintaining to begin with but have to. It'll be a very hard sell.
- Props to Andrea for trying to engage constructively and actually explain why XPath2 would be useful for Chrome and what capabilities it adds to the web platform that may ask Chrome to support it. Also props to Liam on explaining why that's good.
I am not sure why no one brought up the fact that XPath 3 implementations exist in userland (this seems to be the most popular one) but they are not popular. So XPath 3 does not add capabilities to the web platform since it's possible in userland and is not popular and does not fix the issues with the existing APIs since it can't replace it because of compatibility.
If you want to engage (constructively) with Chrome on this - you need to look at their perspective and explain how an investment into XPath 3.1 aligns with their goals. For example - get someone to sponsor work that reduces the existing technical debt significantly while adding the API. TBH if I were Chrome I'd likely still not go for it because of their perspective.
Before the window here closes, as someone who isn’t here because of HN :), I thought it might be helpful to add a data point. I’ve also used XPath for the exact same purpose as @WebReflection. It seems very suited to this. (And — far more niche — I’ve also employed it when processing WOFF metadata.) I found the existing implementation adequate for both tasks, but figured it might still be useful for implementers to know that the xpath-for-template-substitutions pattern isn’t a one-off.