Add rel-id xpath ext
This PR adds an xpathfunc that performs relative id lookups.
There are two ways of doing id lookups:
- sel.xpath('id("foo")') under the hood performs a dictionary lookup and thus is blazingly fast; however, there's no way to limit the nodeset to search in.
- with sel.xpath('//*[@id="foo"]') one can limit the nodeset the way they like; however, it has to traverse all the matching nodes, and thus is a lot slower.
The rel-id function presented in this PR attempts to achieve some middle ground: it does the id lookup under the hood, but then checks that the result lies inside the specified nodeset, i.e. all of the following statements return the same results:
sel.xpath('rel-id("foo", //div)')
sel.xpath('//div').xpath('rel-id("foo")')
sel.xpath('id("foo")[ancestor::div]')
sel.xpath('id("foo")[set:intersection(ancestor::*, //div)]')
sel.xpath('//div//*[@id="foo"]')
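The lookup-then-filter idea can be modeled in plain Python. The sketch below is not the PR's implementation: it uses the standard library's ElementTree instead of parsel/lxml, and the sample document and the helper names build_indexes and rel_id are made up for illustration. It only assumes that "inside the nodeset" means "has an ancestor in the nodeset", as the ancestor-based equivalents suggest.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the PR.
DOC = """
<html><body>
  <div id="shell"><p><span id="foo">inside a div</span></p></div>
  <section><span id="bar">outside any div</span></section>
</body></html>
"""


def build_indexes(root):
    """Build an id -> element map and a child -> parent map in one pass."""
    by_id, parent = {}, {}
    for element in root.iter():
        if "id" in element.attrib:
            # Keep the first element seen for a given id.
            by_id.setdefault(element.attrib["id"], element)
        for child in element:
            parent[child] = element
    return by_id, parent


def rel_id(by_id, parent, wanted, nodeset):
    """Dictionary lookup first, then keep the hit only if an ancestor is in nodeset."""
    hit = by_id.get(wanted)
    if hit is None:
        return []
    allowed = set(nodeset)  # elements hash by identity
    node = parent.get(hit)
    while node is not None:
        if node in allowed:
            return [hit]
        node = parent.get(node)
    return []


root = ET.fromstring(DOC)
by_id, parent = build_indexes(root)
divs = root.findall(".//div")

# Lookup-then-filter: one dictionary hit plus a short walk up the tree.
fast = rel_id(by_id, parent, "foo", divs)

# Traversal: scan every descendant of every div for a matching id attribute.
slow = [hit for div in divs for hit in div.findall(".//*[@id='foo']")]

print([e.text for e in fast], [e.text for e in slow])
print(rel_id(by_id, parent, "bar", divs))
```

Both strategies return the same element for "foo", and the lookup for "bar" comes back empty because that element has no div ancestor. The cost of the first strategy depends on the depth of the hit, not on the size of the nodeset's subtrees.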
Naturally, it's a Python-level xpathfunc, so "native" solutions that involve id and ancestor are faster, but it's still more performant than [@id="foo"] and .css("div #foo") (which expands to [@id="foo"]):
Columns: expression, time, ratio to the first row of its group.
expression                                                                    time   ratio
sel.css("#masthead")                                                         0.971   1.000
sel.xpath("//*[@id='masthead']")                                             1.186   1.221
sel.xpath("id('masthead')")                                                  0.032   0.033
sel.xpath("rel-id('masthead')")                                              0.051   0.053

sel.css("#shell #masthead")                                                  2.162   1.000
sel.xpath("//*[@id='shell']//*[@id='masthead']")                             2.257   1.044
sel.xpath("id('shell')//*[@id='masthead']")                                  1.147   0.531
sel.xpath("id('masthead')[ancestor::*[@id='shell']]")                        0.039   0.018
sel.xpath("id('masthead')[set:intersection(ancestor::*, id('shell'))]")      0.037   0.017
sel.xpath("rel-id('masthead', id('shell'))")                                 0.055   0.025
sel.xpath("id('shell')").xpath("rel-id('masthead')")                         0.090   0.041

sel.css("div #masthead")                                                    12.127   1.000
sel.xpath("id('masthead')[ancestor::div]")                                   0.035   0.003
sel.xpath("rel-id('masthead', //div)")                                       0.558   0.046
sel.xpath("//div").xpath("rel-id('masthead')")                              17.939   1.479

sel.xpath("id('masthead')[set:intersection(ancestor::*, (//div|//span))]")   0.248   1.000
sel.xpath("rel-id('masthead', (//div|//span))")                              0.670   2.700
sel.xpath("//div|//span").xpath("rel-id('masthead')")                       19.825  79.845
The benchmark is available here. Also, sel.xpath("//div").xpath("rel-id('masthead')") and sel.xpath("//div|//span").xpath("rel-id('masthead')") are very slow because rel-id is invoked once for every node selected by the first xpath call.
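The difference between passing the nodeset as an argument and chaining .xpath() calls can be illustrated by counting invocations. This is a toy model under stated assumptions, not the benchmark code: the document, the rel_id stand-in and the counter are hypothetical, and ElementTree is used in place of parsel.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 500 empty divs plus one div holding the target id.
root = ET.fromstring(
    "<body>" + "<div><p/></div>" * 500 + "<div><p id='masthead'/></div></body>"
)
parent = {child: el for el in root.iter() for child in el}
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
calls = {"n": 0}  # mutable counter shared with rel_id


def rel_id(wanted, nodeset):
    """Toy stand-in for rel-id that counts how often it is invoked."""
    calls["n"] += 1
    hit = by_id.get(wanted)
    allowed = set(nodeset)
    node = parent.get(hit) if hit is not None else None
    while node is not None:
        if node in allowed:
            return [hit]
        node = parent.get(node)
    return []


divs = root.findall(".//div")

# One call with the whole nodeset, in the spirit of rel-id('masthead', //div).
calls["n"] = 0
single = rel_id("masthead", divs)
single_calls = calls["n"]

# One call per context node, in the spirit of
# sel.xpath("//div").xpath("rel-id('masthead')").
calls["n"] = 0
chained = [hit for div in divs for hit in rel_id("masthead", [div])]
chained_calls = calls["n"]

print(single_calls, chained_calls)
```

Both forms find the same element, but the chained form pays the per-call cost once per div. In the real setting each of those calls also crosses the XPath/Python boundary, which is presumably why the chained rows dominate the benchmark.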
One particular situation where rel-id is helpful is when you pre-select a subset of the document and then look in its descendants:
sel2 = sel.xpath('id("foo")')
sel2.xpath('rel-id("bar")')
The sel2.xpath('id("bar")[set:intersection(ancestor::*, .)]') approach won't work here, because the dot inside the square brackets refers to the id("bar") node rather than to the id("foo") node.
Codecov Report
Merging #100 into master will not change coverage. The diff coverage is 100%.
@@          Coverage Diff          @@
##           master   #100   +/-   ##
=====================================
  Coverage     100%   100%
=====================================
  Files           5      5
  Lines         248    265   +17
  Branches       46     51    +5
=====================================
+ Hits          248    265   +17
| Impacted Files | Coverage Δ | |
|---|---|---|
| parsel/xpathfuncs.py | 100% <100%> (ø) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 68d64db...eb7edc4.
Another, slightly subtle, application of the context node defaulting to the current node:
sel.xpath('//div[rel-id("some-id")]')
which means: select every div that contains an element with id="some-id".
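To make the predicate form concrete, here is a pure-Python model of which divs it would be expected to select. It is a sketch only: the markup and the divs_containing helper are hypothetical, ElementTree stands in for parsel, and the result is computed by looking the id up once and walking its ancestors, which is not necessarily how the xpathfunc evaluates the predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with nested divs.
root = ET.fromstring(
    "<body>"
    "<div class='outer'><div class='inner'><a id='some-id'/></div></div>"
    "<div class='other'><a id='unrelated'/></div>"
    "</body>"
)
parent = {child: el for el in root.iter() for child in el}
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}


def divs_containing(wanted):
    """Return every div that has the element with id `wanted` as a descendant."""
    found = []
    node = parent.get(by_id[wanted]) if wanted in by_id else None
    while node is not None:
        if node.tag == "div":
            found.append(node)
        node = parent.get(node)
    found.reverse()  # outermost div first, matching document order
    return found


# Reference result: test every div by scanning its descendants.
scanned = [
    div for div in root.iter("div")
    if div.find(".//*[@id='some-id']") is not None
]

print([div.get("class") for div in divs_containing("some-id")])
```

Note that with nested divs every enclosing div qualifies, not just the innermost one, so both the outer and the inner div are selected here.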
If we decide to merge this, we should probably update the documentation first.