Add rel-id xpath ext

Open immerrr opened this issue 8 years ago • 3 comments

This PR adds an xpathfunc that performs relative id lookups.

There are two ways of doing those:

  • sel.xpath('id("foo")') performs a dictionary lookup under the hood and is therefore very fast, but there is no way to limit the nodeset it searches.
  • sel.xpath('//*[@id="foo"]') lets you limit the nodeset any way you like, but it has to traverse all the matching nodes and is thus a lot slower.

The rel-id function presented in this PR aims for a middle ground: it does the id lookup under the hood, then checks that the result lies within the specified nodeset. That is, all of the following statements return the same results:

sel.xpath('rel-id("foo", //div)')
sel.xpath('//div').xpath('rel-id("foo")')
sel.xpath('id("foo")[ancestor::div]')
sel.xpath('id("foo")[set:intersection(ancestor::*, //div)]')
sel.xpath('//div//*[@id="foo"]')
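
To make the semantics concrete, here is a minimal sketch of how such a function could be written directly against lxml. It is not the code from this PR: the body of `rel_id`, the `scope` parameter name and the sample document are assumptions made for illustration, and "within the nodeset" is taken to mean "has an ancestor in the nodeset", as in the `ancestor::div` variant above.

```python
from lxml import etree


def rel_id(context, id_value, scope=None):
    """Hypothetical rel-id: id() lookup, kept only if the hit lies inside `scope`."""
    node = context.context_node
    # Without an explicit nodeset, the context node itself is the scope.
    scope = [node] if scope is None else scope
    # Document-wide lookup through the native id() function.
    found = node.xpath("id($v)", v=id_value)
    # Keep a hit only if at least one of its ancestors belongs to the scope.
    return [el for el in found
            if any(anc in scope for anc in el.iterancestors())]


# Register the function globally in the empty function namespace.
etree.FunctionNamespace(None)["rel-id"] = rel_id

root = etree.HTML(
    "<html><body>"
    '<div><p id="foo">inside a div</p></div>'
    '<ul><li id="bar">outside any div</li></ul>'
    "</body></html>"
)

by_rel_id = root.xpath('rel-id("foo", //div)')
by_chain = [hit for div in root.xpath("//div")
            for hit in div.xpath('rel-id("foo")')]
by_native = root.xpath('id("foo")[ancestor::div]')
by_scan = root.xpath('//div//*[@id="foo"]')
outside = root.xpath('rel-id("bar", //div)')
```

In this sketch the four lookups of "foo" select the same element, while "bar" is filtered out because none of its ancestors is a div.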

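Naturally, since rel-id is a Python-level xpathfunc, "native" solutions that combine id and ancestor are faster, but it still outperforms [@id="foo"] and .css("div #foo") (which expands to [@id="foo"]). In the timings below, the first column is the measured time and the second is the ratio to the first row of each group: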

sel.css("#masthead")                                                    0.971  1.000
sel.xpath("//*[@id='masthead']")                                        1.186  1.221
sel.xpath("id('masthead')")                                             0.032  0.033
sel.xpath("rel-id('masthead')")                                         0.051  0.053


sel.css("#shell #masthead")                                             2.162  1.000
sel.xpath("//*[@id='shell']//*[@id='masthead']")                        2.257  1.044
sel.xpath("id('shell')//*[@id='masthead']")                             1.147  0.531
sel.xpath("id('masthead')[ancestor::*[@id='shell']]")                   0.039  0.018
sel.xpath("id('masthead')[set:intersection(ancestor::*, id('shell'))]")  0.037  0.017
sel.xpath("rel-id('masthead', id('shell'))")                            0.055  0.025
sel.xpath("id('shell')").xpath("rel-id('masthead')")                    0.090  0.041


sel.css("div #masthead")                                               12.127  1.000
sel.xpath("id('masthead')[ancestor::div]")                              0.035  0.003
sel.xpath("rel-id('masthead', //div)")                                  0.558  0.046
sel.xpath("//div").xpath("rel-id('masthead')")                         17.939  1.479


sel.xpath("id('masthead')[set:intersection(ancestor::*, (//div|//span))]")  0.248  1.000
sel.xpath("rel-id('masthead', (//div|//span))")                         0.670  2.700
sel.xpath("//div|//span").xpath("rel-id('masthead')")                  19.825 79.845

The benchmark is available here. Note that sel.xpath("//div").xpath("rel-id('masthead')") and sel.xpath("//div|//span").xpath("rel-id('masthead')") are very slow because rel-id is invoked once for every node selected by the first call.
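
The difference in invocation counts can be shown with a small lxml sketch. The `probe` function is a hypothetical stand-in for rel-id that only counts how often it is called, and the 50-div document is made up for the example. The loop mirrors what a chained `.xpath()` on a list of selectors does: it evaluates the expression once per selected node.

```python
from lxml import etree

counter = {"calls": 0}


def probe(context, *args):
    """Hypothetical stand-in for rel-id that only counts its invocations."""
    counter["calls"] += 1
    return True


etree.FunctionNamespace(None)["probe"] = probe

# Made-up document with 50 sibling divs.
root = etree.HTML(
    "<html><body>" + "<div><p>x</p></div>" * 50 + "</body></html>"
)

# Nodeset passed as an argument: the function runs once.
counter["calls"] = 0
root.xpath("probe(//div)")
calls_with_argument = counter["calls"]

# Chained form: the expression is evaluated separately for every div.
counter["calls"] = 0
for div in root.xpath("//div"):
    div.xpath("probe()")
calls_chained = counter["calls"]
```

Here the argument form triggers a single Python call, while the chained form triggers one call per div, which is where the per-call overhead adds up.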

One situation where rel-id is particularly helpful is when you pre-select a subset of the document and then search its descendants:

sel2 = sel.xpath('id("foo")')
sel2.xpath('rel-id("bar")')

The sel2.xpath('id("bar")[set:intersection(ancestor::*, .)]') approach won't work here, because inside the square brackets the dot already refers to the id("bar") node rather than to id("foo").
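
The rebinding of the dot can be checked with plain lxml and no extension function. The tiny document below is an assumption made for the example: "foo" is a div and "bar" is a span inside it, so `name(.)` reveals which node the dot refers to.

```python
from lxml import etree

root = etree.HTML(
    '<html><body><div id="foo"><span id="bar">x</span></div></body></html>'
)
foo = root.xpath('id("foo")')[0]

# Outside a predicate, the dot is the context node the query starts from.
outside = foo.xpath("name(.)")

# Inside the predicate, the dot is the node being filtered: the "bar" span.
keeps = foo.xpath('id("bar")[name(.) = "span"]')
drops = foo.xpath('id("bar")[name(.) = "div"]')
```

The second predicate matches nothing, because by the time it is evaluated the original "foo" context is no longer reachable through the dot.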

immerrr avatar Sep 06 '17 18:09 immerrr

Codecov Report

Merging #100 into master will not change coverage. The diff coverage is 100%.

@@          Coverage Diff          @@
##           master   #100   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           5      5           
  Lines         248    265   +17     
  Branches       46     51    +5     
=====================================
+ Hits          248    265   +17

Impacted Files         Coverage Δ
parsel/xpathfuncs.py   100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 68d64db...eb7edc4.

codecov[bot] avatar Sep 06 '17 18:09 codecov[bot]

Another, slightly subtle, consequence of the context node defaulting to the current node:

sel.xpath('//div[rel-id("some-id")]')

This selects every div that contains an element with id="some-id".
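
A rough lxml sketch of this predicate use follows. The one-argument `rel_id` here is a simplified, hypothetical version written for this example, not the PR's code, and the two-div document is made up. Inside the predicate, `context.context_node` is the div currently being tested, so the function returns a non-empty nodeset only for a div that contains the element.

```python
from lxml import etree


def rel_id(context, id_value):
    """Hypothetical one-argument rel-id: id() hits below the context node."""
    node = context.context_node
    return [el for el in node.xpath("id($v)", v=id_value)
            if node in list(el.iterancestors())]


etree.FunctionNamespace(None)["rel-id"] = rel_id

root = etree.HTML(
    "<html><body>"
    '<div class="a"><p id="some-id">x</p></div>'
    '<div class="b"><p>y</p></div>'
    "</body></html>"
)

# A non-empty nodeset is truthy in a predicate, an empty one is falsy.
matches = root.xpath('//div[rel-id("some-id")]')
classes = [div.get("class") for div in matches]
```

Only the first div is selected, since the lookup relative to the second div finds nothing.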

immerrr avatar Sep 06 '17 19:09 immerrr

If we decide to merge this, we should probably update the documentation first.

Gallaecio avatar Aug 19 '19 14:08 Gallaecio