cssselect
Incorrect use of XPath name() function
The use of the name() function for matching tags breaks with documents that have a default namespace or multiple namespace prefixes mapping to the same namespace.
For example, the CSS selector
h|p + h|p
becomes
descendant-or-self::h:p/following-sibling::*[name() = 'h:p' and (position() = 1)]
When this query is run on an XHTML document that declares XHTML as its default namespace, it produces no matches, because the name() function returns "p" there. Similarly, it fails on a document that binds the XHTML namespace to a prefix other than h.
A possible solution is to have the css_to_xpath function take a namespaces argument that contains a mapping of prefixes to URIs and then use local-name() and namespace-uri() instead of name(). The argument can default to None, in which case it can use the present behavior, for backward compatibility.
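To make the proposal concrete, here is a rough sketch of how the element test could be built. It only generates the predicate string and is not cssselect's actual code: the helper names `xpath_literal` and `element_condition` are made up for illustration, and the `namespaces` argument is the hypothetical prefix-to-URI mapping suggested above.

```python
XHTML_NS = "http://www.w3.org/1999/xhtml"


def xpath_literal(value):
    """Quote a string as an XPath 1.0 literal (hypothetical helper)."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # XPath 1.0 has no escape syntax; keep the sketch simple and refuse.
    raise ValueError("cannot quote a value containing both quote types")


def element_condition(prefix, local_name, namespaces=None):
    """Build the predicate that tests an element's name.

    With namespaces=None this reproduces the present name()-based test.
    With a prefix-to-URI mapping it compares the local name and the
    namespace URI, so the prefix used in the document no longer matters.
    """
    if namespaces is None:
        qname = "%s:%s" % (prefix, local_name) if prefix else local_name
        return "name() = %s" % xpath_literal(qname)
    uri = namespaces[prefix]  # KeyError if the prefix is not declared
    return "local-name() = %s and namespace-uri() = %s" % (
        xpath_literal(local_name),
        xpath_literal(uri),
    )


# Predicate for the second step of "h|p + h|p", old and proposed forms.
old = element_condition("h", "p")
new = element_condition("h", "p", {"h": XHTML_NS})
print("following-sibling::*[%s and (position() = 1)]" % old)
print("following-sibling::*[(%s) and (position() = 1)]" % new)
```

The second form does not depend on which prefix, if any, the document itself uses for the XHTML namespace, which is the point of the change.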
See http://lenzconsulting.com/namespaces-in-xslt/#perils_of_the_name_function for more details on the problems caused by using the name() function.
Hi,
Sorry for the delay in responding. I just confirmed that the name() function in lxml uses the prefix from the document source rather than the namespace mapping of the XPath expression.
So there is a bug, but namespace handling in cssselect is generally broken. See #9. It needs a rewrite and I know how to do it but it’s just low priority for me right now. Until I get to it, anyone willing to give it a go is welcome to do so.