tinkr
Namespace vignette
This will address #48. Here is the rendered version via `knitr::purl()` and `reprex::reprex()`.
Draft of vignette text (updated 2021-05-28)
library("tinkr")
library("magrittr")
library("commonmark")
library("xml2")
library("xslt")
library("purrr")
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:magrittr':
#>
#> set_names
Introduction
This document was written to address common confusions about XML namespaces and their implications for constructing XPath queries, adding new XML nodes, and converting XML to markdown. It is written for the user who is comfortable with XPath queries and wants to understand more about how to handle and manipulate the XML representation of their markdown.
Motivation
The underlying motivation for {tinkr} was to wrap the process of converting markdown documents to XML and back again. This process uses {commonmark} to translate the markdown to XML and {xml2} to read that XML in as a document.
xml <- commonmark::markdown_xml("## h1\n\ntext with `r 'code'`") %>%
xml2::read_xml()
xml
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <heading level="2">\n <text xml:space="preserve">h1</text>\n</heading>
#> [2] <paragraph>\n <text xml:space="preserve">text with </text>\n <code xml: ...
We use the {xslt} package to do the conversion from XML back to markdown.
xslt_style <- tinkr::stylesheet() %>% xml2::read_xml()
cat(xslt::xml_xslt(xml, xslt_style))
#> ## h1
#>
#> text with `r 'code'`
One of the downsides of this conversion is that commonmark provides a default namespace, which means that nodes in XPath queries must have a prefix that defines the namespace. For example, an XPath query to select all paragraphs that have executable R code looks like the following query:
xml2::xml_find_first(xml, "//d1:paragraph[d1:code[starts-with(text(), 'r ')]]")
#> {xml_node}
#> <paragraph>
#> [1] <text xml:space="preserve">text with </text>
#> [2] <code xml:space="preserve">r 'code'</code>
We add `d1` because that is the prefix {xml2} assigns to the default namespace.
xml2::xml_ns(xml)
#> d1 <-> http://commonmark.org/xml/1.0
The {tinkr} difference
The XML document that {tinkr} generates has no namespace by default because operations on an XML document without a namespace are easier than operations on one with a default or a prefixed namespace.
xml2::xml_ns_strip(xml)
xml2::xml_find_first(xml, "//paragraph[code[starts-with(text(), 'r ')]]")
#> {xml_node}
#> <paragraph>
#> [1] <text xml:space="preserve">text with </text>
#> [2] <code xml:space="preserve">r 'code'</code>
However, removing the namespace has implications for exporting XML objects because namespaces are important. For example, this namespace-less document can no longer be converted with our XSLT stylesheet, which expects a commonmark namespace, so the transformation emits nothing:
xslt_style <- tinkr::stylesheet() %>% xml2::read_xml()
cat(xslt::xml_xslt(xml, xslt_style))
To alleviate this, we add the namespace just before the document is converted in `tinkr::to_md()`.
xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
cat(xslt::xml_xslt(xml, xslt_style))
#> ## h1
#>
#> text with `r 'code'`
Read on to find out more about XML namespaces and their implications on your tinkering.
XML namespaces
XML namespaces are a lot like package namespaces in R: they allow you to avoid name clashes. For example, a node called table can represent data or furniture.
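As a minimal sketch of that clash, the document below holds two `table` nodes that differ only in their namespace. The namespace URIs and the `dat`/`fur` prefixes are made up for this illustration and do not refer to any real schema.

```r
library(xml2)

# Hypothetical namespace URIs and prefixes, invented for this illustration
doc <- read_xml('<root xmlns:dat="http://example.org/data"
      xmlns:fur="http://example.org/furniture">
  <dat:table>rows and columns</dat:table>
  <fur:table>four legs and a top</fur:table>
</root>')

# Same local name, different namespaces: each query matches one node
data_tables <- xml_find_all(doc, "//dat:table")
furniture_tables <- xml_find_all(doc, "//fur:table")

xml_text(data_tables)
xml_text(furniture_tables)
```

The prefix is what lets each query pick out one kind of table without matching the other.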
By default, nodes in XML do not have namespaces unless you give them one, which means that when you use XPath search, you can use the node names by default:
d <- xml2::read_xml("<document>
<paragraph>
<text>hello there</text>
<text> ello here</text>
</paragraph>
</document>")
xml2::xml_ns(d)
#> <->
xml2::xml_find_all(d, "//document")
#> {xml_nodeset (1)}
#> [1] <document>\n <paragraph>\n <text>hello there</text>\n <text> ello ...
xml2::xml_find_all(d, "//text[contains(text(), 'hello')]")
#> {xml_nodeset (1)}
#> [1] <text>hello there</text>
However, if a default namespace is added to a node, all of its descendants inherit that namespace, which affects your XPath expressions. Below, we add the commonmark namespace to the paragraph node.
d <- xml2::read_xml("<document>
<paragraph xmlns='http://commonmark.org/xml/1.0'>
<text>hello there</text>
<text> ello here</text>
</paragraph>
</document>")
xml2::xml_ns(d)
#> d1 <-> http://commonmark.org/xml/1.0
xml2::xml_find_all(d, "//document")
#> {xml_nodeset (1)}
#> [1] <document>\n <paragraph xmlns="http://commonmark.org/xml/1.0">\n <tex ...
The same XPath query as before no longer works: our call to `xml2::xml_find_all()` returns nothing.
xml2::xml_find_all(d, "//text[contains(text(), 'hello')]") # does not work
#> {xml_nodeset (0)}
When a namespace is specified with `xmlns=<URI>`, {xml2} assigns it a default namespace prefix, which is `d1`. Therefore, editing our XPath query like so will work:
xml2::xml_find_all(d, "//d1:text[contains(text(), 'hello')]")
#> {xml_nodeset (1)}
#> [1] <text>hello there</text>
But is it a good idea to use `d1` as a namespace prefix? No. The {xml2} documentation recommends renaming the namespace as soon as you read in a document and using the namespace object to semantically prefix your XPath expressions:
ns <- xml2::xml_ns(d)
ns <- xml2::xml_ns_rename(ns, d1 = "md")
ns
#> md <-> http://commonmark.org/xml/1.0
Now we can modify our XPath query to use `md` as a prefix, but we also need to supply the namespace object as an argument to the command:
xml2::xml_find_all(d, "//md:text[contains(text(), 'hello')]", ns)
#> {xml_nodeset (1)}
#> [1] <text>hello there</text>
You might be wondering why it isn’t recommended to prefix the namespace from the start, which would avoid the need to rename and specify the namespace. The reason is that a prefixed namespace only applies to nodes that carry that prefix. Here’s an example: let’s take our previous document and modify the namespace attribute to have an `md` prefix:
dc <- as.character(d)
cat(dc <- gsub("xmlns=", "xmlns:md=", dc))
#> <?xml version="1.0" encoding="UTF-8"?>
#> <document>
#> <paragraph xmlns:md="http://commonmark.org/xml/1.0">
#> <text>hello there</text>
#> <text> ello here</text>
#> </paragraph>
#> </document>
dc <- xml2::read_xml(dc)
xml2::xml_ns(dc)
#> md <-> http://commonmark.org/xml/1.0
xml2::xml_find_all(dc, "//document")
#> {xml_nodeset (1)}
#> [1] <document>\n <paragraph xmlns:md="http://commonmark.org/xml/1.0">\n < ...
We can see that the XPath query without the prefix works.
xml2::xml_find_all(dc, "//text[contains(text(), 'hello')]")
#> {xml_nodeset (1)}
#> [1] <text>hello there</text>
However, the XPath query with the prefix no longer works.
xml2::xml_find_all(dc, "//md:text")
#> {xml_nodeset (0)}
This may seem puzzling: when we renamed the prefix of a default namespace earlier, the prefixed XPath query worked, but now that the namespace explicitly defines the prefix, the same query returns nothing. Isn’t everything below the `paragraph` node in the commonmark namespace?
Notice that we can select both the `document` node and the `text` nodes without a prefix, even though the `text` nodes sit inside the paragraph that declares the commonmark namespace and the `document` node sits outside it. That is because none of these nodes actually has a namespace!
This is demonstrated when we add a new node with the `md` prefix:
pgp <- xml2::xml_find_first(dc, "//paragraph")
xml2::xml_add_child(pgp, "md:text", "hello from the md namespace")
dc
#> {xml_document}
#> <document>
#> [1] <paragraph xmlns:md="http://commonmark.org/xml/1.0">\n <text>hello there ...
Now there are three text nodes, one of which has the `md` namespace prefix. If we select the nodes with that prefix and without the prefix, we get one and two nodes, respectively.
xml2::xml_find_all(dc, "//md:text") # one node
#> {xml_nodeset (1)}
#> [1] <md:text>hello from the md namespace</md:text>
xml2::xml_find_all(dc, "//text") # two nodes
#> {xml_nodeset (2)}
#> [1] <text>hello there</text>
#> [2] <text> ello here</text>
If a namespace is defined in the document with a prefix, only nodes with that prefix are considered to be inside the namespace. This becomes important when we want to pass our XML document through a stylesheet that expects the incoming nodes to have a specific namespace, which is exactly how we transform the XML representation of markdown back to markdown.
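One way to check which nodes actually belong to a namespace is to ask {xml2} for namespace-qualified names. This is a small sketch, assuming a hand-written document with one unprefixed and one prefixed `text` node. It relies on the `ns` argument of `xml2::xml_name()`, which qualifies a name with its prefix only when the node has a namespace.

```r
library(xml2)

# A prefixed namespace is declared on the root, but only one child uses it
doc <- read_xml("<document xmlns:md='http://commonmark.org/xml/1.0'>
  <text>no prefix</text>
  <md:text>md prefix</md:text>
</document>")

ns <- xml_ns(doc)

# Nodes with a namespace are reported with their prefix;
# nodes without a namespace keep their bare name
qualified <- xml_name(xml_children(doc), ns)
qualified
```

Printing qualified names like this can be a quick sanity check before handing a document to a stylesheet that matches on namespaced nodes.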
Commonmark
The {tinkr} package streamlines the process of converting markdown to XML and back again. We use `commonmark::markdown_xml()` as a starting point to generate valid XML:
cat(cmk <- commonmark::markdown_xml("this is a **test**"))
#> <?xml version="1.0" encoding="UTF-8"?>
#> <!DOCTYPE document SYSTEM "CommonMark.dtd">
#> <document xmlns="http://commonmark.org/xml/1.0">
#> <paragraph>
#> <text xml:space="preserve">this is a </text>
#> <strong>
#> <text xml:space="preserve">test</text>
#> </strong>
#> </paragraph>
#> </document>
xml <- xml2::read_xml(cmk)
xml
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
Commonmark uses a default namespace
You can see from the commonmark output that it has a default namespace that resolves to `http://commonmark.org/xml/1.0`, which means that we need to use the default namespace prefix if we want to munge the data:
xml2::xml_find_all(xml, "//d1:text")
#> {xml_nodeset (2)}
#> [1] <text xml:space="preserve">this is a </text>
#> [2] <text xml:space="preserve">test</text>
Using a semantic prefix with the default namespace
To make things more semantic, we could rename the namespace to have the “md” prefix and carry around that object. Note: an `xml_namespace` object is a named character vector, so we can create it with `structure()` and use it to write semantically sensible XPath queries:
ns <- structure(c(md = "http://commonmark.org/xml/1.0"), class = "xml_namespace")
xml2::xml_find_all(xml, "//md:text", ns)
#> {xml_nodeset (2)}
#> [1] <text xml:space="preserve">this is a </text>
#> [2] <text xml:space="preserve">test</text>
Of course, now if we want to make any semantic XPath query, we need to include both a prefix and a namespace object.
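To cut down on the repetition, one option is to wrap the query function so that the namespace object is supplied automatically. The sketch below is only an illustration: `find_md()` and `md_ns` are hypothetical names, not part of {tinkr} or {xml2}, and the commonmark-style XML is written out by hand so that the example needs only {xml2}.

```r
library(xml2)

# Commonmark-style XML written by hand for this sketch
xml <- read_xml('<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph><text>this is a </text><strong><text>test</text></strong></paragraph>
</document>')

# Rename the default prefix (d1) to something semantic
md_ns <- xml_ns_rename(xml_ns(xml), d1 = "md")

# find_md() is a hypothetical helper: it forwards to xml_find_all()
# and fills in the namespace object so callers only write the XPath
find_md <- function(x, xpath, ns = md_ns) {
  xml_find_all(x, xpath, ns = ns)
}

texts <- find_md(xml, "//md:text")
xml_text(texts)
```

The prefix still has to appear in every XPath expression; the wrapper only removes the need to pass the namespace object each time.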
Transforming XML to markdown with XSLT
The commonmark namespace allows us to transform our document to markdown using an XSLT stylesheet, which is—that’s right—an XML document:
sty <- xml2::read_xml(tinkr::stylesheet())
sty
#> {xml_document}
#> <stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:md="http://commonmark.org/xml/1.0">
#> [1] <xsl:import href="xml2md.xsl"/>
#> [2] <xsl:template match="/">\n <xsl:apply-imports/>\n</xsl:template>
#> [3] <xsl:output method="text" encoding="utf-8"/>
#> [4] <xsl:template match="md:emph[@asis='true']">\n <!-- \n Multiple ...
#> [5] <xsl:template match="md:text[@asis='true']">\n <xsl:value-of select="st ...
#> [6] <xsl:template match="md:link[@rel] | md:image[@rel]">\n <xsl:if test="s ...
#> [7] <xsl:template match="md:link[@anchor]">\n <xsl:if test="self::md:image" ...
#> [8] <xsl:template match="md:table">\n <xsl:apply-templates select="." mode= ...
#> [9] <xsl:variable name="minLength">3</xsl:variable>
#> [10] <xsl:variable name="maxLength">\n <xsl:for-each select="//md:table_head ...
#> [11] <xsl:template name="n-times">\n <xsl:param name="n"/>\n <xsl:param nam ...
#> [12] <xsl:template match="md:table_header">\n <xsl:text>| </xsl:text>\n <xs ...
#> [13] <xsl:template match="md:table_cell">\n <xsl:variable name="cell" select ...
#> [14] <xsl:template match="md:table_row">\n <xsl:text>| </xsl:text>\n <xsl:a ...
#> [15] <xsl:template match="md:table_row">\n <xsl:text>| </xsl:text>\n <xsl:a ...
#> [16] <xsl:template match="md:strikethrough">\n <xsl:text>~~</xsl:text>\n <x ...
Each `xsl:template` node in this stylesheet matches against a specific node in the commonmark namespace (prefix: `md`) and emits text based on that node. This allows us to write back to markdown:
cat(xslt::xml_xslt(xml, sty))
#> this is a **test**
In this way, we can programmatically transform the content of the markdown. In this example, we change the `**test**` to an inline R code chunk that emits `_test_`.
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()
xml2::xml_find_all(xml, "//md:strong", ns) %>%
xml2::xml_set_name("code") %>%
xml2::xml_set_text("r cat('_test_')")
#> {xml_nodeset (1)}
#> [1] <code>\n <text xml:space="preserve">r cat('_test_')</text>\n</code>
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
#> this is a `r cat('_test_')`
Perils: adding nodes
A default namespace is all fun and games until you need to add new nodes. Take, for example, the situation where we want to add a code block. In commonmark, that is a `code_block` node with an `info` attribute stating the language; the text inside the node is the code.
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <code_block info="{r}">1 + rnorm(1)\n</code_block>
xml2::xml_find_all(xml, "//md:code_block", ns)
#> {xml_nodeset (0)}
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
#> this is a **test**
By all appearances, the node was added correctly, but because we did not specify a namespace, it is not recognized as part of the `md` namespace even though we added it as a child of the document. The best way to handle this situation is to reparse the document:
xml %>%
as.character() %>%
xml2::read_xml() %>%
xslt::xml_xslt(sty) %>%
cat()
#> this is a **test**
#>
#> ```{r}
#> 1 + rnorm(1)
#> ```
We could also try adding the namespace to the node when we add it:
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml()
xml2::xml_add_child(xml, "code_block",
xmlns = "http://commonmark.org/xml/1.0", info = "{r}", "1 + rnorm(1)\n")
xml
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <code_block xmlns="http://commonmark.org/xml/1.0" info="{r}">1 + rnorm(1) ...
xml2::xml_find_all(xml, "//md:code_block", ns)
#> {xml_nodeset (1)}
#> [1] <code_block xmlns="http://commonmark.org/xml/1.0" info="{r}">1 + rnorm(1) ...
cat(xslt::xml_xslt(xml, sty))
#> this is a **test**
#>
#> ```{r}
#> 1 + rnorm(1)
#> ```
It works, but let’s take a look at our namespaces:
xml2::xml_ns(xml)
#> d1 <-> http://commonmark.org/xml/1.0
#> d2 <-> http://commonmark.org/xml/1.0
Every node we add with an unprefixed namespace declaration adds another default namespace, so if we are doing a lot of substitution, we can end up with hundreds of namespaces.
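The growth can be sketched with a short loop. This example assumes the behaviour shown above, where a single added node produced a second default namespace, and uses a tiny hand-written document in place of commonmark output. It compares namespace counts without asserting exact numbers.

```r
library(xml2)

cm_uri <- "http://commonmark.org/xml/1.0"

# Minimal stand-in for a commonmark document
xml <- read_xml(sprintf('<document xmlns="%s"><paragraph/></document>', cm_uri))
n_before <- length(xml_ns(xml))

# Add three nodes, each carrying its own xmlns attribute
for (i in 1:3) {
  xml_add_child(xml, "code_block", xmlns = cm_uri, info = "{r}", "1 + 1\n")
}

n_after <- length(xml_ns(xml))
c(before = n_before, after = n_after)
```

All of the extra entries point at the same URI, so they carry no new information. They only add clutter to the namespace object.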
No Namespace?
What if we just tried to use no namespace?
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml() %>%
xml2::xml_ns_strip()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
#> {xml_document}
#> <document>
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <code_block info="{r}">1 + rnorm(1)\n</code_block>
xml2::xml_find_all(xml, "//code_block")
#> {xml_nodeset (1)}
#> [1] <code_block info="{r}">1 + rnorm(1)\n</code_block>
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
We can now add new nodes and use XPath without namespace prefixes or objects, but we have lost the ability to use our stylesheet :(
But! Maybe we can do this by adding the namespace at the last minute!
xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
xml
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <code_block info="{r}">1 + rnorm(1)\n</code_block>
cat(xslt::xml_xslt(xml, sty))
#> this is a **test**
#>
#> ```{r}
#> 1 + rnorm(1)
#> ```
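The last-minute approach can be packaged as a small function. What follows is a hedged sketch, not the implementation of `tinkr::to_md()`: `to_markdown()` is a hypothetical helper, and reparsing to get a copy is a design choice made here so that the caller's document stays namespace-free.

```r
library(xml2)
library(xslt)

# to_markdown() is a hypothetical helper written for illustration;
# tinkr::to_md() is the real entry point in {tinkr}
to_markdown <- function(xml, stylesheet = xml2::read_xml(tinkr::stylesheet())) {
  # xml_set_attr() modifies documents in place, so work on a reparsed copy
  copy <- xml2::read_xml(as.character(xml))
  # Add the commonmark namespace only for the transformation
  xml2::xml_set_attr(copy, "xmlns", "http://commonmark.org/xml/1.0")
  xslt::xml_xslt(copy, stylesheet)
}

xml <- xml2::xml_ns_strip(
  xml2::read_xml(commonmark::markdown_xml("this is a **test**"))
)
md <- to_markdown(xml)
cat(md)
```

Because the namespace is added to a copy, prefix-free XPath queries such as `//paragraph` should keep working on the original `xml` after the call.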
Harnessing the power of namespaces
When you know that a prefixed namespace applies only to nodes with that prefix, and that all other nodes have no namespace, you can add nodes that serve as anchors in your document or hide markdown elements. Let’s say we wanted to hide all markdown elements except for code blocks. One way we could do this is to set up a namespace and add a prefix to all non-code-block nodes:
xml <- commonmark::markdown_xml("this is a **test**") %>%
xml2::read_xml() %>%
xml2::xml_ns_strip()
xml2::xml_add_child(xml, "code_block", info = "{r}", "1 + rnorm(1)\n")
xml
#> {xml_document}
#> <document>
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <code_block info="{r}">1 + rnorm(1)\n</code_block>
# Set the prefixed namespace in your document
xml2::xml_set_attr(xml, "xmlns:tnk", "https://docs.ropensci.org/tinkr")
# Find all nodes that are not code blocks
nocode <- xml2::xml_find_all(xml, ".//*[not(self::code_block)]")
nocode
#> {xml_nodeset (4)}
#> [1] <paragraph>\n <text xml:space="preserve">this is a </text>\n <strong>\n ...
#> [2] <text xml:space="preserve">this is a </text>
#> [3] <strong>\n <text xml:space="preserve">test</text>\n</strong>
#> [4] <text xml:space="preserve">test</text>
# Change the namespace of these nodes
purrr::walk(nocode, xml2::xml_set_namespace, "tnk", "https://docs.ropensci.org/tinkr")
xml
#> {xml_document}
#> <document xmlns:tnk="https://docs.ropensci.org/tinkr">
#> [1] <tnk:paragraph>\n <tnk:text xml:space="preserve">this is a </tnk:text>\n ...
#> [2] <code_block info="{r}">1 + rnorm(1)\n</code_block>
xml2::xml_set_attr(xml, "xmlns", "http://commonmark.org/xml/1.0")
sty <- xml2::read_xml(tinkr::stylesheet())
cat(xslt::xml_xslt(xml, sty))
#> ```{r}
#> 1 + rnorm(1)
#> ```
Conclusion
While developing {tinkr} we[1] struggled a lot with understanding namespaces. This guide is our attempt at demystifying working with namespaces in {xml2}. For the casual user of {tinkr} who is interested in extracting data from markdown documents, this guide is not very useful, but we hope that it proves useful for the user who wants to use {tinkr} for cleaning and standardizing their markdown documents.
[1] Well, mostly just Zhian.
Created on 2021-05-28 by the reprex package (v2.0.0)