camaro
Issue with whitespace
I'm having an issue with whitespace and I'm wondering if Camaro is handling it as-designed, or if I should look to another package to help with this.
Given this chunk of XML (truncated, but you get the idea):
<body>
…
he conducted research in immunology and rheumatology.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President …
Using this to construct my template…
body: "article/body",
I get this result…
he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
I do want to take the entire text of the body as just text, without any tags preserved. Should I expect to see a space character between where tags were stripped, or should it be concatenated like this?
Since the HTML-like data in your example is valid XML, it gets parsed along with the rest of the document. So when you query article/body, instead of getting a node with string content inside, you get a node with child nodes inside. Getting the string value of that node strips all the tags inside it and concatenates the remaining text.
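To illustrate why the text runs together, here is a minimal sketch of how an XPath-style string value is computed: the text nodes under the queried node are joined in document order with nothing in between. The tree shape and the `stringValue` helper are made up for illustration and are not camaro's internals; the sketch also assumes whitespace-only text between tags is dropped, which is what the reported output suggests.

```javascript
// Simplified stand-in for a parsed XML tree (NOT camaro's internal representation).
// Elements are { tag, children }; text nodes are plain strings.
const body = {
  tag: 'body',
  children: [
    { tag: 'p', children: ['he conducted research in immunology and rheumatology.'] },
    {
      tag: 'sec',
      children: [
        { tag: 'title', children: ['Eye on 45'] },
        {
          tag: 'sec',
          children: [
            { tag: 'title', children: ['Protests take shape'] },
            { tag: 'p', children: ['As U.S. President'] },
          ],
        },
      ],
    },
  ],
}

// Hypothetical helper: concatenate all descendant text nodes in document order,
// with no separator, which is how an XPath string value behaves.
function stringValue(node) {
  if (typeof node === 'string') return node
  return node.children.map(stringValue).join('')
}

console.log(stringValue(body))
```

Because the join uses an empty separator, the end of one element's text touches the start of the next, which matches the concatenated result in the question.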
The proper way to put data like this in XML is to wrap it inside CDATA, like this:
const transform = require('camaro')
const xml = `
<xml>
<html>
<![CDATA[
<body>
<p>
...he conducted research in immunology and rheumatology
</p>
<sec disp-level="1" />
<title>Eye on 45</title>
<sec disp-level="2" />
<title>Protests take shape</title>
</body>
]]>
</html>
</xml>
`
// Map the output key `html` to the string value of xml/html (the CDATA content)
const result = transform(xml, {
  html: 'xml/html'
})
console.log(JSON.stringify(result, null, 2))
The XML I'm working with is as proper as it's going to get. This example uses JATS, which is a highly structured and quite strict DTD used in scholarly publishing.
It's possible my example wasn't entirely clear. I do want to strip all tags. In this case, I'm only interested in the text.
he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
                                                     ^        ^                  ^
I've marked where removed tags resulted in text being concatenated. Would you consider having a single space character be placed between removed tags instead of concatenating the text, maybe as an option?
I see. You only want a space character put in place of the removed tags. For now, that's not possible because I don't check whether the path is a leaf node or contains child nodes.
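In the meantime, one possible workaround outside camaro is to turn the tags into spaces yourself before keeping the text. The sketch below is only an assumption about how that could look: `textWithSpaces` is a hypothetical helper, not a camaro option, and it works on the raw XML string with regular expressions rather than a real parser.

```javascript
// Hypothetical helper, not part of camaro: replace each tag with a space,
// then collapse whitespace runs so adjacent pieces of text stay separated.
function textWithSpaces(xmlFragment) {
  return xmlFragment
    .replace(/<[^>]+>/g, ' ') // every tag becomes a single space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace into one space
    .trim()
}

// Fragment modeled on the example from the question, closed so it is well-formed.
const fragment = `
<p>he conducted research in immunology and rheumatology.</p>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President</p>
</sec>
</sec>
`

console.log(textWithSpaces(fragment))
// he conducted research in immunology and rheumatology. Eye on 45 Protests take shape As U.S. President
```

This is a rough approach: it does not decode entities, does not handle CDATA sections or comments, and will also insert spaces around inline tags in the middle of a sentence, so it may need adjusting for real JATS content.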
OK, thank you for considering!