tXml
Fails to parse XML comment as stream
When an SVG XML file begins with an XML comment, the SVG file is not parsed (the stream ends immediately):
```xml
<!-- Test -->
<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>
```
Without the comment, parsing as a stream works fine:
```xml
<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>
```
That looks like a bug, thanks. I will have a test ready by the weekend.
I just added comment support in streams; that was missing: https://github.com/TobiasNickel/tXml/commit/5e55ea2fa167a16032b640dc31a9d551c0f3ce5b#diff-39fbd46f4381411507cbf464fd4fea02R491
And I added some more tests. But first I want to help you with your task at hand: I think it is better to read the complete SVG file and parse it in one go, even if the file is 20 MB. If you need to do lots of processing, I would advise using worker processes, maybe with piscina.
So, back to the comments-in-stream issue. I found this SVG file currently only works when the SVG file has a trailing comma. You have to know that when parsing the SVG without skipping the opening svg tag, the async loop will still only do one round, giving you the entire SVG. If you really need comments in the file, can you add them after the opening svg tag? When you skip the initial tag, you can get its children one by one. But when the elements are deeply nested, it likely makes more sense to parse the complete file in one go.
The stream was implemented and debugged using the OSM planet file. There is the initial type definition, then a general `osm` tag followed by the list of all `node` and `way` tags, in a pretty much flat manner. It has `tag` tags nested, but that is OK. For parsing the big Wikipedia export file, the structure is quite similar: some initial stuff followed by articles, followed by history information. So big database exports are usually in a pretty much flat format.
The current implementation ignores the comments. I will run some more tests before publishing to npm. Thanks for opening this issue; let me know your thoughts.
@TobiasNickel: The current svgo uses the `sax` parser, which is sadly not maintained anymore, and issues with characters/whitespace/comments have accumulated, resulting in numerous issues (see the linked issues in that PR, which tries to use `saxes` but ran into performance issues).
This is the parsing part of SVGO: https://github.com/svg/svgo/blob/master/lib/svgo/svg2js.js#L26
It transforms the SVG to the svgo AST, then svgo plugins modify that AST, and in the third and last step the svgo AST is stringified back to an SVG. As you can see, since `sax` is used, some kind of streaming must be handled. An advantage of streaming is that for each node some CSS-related processing is done. It is fairly quick, but doing this in the parsing step should be better for performance than building the whole SVG DOM tree and recursively traversing it afterwards.
If you could drop in/integrate your tXml into the svg2js.js part (ideally by PR), those previously mentioned SVG parsing issues with `sax` should be fixed, and stay fixed in the future (as this project is maintained), and maybe even a performance boost achieved. The stringification of the svgo AST is done separately, so the parser only has to parse the SVG; no AST translation back to string is required.
It would be cool if svgo could profit from this library. Still, the streaming works much differently than in other libraries: not every tag, text, or attribute is an event/item, but complete sub-trees.
At the weekend I did two things. First, fork svgo and re-implement svg2js.js. Second, an update to the parser was needed: throw an error when some tag is broken, and extend parsing for the doctype. With it I was currently able to fulfill all of the lib's own tests, but not the plugin tests yet.
@strarsis what do you think?
For now the changes are in a branch, as I think they are breaking changes in txml. I need some more tests. Now that there are a number of users of this lib, we have to be more careful. However, with this work, this parser can get more stable.
@TobiasNickel: Awesome!
- The tests run nice and fast, that's great!
- From a first glance some tests fail because text is missing and something with the line endings is amiss.
For example the test `sortDefsChildren.01` fails because the `<text/>` elements are missing their referenced text. And it seems that nodes that contain prefixed attributes are skipped (e.g. `xlink:href`). Then there are tests that just fail because the line breaks or white spaces are slightly different; these tests can be adjusted to pass by updating the expected strings.
- Concerning the backward-compatibility of this parser: node-level streaming parsing is quite useful; maybe you could try to find a common denominator for both streaming approaches (sub-trees and individual nodes) and then offer two different streaming methods/modes?
- The following parser issues/requirements cropped up and are unresolved because of `sax` (though most of these issues are probably not a problem with tXml):
Hi, just want to say this matter is not forgotten. Last week I was working on a blog post about Web Push. Today I continued a little with debugging the svgo fork; I pass about 15 more plugin tests.
Hi, a little update: today I was able to run all tests of svgo!!! Before making a PR, I first have to make an update to txml; absolutely, some update to the parser was needed. Working through this has led to some very good changes to txml, which I am going to publish soon as version 4.
@TobiasNickel: This is awesome! Finally some solid progress is possible with svgo.
Wow, version 4 of txml is published. Now it does not directly export the parse function, but an object with a parse function. This is for better compatibility between Node and TypeScript modules.
and here is the PR to svgo: https://github.com/svg/svgo/pull/1301
only one file changed and all its tests passed.
And, as for this issue, there is now the `keepComments` option. It can be used for `parse` and for `transformStream`.
I wasn't sure about opening another issue, but processing instructions also cause `transformStream` to not return anything. Use this XML as an example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<a>b</a>
```
Hi @santialbo. Sorry for my late reply, I was kind of busy during Chinese New Year.
The parsing of processing instructions has been quite a new feature for the main `parse` method since December. For the transformStream, txml would have to handle the parsing of those instructions separately.
If your file is not too big, it is OK to parse the complete thing at once, as long as it is only one file at a time (even up to ~50-100 MB; if you give Node.js enough memory, even parsing 2 GB would be possible, though as you can imagine not optimal).
Usually large XML files (Wikipedia export or OpenStreetMap planet file) have a structure like this:
```xml
<? processing instructions ?>
<somewrapper>
  <item><child>data</child></item>
  <item><child>data1</child></item>
  <item><child>data2</child></item>
  <item><child>data3</child></item>
</somewrapper>
```
When parsing large files with txml, you need to first skip over the preamble; in my example that is everything until after `<somewrapper>`. Using a for await loop you will then get one `item` at a time, never a `child`; the child will always be in the `children` property of the item. Reading `child` and `data` from the item is custom logic developed by you.
As for now, I am not sure: are there XML files with a structure without a wrapper?
```xml
<? processing instructions ?>
<item><child>data</child></item>
<item><child>data1</child></item>
<item><child>data2</child></item>
<item><child>data3</child></item>
```
Currently you can still parse the `item`s with txml's `transformStream`, but not the processing instructions:
```javascript
const processingInstructionStream = fs.createReadStream(files.processingInstructionStream)
  .pipe(tXml.transformStream(`<?xml version="1.0" encoding="UTF-8"?>`));
for await (let element of processingInstructionStream) {
  console.log(element);
}
```
Note that the first argument to `transformStream` is an offset, as a number or a string (if it is a string, only its length is used).
I hope this explanation is clear. Can I ask what kind of XML documents you want to process?
Hi @TobiasNickel, first, thanks for this awesome lib. I have a pretty large XML file, ~200 MB or more, and trying to use `transformStream()`, it returns no elements when the file has an `xml` declaration at the top:
```xml
<?xml version="1.0" encoding="utf8"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
```
```javascript
const xmlStream = fs
  .createReadStream('note.xml')
  .pipe(txml.transformStream());
for await (let element of xmlStream) {
  console.log(element);
}
```
Also, if the file is formatted without the `xml` declaration,
```xml
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
```
like with spaces and tabs, it only works when `'<'` or `'0'` is passed to `transformStream()`:
```javascript
const xmlStream = fs
  .createReadStream('note.xml')
  .pipe(txml.transformStream('<')); // this works

const xmlStream = fs
  .createReadStream('note.xml')
  .pipe(txml.transformStream('0')); // this works

const xmlStream = fs
  .createReadStream('note.xml')
  .pipe(txml.transformStream()); // this does not work
```
Could you explain more how to use this parameter, and in which cases?
The idea of the offset (the first argument) is to skip the preamble that is not data. In your case that would be `'<?xml version="1.0" encoding="utf8"?>'` or `'<?xml version="1.0" encoding="utf8"?>'.length`.
Does this work for you?