
Fails to parse XML comment as stream

Open strarsis opened this issue 4 years ago • 14 comments

When an SVG XML file begins with an XML comment, the SVG file is not parsed (the stream ends immediately):

<!-- Test -->
<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>

Without the comment, parsing as stream works fine:

<svg height="200" width="500">
  <polyline points="20,20 40,25 60,40 80,120 120,140 200,180" style="fill:none;stroke:black;stroke-width:3" />
</svg>

strarsis avatar Sep 07 '20 14:09 strarsis

That looks like a bug, thanks. I will test it before the weekend.

TobiasNickel avatar Sep 08 '20 13:09 TobiasNickel

I just added comment support in streams, that was missing: https://github.com/TobiasNickel/tXml/commit/5e55ea2fa167a16032b640dc31a9d551c0f3ce5b#diff-39fbd46f4381411507cbf464fd4fea02R491

And I ran some more tests. But first I want to help you with your task at hand. I think it is better to read the complete SVG file and parse it in one go, even if the file is 20 MB. If you need to do lots of processing, I would advise using worker processes, maybe with piscina.

So, back to the comments-in-stream issue. I found that this SVG file currently only works when it has a trailing comma. You have to know that when parsing the SVG without skipping the opening svg tag, the async loop will still only do one round, giving you the entire SVG. If you really need comments in the file, can you add them after the opening svg tag? When you skip the initial tag, you can get its children one by one (see the sketch below). But when the elements are deeply nested, it likely makes more sense to parse the complete file in one go.
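To illustrate both approaches, here is a rough sketch, assuming the txml API discussed in this thread (parse for whole documents, transformStream with an offset); drawing.svg is only a placeholder file name:

const fs = require('fs');
const txml = require('txml');

// Option 1: read the whole SVG and parse it in one go
// (fine even for files around 20 MB)
const svgDom = txml.parse(fs.readFileSync('drawing.svg', 'utf8'));

// Option 2: skip the opening <svg> tag and stream its children one by one
async function streamSvgChildren() {
  const children = fs
    .createReadStream('drawing.svg')
    .pipe(txml.transformStream('<svg height="200" width="500">'));
  for await (const child of children) {
    console.log(child); // e.g. the <polyline> element
  }
}
streamSvgChildren();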

The stream was implemented and debugged using the OSM planet file. There is the initial type definition, then a general osm tag followed by the list of all node and way tags, in a pretty much flat manner. It has tag tags nested, but that is ok. For parsing the big Wikipedia export file, the structure is quite similar: some initial stuff followed by articles, followed by history information. So big database exports are usually in a pretty much flat format.

The current implementation ignores the comments. I will run some more tests before publishing to npm. Thanks for opening this issue, let me know your thoughts.

TobiasNickel avatar Sep 12 '20 10:09 TobiasNickel

@TobiasNickel: The current svgo uses the sax parser, which is sadly not maintained anymore, and issues with characters/whitespace/comments have accumulated, resulting in numerous open issues (see the linked issues in that PR, which tries to use saxes but ran into performance issues).

This is the parsing part of SVGO: https://github.com/svg/svgo/blob/master/lib/svgo/svg2js.js#L26. It transforms the SVG to the svgo AST, then svgo plugins modify that AST, and in the third and last step the svgo AST is stringified back to an SVG. As you can see, because sax is used, some kind of streaming must be handled. An advantage of using streaming is that for each node some CSS-related processing is done. It is fairly quick, but doing this in the parsing step should be better for performance than building the whole SVG DOM tree and recursively traversing it afterwards.

If you could drop in/integrate your tXml into the svg2js.js part (ideally by PR), those previously mentioned SVG parsing issues with sax should be fixed now and in the future (as this project is maintained), and maybe even a performance boost could be achieved. The stringification of the svgo AST is done separately, so the parser only has to parse the SVG; no AST-to-string translation is required.

strarsis avatar Sep 12 '20 10:09 strarsis

It would be cool if svgo could profit from this library. Still, the streaming works much differently than in other libraries: not every tag, text, or attribute is an event/item, but complete sub-trees are.

At the weekend I did two things. First, I forked svgo and re-implemented svg2js.js. Second, the parser needed an update: throw an error when a tag is broken, and extend parsing for the doctype. With that, I was able to fulfill all of the lib's own tests, but not the plugin tests yet.

@strarsis what do you think?

For now the changes are in a branch, as I think they are breaking changes in txml. I need some more tests. Now that there are a number of users of this lib, we have to be more careful. However, with this work the parser can get more stable.

TobiasNickel avatar Sep 13 '20 23:09 TobiasNickel

@TobiasNickel: Awesome!

  1. The tests run nice and fast, that's great!
  2. From a first glance, some tests fail because text is missing and something with the line endings is amiss. For example, the test sortDefsChildren.01 fails because the <text/> elements are missing their referenced text. It also seems that nodes that contain prefixed attributes are skipped (e.g. xlink:href). Then there are tests that just fail because the line breaks or whitespace are slightly different; these can be adjusted to pass by updating the expected strings.
  3. Concerning the backward-compatibility of this parser: Node-level streaming parsing is quite useful, maybe you could try to find a common denominator for both streaming approaches (sub-trees and individual nodes) and then offer two different streaming methods/modes?
  4. The following parser issues/requirements cropped up and are unresolved because of sax (though most of these issues are probably not a problem with tXml):

strarsis avatar Sep 14 '20 00:09 strarsis

Hi, just want to say this matter is not forgotten. Last week I was working on a blog post about Web-Push. Today I continued a little with debugging the svgo fork; I pass about 15 more plugin tests now.

TobiasNickel avatar Oct 01 '20 01:10 TobiasNickel

Hi, little update: today I was able to run all tests of svgo!!! Before making a PR, I first have to make an update to txml; some update to the parser was absolutely needed. Working through this has led to some very good changes to txml that I am going to publish soon as version 4.

TobiasNickel avatar Oct 31 '20 13:10 TobiasNickel

@TobiasNickel: This is awesome! Finally some solid progress is possible with svgo.

strarsis avatar Oct 31 '20 17:10 strarsis

Wow, version 4 of txml is published. It no longer directly exports the parse function, but an object with a parse function. This is for better compatibility between Node and TypeScript modules.
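As a minimal sketch of what that export change means for callers (the pre-4 import style shown in the comment is my assumption):

// before version 4 (assumed): the module itself was the parse function
// const parse = require('txml');
// const dom = parse('<note><to>Tove</to></note>');

// from version 4 on: the module is an object exposing parse (and transformStream)
const txml = require('txml');
const dom = txml.parse('<note><to>Tove</to></note>');
console.log(dom[0].tagName); // 'note'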

and here is the PR to svgo: https://github.com/svg/svgo/pull/1301

only one file changed and all its tests passed.

TobiasNickel avatar Nov 15 '20 09:11 TobiasNickel

And, as for this issue, there is now the keepComments option. It can be used for parse and for transformStream.
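A small sketch of how the option might be used; passing parse options as the second argument of transformStream is my assumption, based only on the statement above that keepComments works for both:

const fs = require('fs');
const txml = require('txml');

// keep comments when parsing a whole document
const dom = txml.parse('<!-- Test --><svg height="200" width="500"></svg>', {
  keepComments: true,
});

// keep comments when streaming (second argument assumed to take parse options)
const stream = fs
  .createReadStream('drawing.svg')
  .pipe(txml.transformStream(0, { keepComments: true }));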

TobiasNickel avatar Nov 15 '20 09:11 TobiasNickel

I wasn't sure about opening another issue, but processing instructions also cause transformStream to not return anything. Use this XML as an example:

<?xml version="1.0" encoding="UTF-8"?>
<a>b</a>

santialbo avatar Feb 11 '21 08:02 santialbo

Hi @santialbo. Sorry for my late reply, I was kind of busy during Chinese New Year.

The parsing of processing instructions is quite a new feature; the main parse method has supported it only since December. For transformStream, txml would have to handle the parsing of those instructions separately.

If your file is not too big (even up to ~50-100 MB is fine, and if you give Node.js enough memory, even parsing 2 GB would be possible, though as you can imagine not optimal), it is ok to parse the complete thing at once (as long as it is only one file at a time).
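For that smaller case, a minimal sketch (data.xml is a placeholder) of reading and parsing the whole file at once:

const fs = require('fs');
const txml = require('txml');

// parse the complete document in one go, processing instructions included
const dom = txml.parse(fs.readFileSync('data.xml', 'utf8'));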

Usually, large XML files (the Wikipedia export or the OpenStreetMap planet file) have a structure like this:

<? processing instructions ?>
<somewrapper>
  <item><child>data</child></item>
  <item><child>data1</child></item>
  <item><child>data2</child></item>
  <item><child>data3</child></item>
</somewrapper>

When parsing large files with txml, you first need to skip over the preamble, in my example everything up to and including <somewrapper>. Using a for await loop you will then get one item at a time. You never get a child directly, but child will always be in the children property of the item. Reading child and data out of the item is custom logic developed by you.
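A minimal sketch of that pattern (the file name and the exact preamble string are placeholders; since a string offset is only used for its length, it has to match the real preamble of the file):

const fs = require('fs');
const txml = require('txml');

async function main() {
  // skip the processing instruction and the opening <somewrapper> tag
  const preamble = '<? processing instructions ?>\n<somewrapper>';
  const items = fs
    .createReadStream('somewrapper.xml')
    .pipe(txml.transformStream(preamble));

  for await (const item of items) {
    // each emitted element is a complete <item> sub-tree;
    // its <child> nodes live in item.children
    console.log(item.children);
  }
}

main();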

As for now, I am not sure: are there XML files with a structure without a wrapper, like this?

<? processing instructions ?>
<item><child>data</child></item>
<item><child>data1</child></item>
<item><child>data2</child></item>
<item><child>data3</child></item>

Currently you can still parse the items with txml's transformStream by skipping the processing instruction via the offset, but the processing instruction itself is not parsed:

  const processingInstructionStream = fs.createReadStream(files.processingInstructionStream)
    .pipe(tXml.transformStream(`<?xml version="1.0" encoding="UTF-8"?>`));
  for await(let element of processingInstructionStream) {
    console.log(element)
  }

Note that the first argument to transformStream is an offset, given as a number or a string (if it is a string, only its length is used).

I hope this explanation is clear. Can I ask what kind of XML documents you want to process?

TobiasNickel avatar Feb 18 '21 07:02 TobiasNickel

Hi @TobiasNickel, first, thanks for this awesome lib. I have a pretty large XML file, ~200 MB or more, and when trying to use transformStream(), it returns no elements when the file has an XML declaration at the top.

<?xml version="1.0" encoding="utf8"?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream())

for await (let element of xmlStream) {
    console.log(element)
}

Also, if the file is formatted without the XML declaration,

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

i.e. formatted with spaces and tabs, it only works when '<' or 0 is passed to transformStream():

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream('<')) // this works

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream('0')) // this works

const xmlStream = fs
    .createReadStream('note.xml')
    .pipe(txml.transformStream()) // this does not work

Could you explain more about how to use this parameter, and in which cases?

ambianBeing314 avatar Jun 28 '22 10:06 ambianBeing314

The idea of the offset (the first argument) is to skip the preamble that is not data. In your case that would be: '<?xml version="1.0" encoding="utf8"?>'

or '<?xml version="1.0" encoding="utf8"?>'.length.
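Applied to the note.xml example from above, that would look roughly like this:

const fs = require('fs');
const txml = require('txml');

async function main() {
  const declaration = '<?xml version="1.0" encoding="utf8"?>';
  const xmlStream = fs
    .createReadStream('note.xml')
    // passing the declaration string (or declaration.length) skips it
    .pipe(txml.transformStream(declaration));

  for await (const element of xmlStream) {
    console.log(element); // the parsed <note> sub-tree
  }
}

main();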

does this work for you?

TobiasNickel avatar Jul 13 '22 08:07 TobiasNickel