node-xml2js

Parse a section at a time - very large file

Open psmod2 opened this issue 6 years ago • 5 comments

Hi,

Sorry if I'm asking this question in the wrong place:

I have a very large XML file - 10GB - and I'm using the following code to parse it:

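// Setup assumed here; it was not shown in the original post.
const fs = require('fs');
const xml2js = require('xml2js');

const parser = new xml2js.Parser();
let pData; // shared with addVertices

function parsexml(callback) {
    fs.readFile(__dirname + '/sample_larger.xml', function (err, data) {
        if (err) return callback(err);
        parser.parseString(data, function (err, result) {
            if (err) return callback(err);
            pData = result;
            console.log('====> Done Parsing XML');
            callback(null);
        });
    });
}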

Then I have a second function, addVertices, which uses pData to make a Gremlin query that adds the data to my Azure Cosmos DB.

My question: copying the 10GB into that variable pData seems like a bit of a waste. Is it possible instead to parse one section at a time, for example by specifying the XML tag I'm after?

Assume my xml looks something like:

<songs>
  <song>
   <!-- details I want -->
  </song>
</songs>

Is there something like:

function parsexml(callback) {
    fs.readFile(__dirname + '/sample_larger.xml', function (err, data) {
        parser.parseSection("song", function (err, result) {

            //do my gremlin query into my Cosmos DB
            
            callback(null)
        })
    })
}

Any advice/help appreciated.

Thanks

psmod2 avatar May 09 '18 04:05 psmod2

copying the 10GB into that variable pData - seems like a bit of a waste

Remember, when you do pData = result, you're not copying anything. pData then holds a reference to the same object as result, so both names point to the same data.
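
A quick way to see this for yourself is the small sketch below. The result object here is made-up stand-in data, not real xml2js output; only the assignment behaviour matters. Note that the memory cost in the original code comes from reading and parsing the whole file, not from the assignment.

```javascript
// Stand-in for the object xml2js passes to the callback (hypothetical data).
const result = { songs: { song: [{ title: 'A' }] } };

// Assignment copies the reference, not the parsed tree.
const pData = result;

// Both names refer to the very same object.
console.log(pData === result);

// A change made through one name is visible through the other.
pData.songs.song.push({ title: 'B' });
console.log(result.songs.song.length);
```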

andersponders avatar May 10 '18 15:05 andersponders

Right - so nothing really to worry about.

Nevertheless, is it possible to parse a block of XML at a time - based on a tag I specify?

psmod2 avatar May 11 '18 08:05 psmod2

@psmod2 I had the same problem. I've made a function that does this and follows the xml2js format (it's pretty slow, but doesn't run out of memory):

https://github.com/tfso/njs-tfso-xml/blob/master/src/streamParse.js

Usage: https://github.com/tfso/njs-tfso-xml/blob/master/test/testStreamParse.js

It returns in the same format as the xml2js version here:

https://github.com/tfso/njs-tfso-xml/blob/master/src/parse.js

Would be awesome to get this functionality in xml2js as well.
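
For readers who only want the general idea, here is a minimal sketch of section-at-a-time parsing. It is not the linked implementation and not an xml2js API: forEachSection is a hypothetical helper that scans the text for one tag and hands each element to a callback, which could then pass that small string to parser.parseString. It assumes the stream yields strings (for example fs.createReadStream with encoding 'utf8'), that the element is never nested inside itself or self-closing, and that the closing tag never appears inside CDATA or comments.

```javascript
// Hypothetical helper, not part of xml2js: calls onSection with the raw XML
// of every <tagName>...</tagName> element found in a stream of strings.
// Naive text scanning: assumes no self-nesting, no self-closing form, and
// no closing tag hidden inside CDATA or comments.
async function forEachSection(stream, tagName, onSection) {
    // Match "<song" only when followed by whitespace or ">", so "<songs>" is skipped.
    const openRe = new RegExp('<' + tagName + '(?=[\\s>])');
    const close = '</' + tagName + '>';
    let buffer = '';

    for await (const chunk of stream) {
        buffer += chunk;
        for (;;) {
            const start = buffer.search(openRe);
            if (start === -1) {
                // Keep a short tail in case an opening tag is split across chunks.
                buffer = buffer.slice(-(tagName.length + 1));
                break;
            }
            const end = buffer.indexOf(close, start);
            if (end === -1) {
                // Element is incomplete: drop what precedes it and wait for more data.
                buffer = buffer.slice(start);
                break;
            }
            // Awaiting here means the file is not read further until the
            // callback (for example a database write) has finished.
            await onSection(buffer.slice(start, end + close.length));
            buffer = buffer.slice(end + close.length);
        }
    }
}

// Possible usage (assumes fs, parser and the file from the original post):
//
// const stream = fs.createReadStream(__dirname + '/sample_larger.xml', { encoding: 'utf8' });
// forEachSection(stream, 'song', function (xml) {
//     return new Promise(function (resolve, reject) {
//         parser.parseString(xml, function (err, song) {
//             if (err) return reject(err);
//             // do the Gremlin query for this one song here
//             resolve();
//         });
//     });
// });
```

Memory use then depends on the size of the largest single element rather than on the size of the file. A real SAX-style parser would be more robust than this text scan for anything beyond simple, regular documents.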

magnusjt avatar May 30 '18 07:05 magnusjt

Also see https://github.com/Leonidas-from-XIV/node-xml2js/issues/137 and https://github.com/Leonidas-from-XIV/node-xml2js/issues/102

jcsahnwaldt avatar Jun 15 '18 06:06 jcsahnwaldt

@psmod2 any workaround?

at4446 avatar May 13 '21 04:05 at4446