
XLSX.write() taking too long

irfanyounas opened this issue Jun 28 '14 · 40 comments

Hi, I have tried writing 3 columns and N rows; below we can see the time taken (on my i3 machine):

N          Time (s)
10000      9
20000      35
30000      90
65000      453

We can see that as we increase the number of rows, the time taken by the write() function increases dramatically (it's not linear); it's taking too long for 65000 records. Is there some other, more efficient way?

irfanyounas avatar Jun 28 '14 22:06 irfanyounas

I am not happy with the performance at all (granted, our first goal was to be correct). There are quite a few slow/inefficient operations that can be improved.

When I last investigated, the runtime was dominated by functions from the ZIP library (we are currently using https://github.com/Stuk/jszip), so I'm working on a new component right now (https://github.com/SheetJS/js-crc32 and https://github.com/SheetJS/js-adler32 are the first few pieces of the puzzle).

SheetJSDev avatar Jun 28 '14 22:06 SheetJSDev

Thanks, good work. I hope we will have an efficient solution soon.

irfanyounas avatar Jul 01 '14 02:07 irfanyounas

@irfanyounas just so we are on the same page, can you share the script you used?

SheetJSDev avatar Jul 01 '14 03:07 SheetJSDev

I am using the script given below: src="//cdnjs.cloudflare.com/ajax/libs/xlsx/0.7.6-g/xlsx.core.min.js"

For creating xlsx file, I have followed the following example: http://sheetjs.com/demos/writexlsx.html

irfanyounas avatar Jul 01 '14 16:07 irfanyounas

@irfanyounas how did you create a JS object with 65000 rows? Do you have numbers / text / dates? Are there missing cells (in the example, C2 is missing) or did you fill every cell?

SheetJSDev avatar Jul 01 '14 16:07 SheetJSDev

@irfanyounas Here's an example writing numbers in a 3x100001 file: http://sheetjs.com/demos/100001.html

Timestamps are printed at three points:

  • just before starting to build the workbook
  • after the write
  • after the saveas call

On Chrome 35 it takes 4 seconds to process a much larger test case than what you showed.
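A minimal sketch of that measurement (saveAs comes from FileSaver.js; make_workbook is a hypothetical stand-in for the demo's workbook builder):

function s2ab(s) { // binary string -> ArrayBuffer so the bytes survive the Blob
    var buf = new ArrayBuffer(s.length), view = new Uint8Array(buf);
    for(var i = 0; i != s.length; ++i) view[i] = s.charCodeAt(i) & 0xFF;
    return buf;
}

var t0 = Date.now();                    // just before starting to build the workbook
var wb = make_workbook(rows);           // hypothetical: 3 x 100001 array-of-arrays -> workbook
var wbout = XLSX.write(wb, {bookType:'xlsx', type:'binary'});
var t1 = Date.now();                    // after the write
saveAs(new Blob([s2ab(wbout)], {type:"application/octet-stream"}), "test.xlsx");
var t2 = Date.now();                    // after the saveas call
console.log("build+write:", (t1-t0)/1000, "s; save:", (t2-t1)/1000, "s");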

SheetJSDev avatar Jul 01 '14 16:07 SheetJSDev

Thanks, I have 3 'text' (string) columns; the second column is sometimes empty. I have used the following script:

function writeToXLSX(data) {
    var ws_name = "AllMembers", wb, ws, wbout;
    wb = new Workbook();
    ws = sheet_from_array_of_json(data, ["address", "name", "status"]);
    wb.SheetNames.push(ws_name);
    wb.Sheets[ws_name] = ws;
    wbout = XLSX.write(wb, {bookType:'xlsx', bookSST:true, type:'binary'});
    postMessage({t:"data", d:wbout});
}

function Workbook() {
    if(!(this instanceof Workbook)) return new Workbook();
    this.SheetNames = [];
    this.Sheets = {};
}

/* convert binary string to ArrayBuffer (for Blob construction on the main thread) */
function stringToArrayBuffer(s) {
    var buf = new ArrayBuffer(s.length);
    var view = new Uint8Array(buf);
    for(var i = 0; i != s.length; ++i) view[i] = s.charCodeAt(i) & 0xFF;
    return buf;
}

/* datenum helper as in the SheetJS write demo (needed for the Date branch below) */
function datenum(v, date1904) {
    if(date1904) v += 1462;
    var epoch = Date.parse(v);
    return (epoch - new Date(Date.UTC(1899, 11, 30))) / (24 * 60 * 60 * 1000);
}

function sheet_from_array_of_json(data, columnNames) {
    var ws = {}, jsonObj, cell, cell_ref;
    var range = {s: {c:0, r:0}, e: {c:0, r:0}};
    for(var R = 0; R != data.length; ++R) {
        jsonObj = data[R];
        if(range.e.r < R) range.e.r = R;
        for(var C = 0; C != columnNames.length; ++C) {
            if(range.e.c < C) range.e.c = C;

            if(!jsonObj.hasOwnProperty(columnNames[C])) continue;
            cell = {v: jsonObj[columnNames[C]]};
            if(cell.v == null) continue;

            cell_ref = XLSX.utils.encode_cell({c:C, r:R});

            if(typeof cell.v === 'number') cell.t = 'n';
            else if(typeof cell.v === 'boolean') cell.t = 'b';
            else if(cell.v instanceof Date) {
                cell.t = 'n'; cell.z = XLSX.SSF._table[14];
                cell.v = datenum(cell.v);
            }
            else cell.t = 's';

            ws[cell_ref] = cell;
        }
    }
    ws['!ref'] = XLSX.utils.encode_range(range);
    return ws;
}

irfanyounas avatar Jul 01 '14 16:07 irfanyounas

@irfanyounas http://sheetjs.com/demos/write_num.html?n=1000000 is a more general demo (you can specify the number of rows via the n parameter). The runtime appears to be dominated by generating the source array-of-arrays, not by the conversion to a workbook and not by generating the XLSX file.

Here are the times I see using the aforementioned page:

Number of Rows Time
100K 4 sec
200K 8 sec
300K 15 sec
400K 18 sec
500K Chrome crashes ...

SheetJSDev avatar Jul 01 '14 16:07 SheetJSDev

@irfanyounas just saw your last message -- where are you measuring the times? Your code sample doesn't show it. Also, can you put up the main page? It looks like you shared only the worker logic.

SheetJSDev avatar Jul 01 '14 16:07 SheetJSDev

Thanks, I am looking into it; maybe the problem is on my side. Thanks for your help.

irfanyounas avatar Jul 01 '14 17:07 irfanyounas

If you are using web workers, I recommend also timing the process in the main thread (avoiding web workers altogether). FWIW I found in some cases that transferring data to and from the worker is significantly slower than just doing the work in the main thread.

SheetJSDev avatar Jul 01 '14 18:07 SheetJSDev

I have tested it using different kinds of data:

Scenario 1: I create a list of 20,000 records, where each record is a dict like {"status": "active", "name": "", "address": "sms:+45705846525"}. In this scenario all the records are the same (copies of the dict shown above).

Scenario 2: A list of 20,000 records, where each record is a dict, but each record is different.

The runtime for Scenario 1 is 4 sec, while for Scenario 2 it is around 14 sec.

My guess is that the zip module takes longer because of the differing records (not the same text) in Scenario 2. Can you please try this scenario?


irfanyounas avatar Jul 01 '14 20:07 irfanyounas

Here's a good example: http://sheetjs.com/demos/write_text.html?n=8000. The length of the data in the B column increases as you go down. I do see a massive performance issue.

SheetJSDev avatar Jul 01 '14 21:07 SheetJSDev

Yes, though even if we keep the length of the data in the B column fixed, the performance issue will still be there if the data is different, e.g. in my case the data looks like 'sms:+45704055659', 'sms:+457683738392', 'sms:+457049821521', and so on.

irfanyounas avatar Jul 01 '14 21:07 irfanyounas

I did some more testing by adding some timing statements (simple console.log(new Date().getTime()) all over the place) and about 90% of the runtime is in the zip.generate operation, which suggests that it should be replaced.

SheetJSDev avatar Jul 01 '14 22:07 SheetJSDev

Yes, that's right. Maybe we can try different zip modules and compare the performance. Thanks

irfanyounas avatar Jul 01 '14 22:07 irfanyounas

@SheetJSDev Not to divert you from the main topic, but still: it's easier to type console.time and console.timeEnd instead of explicit console.log(new Date().getTime()).
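For example (an equivalent of the manual Date logging, assuming the console implements both methods):

console.time("write");
var wbout = XLSX.write(wb, {bookType:'xlsx', type:'binary'});
console.timeEnd("write"); // prints something like "write: 1234.567ms"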

Mithgol avatar Jul 02 '14 03:07 Mithgol

@Mithgol this is really neat! I think the MDN article needs to be updated: https://developer.mozilla.org/en-US/docs/Web/API/console.time claims that IE11 supports it, but https://developer.mozilla.org/en-US/docs/Web/API/console.timeEnd has a ? for IE

SheetJSDev avatar Jul 02 '14 03:07 SheetJSDev

MDN is a wiki. If you think that it needs to be updated, update it.

(I would have updated it myself, but for some reason I'm currently on Windows XP, and IE11 cannot be installed on such an old OS.)

Mithgol avatar Jul 02 '14 05:07 Mithgol

@Mithgol updated :) IE11 actually shows the elapsed time down to the 100-nsec level (milliseconds + 4 decimal places)

SheetJSDev avatar Jul 02 '14 05:07 SheetJSDev

When I export a big number of rows in Firefox, I get a dialog saying something like "A script is broken or doesn't respond. [Debug Script] [Stop Script] [Continue Script]".

No matter how fast the export becomes, exporting even more data will always make it possible to trigger this message.

If Firefox gets some time slices for processing, by adding window.setTimeout(function(){...}, milliseconds), that should solve the issue (see the sketch below). I know that it adds time to the export, but as a user I understand that exporting this big an amount of data is a long-running task. If I see a message saying that an export is in progress, I know why the UI is blocked. It would be even better if a running export could be cancelled.
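A rough sketch of that idea (buildRow and the progress callback are hypothetical; the setTimeout yielding is the point):

function buildSheetChunked(data, onDone, onProgress) {
    var CHUNK = 5000, R = 0, ws = {}, cancelled = false;
    (function next() {
        var end = Math.min(R + CHUNK, data.length);
        for(; R < end; ++R) buildRow(ws, data[R], R); // hypothetical per-row cell writer
        if(onProgress) onProgress(R, data.length);    // lets the UI show "export in progress"
        if(cancelled) return;                         // lets the user abort a running export
        if(R < data.length) setTimeout(next, 0);      // yield a time slice so Firefox can repaint
        else onDone(ws);
    })();
    return { cancel: function() { cancelled = true; } };
}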

sandroboehme avatar Jul 02 '14 18:07 sandroboehme

@sandroboehme you can perform the write process in a Web Worker (which wouldn't lock up the UI). For example, http://oss.sheetjs.com/js-xlsx/ (hosted from the gh-pages branch) uses a web worker to read data.

The main reader demo (http://oss.sheetjs.com/, source https://github.com/SheetJS/SheetJS.github.io) shows a spinner when the read process may take a while.

SheetJSDev avatar Jul 02 '14 18:07 SheetJSDev

Regarding XLSX write: I have debugged the code and found out that inside function write_ws_xml_cell(), the following line is creating the performance issue:

v = writetag('v', ''+get_sst_id(opts.Strings, cell.v));

Specifically the following function : get_sst_id(opts.Strings, cell.v)

irfanyounas avatar Jul 02 '14 21:07 irfanyounas

@irfanyounas That function performs a scan of the shared string table. That behavior can be disabled by setting the option bookSST to false (see the example below).

I agree that a linear scan is horrible here, but from my testing it contributes less than 5% of the runtime at 20K rows. It does, however, explain the superlinear runtime.
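For example, the same write call with the scan disabled (strings are then written inline instead of through the shared string table):

wbout = XLSX.write(wb, {bookType:'xlsx', bookSST:false, type:'binary'});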

SheetJSDev avatar Jul 02 '14 21:07 SheetJSDev

If the text data is the same in all rows then it is indeed fast, but if each row has unique data the performance will be poor. For example, you can test it by creating data with the following code:

var data1 = [];
while(data1.length < 20000) {
    var randomnumber = Math.floor(Math.random() * (799999999 - 700000000 + 1)) + 700000000;
    var sms = "sms:+45" + randomnumber;
    data1[data1.length] = ["active", "abc", sms];
}

irfanyounas avatar Jul 02 '14 21:07 irfanyounas

Thanks. I was wondering: what if we don't perform a scan in the shared string table? Does it affect the compressed file size?

irfanyounas avatar Jul 02 '14 22:07 irfanyounas

@irfanyounas The original problem is that some readers (notably iOS Numbers) had issues with inline strings. The SST was the only way for strings to be displayed. Testing it now, it appears that the newest version of Numbers does support inline strings, so it might make sense to recommend inline strings.

Adding the shared string table is larger for files that don't have repeated elements (for example, with tables that have text labels and numeric bodies) but smaller for files with lots of repeated strings (for example, a table where one column describes a category).

The mystery here (which explains the difference in our understanding) is that as the size of the data increases, the zip time (as a percentage of the total write process) gets progressively worse. Basically, there are two effects: the effect of the linear scan and the effect of having to write a new file in the zip. To see this, add a console.time("zip") just before the switch statement in write_zip and a console.timeEnd("zip") in your script. I generate data this way:

var w = 2048; // <-- width of each string
var MAXLEN = 10000; // <-- number of rows

var o = new Array(MAXLEN + w);
for(var i = 0; i != MAXLEN + w; ++i) o[i] = String.fromCharCode(i); // safe for values before 0xd000 = 53248
var data = new Array(MAXLEN);
for(var dd = 0; dd != MAXLEN; ++dd) data[dd] = ["foo", o.slice(dd, dd+w).join(""), dd+2];

This ensures v8 doesn't perform some magic and ensures we have distinct strings.

For small w, the zip operations take less than half the total runtime, but as w is increased the zip operations almost completely dominate the runtime (in both cases). According to the v8 profile (you can get this in node by running with the --prof flag and then using the tick processor), the function that takes the most time is the CRC-32 function.
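For reference, a typical way to capture that profile (the --prof-process step assumes a Node version that bundles the tick processor):

node --prof write-test.js
node --prof-process isolate-*.log > processed.txt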

SheetJSDev avatar Jul 02 '14 23:07 SheetJSDev

Thanks, that's right. You are doing really good work.

irfanyounas avatar Jul 03 '14 16:07 irfanyounas

I'm having this issue too. Is there any form of the xls spec that doesn't require zipping? That might be a workaround until the zip issue is taken care of.

turbobuilt avatar Jul 12 '15 19:07 turbobuilt

The JSZip API offers the ability to leave the zip file uncompressed. Would that help?

https://stuk.github.io/jszip/documentation/api_jszip/generate.html
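With JSZip 2.x that would look something like this (a standalone sketch, not wired into XLSX.write; note the CRC-32 pass over each entry still runs even when entries are stored uncompressed):

var zip = new JSZip();
zip.file("sheet1.xml", xml);                                  // xml is a placeholder string
var out = zip.generate({type: "blob", compression: "STORE"}); // STORE skips DEFLATE entirely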

turbobuilt avatar Jan 05 '16 15:01 turbobuilt

31 columns and 300K rows from an API: json_to_sheet / writeFile breaks. Please see if this can be addressed.

kumarr10 avatar Feb 09 '22 23:02 kumarr10

@kumarr10 live demo https://jsfiddle.net/xgr4sbk1/

var NR = 300000, NC = 31;
console.time("prep");
var ws = XLSX.utils.aoa_to_sheet(Array.from({length:NR}, (_,i) => Array.from({length: NC}, (_,j) => j == 0 ? `Row${i+1}` : i+j)), {dense: true});
var wb = XLSX.utils.book_new(); XLSX.utils.book_append_sheet(wb, ws, "Sheet1");
console.timeEnd("prep");
console.time("write");
XLSX.writeFile(wb, "out.xlsx");
console.timeEnd("write");

Using version 0.18.3 in Chrome on a 2018 MacBook Pro, it takes ~13 seconds to generate a fully populated 300K x 31 worksheet. You can play with the NR / NC variables. The new limit is the string length cap when generating the worksheet XML.

SheetJSDev avatar Mar 03 '22 10:03 SheetJSDev

Hi All,

I have struggled to export large numbers of rows with the XLSX export. The data is 100 columns x 100K rows of JSON data. The browser either freezes or you get an array length error from hitting the V8 string limit in Chrome.

The following is how I resolved it; hopefully it sheds some light for anyone having similar issues.

  1. First of all, update to version 0.18.5, which includes some string-conversion performance improvements.
  2. I initially used json_to_sheet to convert to a worksheet. This always failed for large data, most likely hitting a limitation in Chrome or V8, and I was unable to solve it: no errors were raised in Chrome and the browser tab went into an infinite loop. This was resolved by changing the JSON object mapping to direct array mapping. Example: JSON data = [{a:111, b:333, c:444}, {a:111, b:333, c:444}] converts to data = [[111,333,444], [111,333,444]].
  3. Having converted from JSON to an array of arrays, aoa_to_sheet worked great with the options dense: true and raw: true. A 100K x 100-column array was converted to a worksheet in about 10s.
  4. The next problem was writeFile, which raised a range error in Chrome. Googling told me we had hit the string limit in Chrome: in writeFile, the whole workbook is converted to a string via a string join before downloading.
  5. I managed to download successfully by enabling the following options: type: "binary", bookSST: true, compression: true.
  6. The most important option here is bookSST. Basically, it indexes repeated data in the cells. One of the reasons we hit the string error is that the XML sets blank cells to an xml:space="preserve" string, so each empty cell carries an XML string of 20 to 30 characters. With bookSST, the string is indexed and just a reference index key is populated, which reduces the total string length.
  7. compression reduces the file size.
  8. Finally, I managed to download the file.

These are all based on my understanding of the code and various help from the developers when I raised issues to them; they have been a great help. Hopefully this will help others encountering the same problem. A condensed sketch of the flow is below.
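(Sketch only; jsonRows and the field names a, b, c are placeholders for the real data and columns.)

// 1. flatten the JSON objects into an array of arrays
var aoa = jsonRows.map(function(row) { return [row.a, row.b, row.c]; });

// 2+3. build a dense worksheet directly from the arrays
var ws = XLSX.utils.aoa_to_sheet(aoa, {dense: true, raw: true});

// 4. assemble the workbook
var wb = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(wb, ws, "Data");

// 5-8. write with a shared string table and compression
XLSX.writeFile(wb, "export.xlsx", {type: "binary", bookSST: true, compression: true});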

Thanks. Anand Muthu

anandmuthu1978 avatar Apr 16 '22 13:04 anandmuthu1978

"bookSST": true, "compression": true

Thanks for your explanations @anandmuthu1978!

dtslvr avatar Apr 20 '22 11:04 dtslvr

Thanks for sharing @anandmuthu1978! bookSST: true is helpful for files containing a large number of repeated strings (text labels for each row, for example) and compression: true obviously reduces the file size, but the tradeoff is increased runtime. Using type: "array" generates a Uint8Array rather than a string, which should be more efficient (see the example below).
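For example (a sketch; the Blob wrapper assumes a browser download path):

var out = XLSX.write(wb, {bookType: "xlsx", type: "array", bookSST: true, compression: true});
// typed-array output avoids joining the whole workbook into one giant string
var blob = new Blob([out], {type: "application/octet-stream"});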

SheetJSDev avatar Apr 20 '22 16:04 SheetJSDev

@anandmuthu1978 hello

Could you give me an example of how you managed to write a large number of lines?

With my code, I can save up to 60k lines. With 70k, I couldn't understand why it saved only 15 lines, and it doesn't show any error.

try {

  const ws = xlsx.utils.json_to_sheet(data)
  const wb = xlsx.utils.book_new()

  xlsx.utils.book_append_sheet(wb, ws)

  await xlsx.writeFileSync(wb, fileName)

} catch (e) {

  console.log(e)
}

thks

goproclube avatar May 10 '22 14:05 goproclube

@goproclube hi. In my case I converted the data from JSON to an array and used aoa_to_sheet(data). Most likely in your case 70k is already hitting the memory limit. I noticed json_to_sheet takes a lot of memory, so I converted to the array format. You lose the key-to-value mapping association, but as long as you want all of the fields it doesn't really matter. Example: JSON data = [{a:111, b:333, c:444}, {a:111, b:333, c:444}] converts to data = [[111,333,444], [111,333,444]] (see the snippet below).
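In code, that conversion is just a projection (keys a, b, c standing in for the real fields):

var data = jsonRows.map(function(r) { return [r.a, r.b, r.c]; });
var ws = XLSX.utils.aoa_to_sheet(data, {dense: true});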

anandmuthu1978 avatar May 15 '22 11:05 anandmuthu1978

@SheetJSDev when writing big files to .xlsx, is it possible to somehow append worksheets together? I mean, imagine I have 700K rows and my CHUNK_SIZE = 350K: I create a worksheet, append it to a workbook, pass it to XLSX.write, and get a typed array back. In the next step I repeat the same process with the next 350K piece of data. At the end I have 2 typed arrays. My question is whether it is possible to merge the worksheets together in order to get them into one file. If so, we would be able to pass chunked data to xlsx, and the zipping manipulations would not take so long.

farideliyev avatar Jun 17 '22 10:06 farideliyev