
Memory usage for large datasets

Open guilleva opened this issue 11 years ago • 42 comments

Hi, I'm facing a memory issue when trying to generate an xlsx file with more than 600 columns and about 10k rows. Memory usage grows to roughly 3 GB, which is causing Heroku to restart the dyno.

I'm wondering if there is a known way to generate it without such heavy memory usage.

The way I'm doing it can be represented by this script:

require 'axlsx'

Axlsx::Package.new do |p|
  p.workbook.add_worksheet(:name => "Test") do |sheet|
    11_000.times do
      sheet.add_row ["test data"] * 600
    end
  end
  p.serialize('tmp/test.xlsx')
end
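
A quick way to watch the memory grow while this runs is to print the process RSS every so often (just a sketch using ps, so Linux/macOS only):

# Prints the current resident set size of this process in MB.
def rss_mb
  Integer(`ps -o rss= -p #{Process.pid}`) / 1024
end

# Example: change the loop above to `11_000.times do |i|` and add
#   puts "row #{i}: #{rss_mb} MB" if (i % 1_000).zero?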

guilleva avatar Jan 10 '14 16:01 guilleva

We're experiencing the same issue, perhaps RubyZip could be replaced with https://bitbucket.org/winebarrel/zip-ruby/wiki/Home ?

michaeldauria avatar Jan 15 '14 19:01 michaeldauria

I don't think RubyZip is the issue; axlsx is. One of the problems is that axlsx creates an enormous string which is then passed to RubyZip. That's very inefficient.
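
To illustrate the shape of the problem (a sketch, not axlsx's literal code; rows and row_to_xml are hypothetical stand-ins):

require 'zip'

xml = ''
rows.each { |row| xml << row_to_xml(row) }  # the entire sheet XML in one String

Zip::OutputStream.open('tmp/test.xlsx') do |zip|
  zip.put_next_entry('xl/worksheets/sheet1.xml')
  zip << xml  # RubyZip only ever sees the finished, enormous string
end

Writing each chunk into the zip entry as it is produced, instead of concatenating everything first, would keep peak memory closer to a single row's worth of XML.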

jurriaan avatar Jan 15 '14 20:01 jurriaan

Tried to fix it at https://github.com/jurriaan/axlsx/tree/fix_enormous_string. Please report if this helps (it should reduce usage to 1.5 GB or so).

jurriaan avatar Jan 15 '14 23:01 jurriaan

@guilleva @michaeldauria Did this change work for you? It should be possible to reduce the memory usage even further, but it's a beginning :)

jurriaan avatar Jan 17 '14 12:01 jurriaan

Hi jurriaan, thanks for looking into this. I haven't been able to test it yet, I'm really sorry. I will do my best to try it today and let you know.

Thanks again

guilleva avatar Jan 17 '14 17:01 guilleva

Just pushed some new commits; the example now uses ~700 MB on my system.

jurriaan avatar Jan 17 '14 21:01 jurriaan

I will try this out today and report back

michaeldauria avatar Jan 22 '14 17:01 michaeldauria

It's possible it won't help in your situation, though. The fact is that at the moment axlsx allocates way too many objects, which causes the heap to grow very big. I'm trying to reduce the number of allocated objects.
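
You can see the allocations with plain Ruby's GC counters (a measurement sketch, scaled down to 1k rows):

require 'axlsx'

before = GC.stat(:total_allocated_objects)
Axlsx::Package.new do |p|
  p.workbook.add_worksheet(:name => 'Test') do |sheet|
    1_000.times { sheet.add_row ['test data'] * 600 }
  end
end
puts "allocated #{GC.stat(:total_allocated_objects) - before} objects"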

jurriaan avatar Jan 22 '14 17:01 jurriaan

This did not end up helping much; memory seemed more stable, but we still ran out anyway.

michaeldauria avatar Jan 28 '14 14:01 michaeldauria

We have a similar issue. We're building sheets with hundreds of thousands of rows and as many as 100 tabs. axlsx uses 5.3 GB to build a 111 MB xlsx file.

I would be happy to test some changes on this, but your branch is no longer available. Any further updates on this?

jeremywadsack avatar May 29 '14 17:05 jeremywadsack

@jeremywadsack I've merged most of the changes into the master branch, but it's still using a lot of memory. I hope to have some time to look at this issue again in the coming weeks.

jurriaan avatar Jun 19 '14 08:06 jurriaan

Hi. Any updates on this?

maxle avatar Sep 18 '14 17:09 maxle

Our Delayed Job workers aren't processing jobs (out of memory), probably because of this issue. Any updates on the status? Cheers!

srpouyet avatar Feb 03 '15 08:02 srpouyet

+1 on this. We are having the same issue with huge datasets :/

pallymore avatar Apr 27 '15 19:04 pallymore

We just ended up spinning up more hardware to deal with this for the time being. :/

jeremywadsack avatar Apr 27 '15 20:04 jeremywadsack

Old thread. We're also having problems: a 2.1 MB file needs 650 MB of RAM.

Still waiting for a solution.

hassanrehman avatar Sep 29 '15 11:09 hassanrehman

Yeah. We would still love a solution. We run a 24 GB server instance just so we can build 15 MB files.

jeremywadsack avatar Sep 29 '15 15:09 jeremywadsack

Just had another build crash because a 400 MB Excel file took 22 GB of memory to build. (Incidentally, Excel used 1.5 GB to open this file.)

In the Python world we use xlsxwriter to generate Excel files, which has a few options to reduce memory, including flushing rows to disk as they are written. There are trade-offs there, but perhaps something like that could be done with axlsx as well?

jeremywadsack avatar Oct 09 '15 23:10 jeremywadsack

+1 for a solution/update for this.

AaronMegaphone avatar Nov 19 '15 16:11 AaronMegaphone

+1 for a solution

sudoremo avatar Nov 26 '15 09:11 sudoremo

First of all, thank you for this wonderful gem. I'm also having this problem when trying to generate 30k rows of text. As an alternative (and as opposed to increasing memory), does Axlsx have the capability to append data to an existing spreadsheet, or to join two of them, without exhausting memory?

I'd at least be able to write multiple files and join them together at the end.

altjx avatar Jan 19 '16 20:01 altjx

Maybe there could be an option to use a temp file instead of filling a StringIO object; once it's finished, everything could just be placed into the zip file. Thoughts?
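
Something like this, maybe (a minimal sketch of the idea, not an existing axlsx option):

require 'tempfile'
require 'zip'

part = Tempfile.new(['sheet', '.xml'])
part.write('<worksheet>...</worksheet>')  # in practice, written incrementally
part.rewind

Zip::OutputStream.open('tmp/test.xlsx') do |zip|
  zip.put_next_entry('xl/worksheets/sheet1.xml')
  while (chunk = part.read(64 * 1024))  # copy 64 KB at a time
    zip << chunk
  end
end
part.close!

Memory then stays bounded by the copy chunk size instead of the sheet size.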

aalvarado avatar Jan 19 '16 21:01 aalvarado

The XLSX file isn't a single file. It's a collection of files that are zipped together. Which I think makes this more complicated.

The Python library xlsxwriter has a feature like this:

The optimization works by flushing each row after a subsequent row is written. In this way the largest amount of data held in memory for a worksheet is the amount of data required to hold a single row of data.

This does introduce limits in how you can use it:

Since each new row flushes the previous row, data must be written in sequential row order when 'constant_memory' mode is on:

Perhaps something like this could be implemented in axlsx, though?
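
As a very rough sketch of what that could look like (a hypothetical class, not part of axlsx; real sheet XML needs more attributes than shown):

class StreamingSheet
  def initialize(io)
    @io = io
    @io << '<sheetData>'
  end

  # Serializes the row straight to the output and retains nothing, which is
  # why rows must arrive in sequential order and can't be edited afterwards.
  def add_row(cells)
    @io << '<row>'
    cells.each { |c| @io << "<c t=\"inlineStr\"><is><t>#{c}</t></is></c>" }
    @io << '</row>'
  end

  def close
    @io << '</sheetData>'
  end
end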

jeremywadsack avatar Jan 19 '16 21:01 jeremywadsack

I haven't taken a look at it just yet, but I'm assuming it's just like a Word doc: a bunch of XML files zipped together. With Word docs you can indeed extract certain contents from the word/document.xml file, and I've done this to pull templates and insert data into other Word docs (same process).

In the meantime, I'm going to try working on my own solution. I haven't seen much feedback from guilleva here, so I'm not really expecting a fix anytime soon.

altjx avatar Jan 19 '16 22:01 altjx

The XLSX file isn't a single file. It's a collection of files that are zipped together. Which I think makes this more complicated.

But can the file that has the actual rows still be written sequentially?

Looking inside an xlsx file, I can see that xl/sharedStrings.xml holds the string data for the sheets.

aalvarado avatar Jan 20 '16 14:01 aalvarado

+1, this is a huge issue for me. I'd be happy to hear about possible workarounds.

pol0nium avatar Jul 11 '16 12:07 pol0nium

+1, big issue for me too.

Wenzel avatar Jul 11 '16 14:07 Wenzel

Would it help if I added a bounty via Bountysource? @randym, what do you think?

pol0nium avatar Jul 18 '16 16:07 pol0nium

#352 is a similar request to append rows to an existing file.

jeremywadsack avatar Oct 27 '16 17:10 jeremywadsack

I've been struggling with the same issues; when generating more than 100k rows, our servers started to use swap and became very slow.

Solved it by making my own Excel library, https://github.com/paxa/fast_excel. It works 3-4 times faster and consumes 5-20 MB of RAM regardless of data size.
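
For the example at the top of this thread, usage looks roughly like this (see the fast_excel README for the current API):

require 'fast_excel'

workbook = FastExcel.open('tmp/test.xlsx', constant_memory: true)
worksheet = workbook.add_worksheet('Test')
11_000.times { worksheet.append_row(['test data'] * 600) }
workbook.close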

Paxa avatar Apr 23 '17 17:04 Paxa