
is fast_excel thread safe?

Mikopet opened this issue 6 years ago • 4 comments

Hey!

We have some data that we want to query from the DB and export to Excel.

workbook = FastExcel.open(tmp_file_path, constant_memory: true)
sheet = workbook.add_worksheet('name')
sheet.write_row(0, column_names, workbook.bold_cell_format)

The problem is that I want to write it in parallel.

limit = 50_000
count = 523_523 # whatever

0.step(count, limit).each do |offset|
  fork do
    data = db_query(offset, limit)

    data.each.with_index do |row, idx|
      sheet.write_row(
        offset + idx + 1,
        row
      )
    end
  end
end

Process.waitall
workbook.close

But the result is weird. Sometimes I get half of the data, sometimes a bit more (but with no empty rows, which is super weird). So the question is: is fast_excel thread-unsafe, or did I miss something?

Mikopet avatar Nov 27 '18 13:11 Mikopet

I believe fast_excel is thread safe for atomic writes (writing one cell), but write_row and append_row are not, because they iterate over the array and update state.

fork creates a new process, and file descriptors cannot be shared between processes, so I would generally avoid forking...
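For illustration, here is a rough sketch of what serializing write_row could look like with threads instead of fork (db_query, count, tmp_file_path and column_names are the placeholders from the snippet above; I have not tested this, and constant_memory mode may still expect rows in ascending order, in which case you would collect the rows and write them in order at the end):

# Rough sketch: keep the queries parallel with threads, but serialize
# every write_row call behind a Mutex so it never runs concurrently.
workbook = FastExcel.open(tmp_file_path, constant_memory: true)
sheet = workbook.add_worksheet('name')
sheet.write_row(0, column_names, workbook.bold_cell_format)

write_lock = Mutex.new
limit = 50_000

threads = 0.step(count, limit).map do |offset|
  Thread.new do
    data = db_query(offset, limit)   # queries run in parallel
    write_lock.synchronize do        # writes stay single-threaded
      data.each.with_index do |row, idx|
        sheet.write_row(offset + idx + 1, row)
      end
    end
  end
end

threads.each(&:join)
workbook.close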

According to my benchmarks fast_excel is about as fast as the standard CSV library, so you are probably hitting slow ActiveRecord allocations; those can also consume a lot of memory, and the process may start using swap. I have another project, light_record, that optimizes AR allocations and supports streaming for MySQL (streaming allocates AR objects on demand); with that I can generate about 4000 rows per second with ~20 columns (text, dates, numbers).

How many rows are you trying to write?

Paxa avatar Nov 28 '18 06:11 Paxa

light_record would be great, but I am trying to query a relatively heavy DB view, and there is no model (yet).

The view is about ~750k rows, but I have bigger data to export too. Time is not as important as memory consumption.

Mikopet avatar Nov 28 '18 16:11 Mikopet

Querying with offset/limit may be slow, depending on the query inside the view; usually each next request takes longer. Splitting the query by primary key range will be faster (like id > 1000 && id < 2000).
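As an illustration of the primary-key-range idea, a minimal sketch (heavy_view and the id column are hypothetical names, and this assumes the ids are roughly contiguous):

# Hypothetical sketch: walk the view in id windows instead of OFFSET,
# so every query is an index range scan on the primary key.
batch = 2_000
next_row = 1  # row 0 holds the header
min_id, max_id = ActiveRecord::Base.connection
  .select_rows("SELECT MIN(id), MAX(id) FROM heavy_view")
  .first.map(&:to_i)

(min_id..max_id).step(batch) do |from|
  sql = "SELECT * FROM heavy_view WHERE id >= #{from} AND id < #{from + batch} ORDER BY id"
  ActiveRecord::Base.connection.select_rows(sql).each do |row|
    sheet.write_row(next_row, row)
    next_row += 1
  end
end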

I would suggest using the database client directly (you can also get it from ActiveRecord::Base.connection_pool.checkout): run one query to get all rows with streaming, then do type casting where necessary and write to Excel.

It will then fetch records and write them to Excel one by one, which should be very memory efficient.

MySQL: https://github.com/brianmario/mysql2#streaming
Postgres: https://www.rubydoc.info/github/ged/ruby-pg/PG%2FConnection:set_single_row_mode
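For the MySQL case, a minimal sketch of how the streaming approach could look (the connection settings and heavy_view are hypothetical; the stream/cache_rows options are the ones documented in the mysql2 README linked above):

require "mysql2"
require "fast_excel"

# Hypothetical example: fetch rows with mysql2 streaming and write each one
# to the sheet immediately, so only one row is held in memory at a time.
client = Mysql2::Client.new(host: "localhost", username: "app", database: "app")

workbook = FastExcel.open("export.xlsx", constant_memory: true)
sheet = workbook.add_worksheet("export")

result = client.query(
  "SELECT * FROM heavy_view",
  stream: true, cache_rows: false, as: :array
)

result.each.with_index do |row, idx|
  sheet.write_row(idx, row)
end

workbook.close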

Paxa avatar Nov 28 '18 17:11 Paxa

It's a bit better now, thanks for the idea.

Mikopet avatar Nov 30 '18 11:11 Mikopet