Is fast_excel thread safe?
Hey!
We have some data that we want to query from the DB and export to Excel.
workbook = FastExcel.open(tmp_file_path, constant_memory: true)
sheet = workbook.add_worksheet('name')
sheet.write_row(0, column_names, workbook.bold_cell_format)
The problem is that I want to write it in parallel.
limit = 50_000
count = 523_523 # whatever

0.step(count, limit).each do |offset|
  fork do
    data = db_query(offset, limit)

    data.each.with_index do |row, idx|
      sheet.write_row(offset + idx + 1, row)
    end
  end
end

Process.waitall
workbook.close
But the result is weird: sometimes I get half of the data, sometimes a bit more (but without any empty rows, which is super weird). So the question is: is fast_excel thread unsafe, or did I miss something?
I believe fast_excel is thread safe for atomic writes (writing a single cell), but write_row and append_row are not, because they iterate over the array and update state.

fork creates a new process, and file descriptors cannot be shared between processes, so I would generally avoid forking...
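For example, a minimal sketch of what I mean (reusing the db_query helper and names from your snippet, so those are assumptions): run the queries in threads, but keep every write_row call in the parent thread. Note that it holds all batches in memory before writing, so for very large exports the streaming approach discussed below is better.

workbook = FastExcel.open(tmp_file_path, constant_memory: true)
sheet = workbook.add_worksheet('name')
sheet.write_row(0, column_names, workbook.bold_cell_format)

limit = 50_000
count = 523_523

# Run the queries concurrently, but collect the results back
# into the parent thread instead of writing from the workers.
batches = 0.step(count, limit).map do |offset|
  Thread.new { [offset, db_query(offset, limit)] }
end.map(&:value)

# Single writer: write_row is only ever called from this thread,
# and rows are written in ascending order.
batches.sort_by(&:first).each do |offset, rows|
  rows.each.with_index do |row, idx|
    sheet.write_row(offset + idx + 1, row)
  end
end

workbook.close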
According to my benchmarks, fast_excel is about as fast as the standard csv library. You are probably facing an issue with slow ActiveRecord allocations; that can also consume a lot of memory, and the process may start using swap. I have another project, light_record, that optimizes AR allocations and supports streaming for MySQL (streaming allocates AR objects on demand); with it I can generate about 4000 rows per second with ~20 columns (text, dates, numbers).

How many rows are you trying to write?
light_record would be great, but I am trying to query a relatively heavy DB view, and there is no model (yet). The view is about ~750k rows, and I have bigger data to export too. Time is not as important as memory consumption.
Querying with offset/limit may be slow, depending on the query inside the view; usually every subsequent request takes longer. Splitting the query by primary key range will be faster (like id > 1000 && id < 2000).
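A rough sketch of that idea (the id bounds and the db_query_by_id_range helper are hypothetical, not part of fast_excel):

batch = 50_000
row_num = 1

(min_id..max_id).step(batch) do |from_id|
  # e.g. SELECT ... WHERE id >= from_id AND id < from_id + batch
  rows = db_query_by_id_range(from_id, from_id + batch)

  rows.each do |row|
    sheet.write_row(row_num, row)
    row_num += 1
  end
end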
I would suggest using the database client directly (you can also get it from ActiveRecord::Base.connection_pool.checkout), running one query to get all rows with streaming enabled, then doing type casting where necessary and writing to Excel. It will then fetch records and write them to Excel one by one, which should be very memory efficient.
MySQL: https://github.com/brianmario/mysql2#streaming
Postgres: https://www.rubydoc.info/github/ged/ruby-pg/PG%2FConnection:set_single_row_mode
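For example, a minimal sketch with the mysql2 gem (the view name and output path are placeholders; with Postgres, PG::Connection#set_single_row_mode plays the same role):

require 'mysql2'
require 'fast_excel'

# Raw Mysql2::Client taken from ActiveRecord's pool (mysql2 adapter assumed)
conn = ActiveRecord::Base.connection_pool.checkout
client = conn.raw_connection

workbook = FastExcel.open('export.xlsx', constant_memory: true)
sheet = workbook.add_worksheet('data')

# stream: true fetches rows from the server one by one,
# cache_rows: false keeps mysql2 from retaining them in memory
result = client.query('SELECT * FROM heavy_view',
                      stream: true, cache_rows: false, as: :array)

row_num = 0
result.each do |row|
  sheet.write_row(row_num, row)
  row_num += 1
end

workbook.close
ActiveRecord::Base.connection_pool.checkin(conn)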
It's a bit better now, thanks for the idea.