
Document usage with Rails streaming API

Open alexanderadam opened this issue 6 years ago • 17 comments

Hi @felixbuenemann,

first of all: xlsxtream is awesome! Thank you for that gem.

I just found out that since version 2 it's possible to stream with the Rails streaming API, and I thought it might be good to document that as well.

Thank you and have a wonderful day!

alexanderadam avatar Mar 08 '18 15:03 alexanderadam

Are you suggesting adding an example of how to use this gem with ActionController::Live streaming to the README?

felixbuenemann avatar Mar 08 '18 21:03 felixbuenemann

Oh, I think it should be documented, but I'm undecided about the what and how. Some gems put that into the README, some have a dedicated doc/ directory with more documentation, and some have an examples/ directory.

I'm not sure whether it should go directly into the README, as Rails is just one of several useful integrations. So I guess one of the other two variants might be more suitable.

But in general I think it would be useful to advertise the possibilities, as this is probably the only Ruby XLSX library that provides streaming functionality.

alexanderadam avatar Mar 09 '18 09:03 alexanderadam

@alexanderadam How about an examples/ directory plus a link from the bottom of the README.md?

sandstrom avatar Mar 12 '18 09:03 sandstrom

@sandstrom sounds perfect :+1:

alexanderadam avatar Mar 12 '18 17:03 alexanderadam

I'm very busy right now, but I'm happy to accept a PR with an example.

Either a file in the repo or a link to a gist would work for me, but including it in the repo probably makes it easier to keep up to date in the future.

felixbuenemann avatar Mar 12 '18 22:03 felixbuenemann

Oh, I just thought about it: an example could also live in the wiki, which is even easier to update than an example in the repo.

felixbuenemann avatar Mar 14 '18 19:03 felixbuenemann

I would love to help. I already opened a pull request for the README.md file.

zmd94 avatar Sep 04 '18 07:09 zmd94

@zmd94 Note that the example code I gave you does not use the Rails streaming API, which is what's being discussed here. The streaming API allows streaming a large XLSX file without ever writing it to disk.
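
Roughly, the idea looks like this (a minimal sketch, not from the README; names are placeholders): the workbook writes to any object that responds to #<<, so the XLSX chunks can go straight into a streaming response body instead of a file.

    require 'xlsxtream'

    # Any object responding to #<< works as the output target, e.g. the
    # yielder of an Enumerator used as a Rails streaming response body.
    body = Enumerator.new do |yielder|
      xlsx = Xlsxtream::Workbook.new(yielder)
      xlsx.write_worksheet(name: 'Sheet1') do |sheet|
        sheet << ['foo', 'bar']
      end
      xlsx.close
    end

    # Consuming the enumerator produces the XLSX data chunk by chunk in memory:
    total_bytes = body.sum { |chunk| chunk.bytesize }
    puts "generated #{total_bytes} bytes without touching the disk"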

felixbuenemann avatar Sep 04 '18 13:09 felixbuenemann

I was able to get streaming working using the following (in a controller method):

    def export
      sql = MyModel.where(something: 'test').where.not(something: 'else').to_sql
      
      headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
      headers['Content-disposition'] = "attachment; filename=\"#{filename}.xlsx\""
      headers['X-Accel-Buffering'] = 'no'
      headers['Cache-Control'] = 'no-cache'
      headers['Last-Modified'] = Time.zone.now.ctime.to_s
      headers.delete('Content-Length')

      # Return enumerator to stream response
      self.response_body = Enumerator.new do |enumerator|

        # Create the workbook
        xlsx = Xlsxtream::Workbook.new(enumerator, font: {
          # name: 'Times New Roman',
          # size: 10, # size in pt
          # family: 'Roman' # Swiss, Modern, Script, Decorative
        })

        # Write the worksheet
        header_written = false
        xlsx.write_worksheet(name: 'My Excel Sheet', auto_format: true) do |sheet|

          raw_connection = ActiveRecord::Base.connection.raw_connection
          raw_connection.send_query(sql)
          raw_connection.set_single_row_mode
          raw_connection.get_result.stream_each do |record|
            unless header_written
              sheet.add_row record.keys
              header_written = true
            end

            sheet.add_row record.values
          end
          raw_connection.get_result
        end
        xlsx.close
      end
    end

Disclaimer: This is using a Postgres / pg streaming syntax, so this will probably differ if you're using MySQL or another database type.

arcreative avatar Aug 11 '20 00:08 arcreative

@felixbuenemann Does the above example @arcreative provided work, or would you prefer another approach?

mohdasim8018 avatar Oct 09 '20 15:10 mohdasim8018

> @felixbuenemann Does the above example @arcreative provided work, or would you prefer another approach?

Depending on how performance critical your code is, you could also just use find_each or pluck_each (gem) to stream the rows.

felixbuenemann avatar Oct 11 '20 23:10 felixbuenemann

@felixbuenemann Below is the sample code that I am planning to use for 300k-400k records.

 def get_users(org)
    headers["Content-Type"] = "application/vnd.openxmlformates-officedocument.spreadsheetml.sheet"
    headers['X-Accel-Buffering'] = 'no'
    headers["Cache-Control"] ||= "no-cache"
    headers.delete("Content-Length")

    self.response_body = Enumerator.new do |enumerator|
      # Create the workbook
      xlsx = Xlsxtream::Workbook.new(enumerator)

      xlsx.write_worksheet(name: 'Journey', auto_format: true) do |sheet|
        # Generate header columns
        sheet.add_row %w(firstname lastname username group_name journey_assignment_date journey_status journey_name)
        # Generate rows
        User.where("org=?", org).select('first_name, last_name, username').find_each do |user|
          sheet.add_row [user.first_name, user.last_name, user.username]
        end
      end

      xlsx.close
    end
  end

Please let me know if you have additional suggestions.

mohdasim8018 avatar Oct 11 '20 23:10 mohdasim8018

At 300-400K rows I'd suggest skipping the model overhead and going straight to arrays, using something like arcreative proposed.

You could also use the AR in_batches method combined with pluck to load rows straight into arrays in batches of 1000:

    User.where(org: org).in_batches do |batch|
      batch.pluck(:first_name, :last_name, :username).each do |row|
        sheet.add_row row
      end
    end

Btw. your example code has a typo in the mime type header (formates).

felixbuenemann avatar Oct 12 '20 18:10 felixbuenemann

I've done some benchmarking on a 440K row table and the solution from @arcreative is a lot faster than in_batches/pluck, at about 700ms vs 5.6s. The naive find_each took 6.8s; I would've expected pluck to be much faster than going through the model.

The performance of in_batches isn't so great since it first fetches all ids for the batch in one query and then feeds them via an IN query to anything called on the relation. This probably makes it easier to implement, but sucks for performance. At least it does keyset pagination on the PK instead of the infamous OFFSET, which gets slower the further you progress into the result, since the DB needs to fetch all rows up to the offset.

felixbuenemann avatar Oct 12 '20 19:10 felixbuenemann

My streaming example should be about as fast as it can possibly get, since it's a single query without model or "array of records" involvement. It also has the advantage of never buffering more than a single row in memory at a given time. Not sure what the streaming API looks like for the mysql2 gem, but if you're using Postgres I would definitely go with my snippet.

Historically speaking, I got tired of iteratively optimizing our data export solution and just went straight to the ideal state. The only faster way to "get data" is to have Postgres export gzipped CSV and stream that right to the client, which obviously doesn't work if you need an XLSX file. It might be faster to dump to a tempfile and use some other tool to convert it to XLSX, but that would obviously be circumventing this gem entirely.
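
For reference, streaming CSV straight out of Postgres could look roughly like this (an untested sketch using the pg gem's copy_data; the query is a placeholder, and gzip could be layered on top with something like Rack::Deflater):

    self.response_body = Enumerator.new do |yielder|
      conn = ActiveRecord::Base.connection.raw_connection
      copy_sql = "COPY (SELECT first_name, last_name, username FROM users) " \
                 "TO STDOUT WITH (FORMAT csv, HEADER)"
      conn.copy_data(copy_sql) do
        # get_copy_data returns one CSV line at a time, nil when the copy is done
        while (line = conn.get_copy_data)
          yielder << line
        end
      end
    end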

arcreative avatar Oct 12 '20 19:10 arcreative

@arcreative Your code is totally fine, I would probably do the same.

The mysql2 gem also allows streaming a query: https://github.com/brianmario/mysql2#streaming

I haven't tested it, since I try to avoid MySQL if I have the choice ;-)
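
Based on the linked docs, hooking it up might look roughly like this (untested sketch; connection options are placeholders, sql and sheet as in the examples above):

    require 'mysql2'

    client = Mysql2::Client.new(host: 'localhost', username: 'app', database: 'app')
    # stream: true fetches rows lazily, cache_rows: false avoids keeping them all in memory
    result = client.query(sql, stream: true, cache_rows: false, as: :array)
    result.each do |row|
      sheet << row
    end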

I only mentioned alternatives for when the query result is small, so it doesn't really matter.

One thing your example seems to be lacking is error handling: it doesn't make sure the query result is fully consumed (with an ensure block or something like that), so if an exception is raised mid-stream the next query on that connection would fail (you need to call the final get_result before running another send_query on the same connection).
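
Something along these lines would cover that (a rough, untested sketch based on your snippet):

    raw_connection.send_query(sql)
    raw_connection.set_single_row_mode
    begin
      raw_connection.get_result.stream_each do |record|
        sheet << record.values
      end
    ensure
      # Drain any remaining results so the next send_query on this connection
      # doesn't fail if an exception was raised mid-stream.
      while raw_connection.get_result; end
    end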

For CSV import/export, PostgreSQL COPY is awesome. I've used it to import gigabytes of CSV data in Ruby at max speed.
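
For example, a CSV import with the pg gem looks roughly like this (a sketch; table, columns and file name are made up):

    conn = ActiveRecord::Base.connection.raw_connection
    conn.copy_data("COPY users (first_name, last_name, username) FROM STDIN WITH (FORMAT csv)") do
      File.foreach('users.csv') do |line|
        conn.put_copy_data(line)
      end
    end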


For the record, here's a database-agnostic non-streaming example using keyset pagination, which is comparably fast to the streaming version but uglier, and it assumes username is a unique column, since the result excludes the primary key:

    users = User.where(org: org).reorder(:username).select(:first_name, :last_name, :username).limit(10_000)
    page = User.arel_table[:username]
    conn = User.connection
    username = ''
    loop do
      rows = conn.select_rows(users.where(page.gt(username)).to_sql)
      break if rows.empty?
      username = rows.last.last
      rows.each { |row| sheet << row }
    end

So if you can use streaming for query results, then by all means use it, but if not, the above example might help with writing a fast query. If you don't know what keyset pagination is, I suggest reading https://use-the-index-luke.com/no-offset.

Btw. the Rails implementation of in_batches is so inefficient because it needs to be generic and cater for the case where a query on the batch relation excludes the primary key column, which it uses for keyset pagination.

felixbuenemann avatar Oct 12 '20 20:10 felixbuenemann

Oh you guys rock! This just saved us a TON of time! @arcreative that snippet was gold!

Please let me know if there is any way we can contribute! As far as I'm concerned, we owe you for the countless hours you saved 😂

chaffeqa avatar Nov 10 '21 03:11 chaffeqa