Stream CSV file downloads
If the gem is processing a CSV task, it currently downloads the entire file into memory in the job: https://github.com/Shopify/maintenance_tasks/blob/e62fa637c8912fa8f962aecc691aeb3d82d4ac82/app/jobs/concerns/maintenance_tasks/task_job_concern.rb#L105-L110
This uses a ton of memory for large files and doesn't scale well. We should investigate how we could download the file in chunks, maybe with #download_chunk.
I wanted to consolidate different discussions here. If I understood correctly, using #download_chunk or even #download with a block would probably work for not loading the file at once but we seem to lack a proper streaming API for remote files which would work like StringIO does for locally stored files.
@etiennebarrie please correct me if I'm wrong and/or missed something from what we discussed.
Yes that's it, CSV takes an IO to be able to stream, but download_chunk just returns a chunk as a String, we need an object that plays the role of an IO but which can call back to download_chunk to get the next one.
Any concerns with downloading to tempfile? Then the CSV gem can stream it without an issue. The ergonomics in Rails could be improved (https://github.com/rails/rails/pull/49990) but until that is merged the code is trivial enough to duplicate here.
Hey @bdewater! Downloading to a tempfile that we can enumerate seems viable to me! (Tempfiles have their quirks, but can't think of anything immediately that would prevent us from doing that 🤔 )
Were you envisioning some like this, where we'd download to a tempfile in #before_perform?
def before_perform
...
if @task.has_csv_content?
Tempfile.open do |tempfile|
tempfile.binmode
@run.csv_file.download do |chunk|
tempfile.write(chunk)
end
tempfile.rewind
# Maybe we change the name of this attr; I'm also assuming that referencing the tempfile this way will
# prevent the tempfile from being GCed for the duration of the job..?
@task.csv_content = tempfile
end
...
end
And then we'd define CsvCollectionBuilder#collection to open it as a CSV?
class CsvCollectionBuilder
def collection(task)
CSV.open(csv_content, headers: true )
end
...
end
Hey @adrianna-chang-shopify 👋 yes, something along those lines could work. Re. the GC comment, CSV internally builds a CSV::Parser object that holds a reference to the given input so I believe that shouldn't be a problem.
I have also been toying with a streaming CSV enumerator for the job-iteration gem and while it looks to work I haven't had the time to write tests and submit a PR. The tempfile approach solved the immediate problem of mem usage and was good enough.
Amazing, yeah downloading to a tempfile hadn't even crossed my mind, but that certainly solves the issue of downloading the whole file up front. Do you have the bandwidth to work on this? If not, I can try to find time to explore in the next couple weeks 😄
This issue has been marked as stale because it has not been commented on in two months. Please reply in order to keep the issue open. Otherwise, it will close in 14 days. Thank you for contributing!