
Faster loading from CSV files.

Open gnilrets opened this issue 10 years ago • 9 comments

I'm really loving the idea of this project. My only concern is performance. Reading from a 4,000 line CSV file is taking 7s (WAY too long if I'm going to try to scale to even small data sizes on the order of 100k rows). I was going to try using NMatrix, but don't see how I could try that when reading from a CSV. For example, how could I convert something like this to use NMatrix?

df = Daru::DataFrame.from_csv 'myfile.txt', { headers: true, col_sep: "\t", encoding: "ISO-8859-1:UTF-8" }

Any other ideas on how to improve performance?

gnilrets avatar Nov 12 '15 04:11 gnilrets

Daru currently uses the default Ruby CSV library (written in pure Ruby) for reading CSV files, so that's a bottleneck we can't avoid.

But there are a bunch of options you can pass to Daru for speed, mainly ones that avoid cloning data or populating missing values.

For example, set Daru.lazy_update = true. This delays updating the dataframe's missing-value tracking mechanism until you call #update. See this notebook.

Passing the clone: false option avoids cloning the columns that have been read from the CSV file. It is true by default, so you might want to change that; see the sketch below.
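Roughly, something like this (the file name and CSV options are just the ones from the original example, and whether from_csv forwards clone: false to the DataFrame constructor is an assumption here):

```ruby
require 'daru'

# Defer recomputation of the missing-value bookkeeping until #update is called.
Daru.lazy_update = true

# clone: false asks Daru not to copy the column arrays produced by the parser.
df = Daru::DataFrame.from_csv 'myfile.txt',
                              headers:  true,
                              col_sep:  "\t",
                              encoding: "ISO-8859-1:UTF-8",
                              clone:    false

df.update # recompute the missing-value tracking once, after the load
```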

Here is an example of daru being used for larger data.

v0dro avatar Nov 12 '15 08:11 v0dro

Thanks for the quick response. I tried some of your suggestions, but they didn't seem to help. The best I could do was convert my CSV into a hash of arrays and create the dataframe from that (about a 2x speed-up, which is still slow compared to just reading the CSV, which in turn is slow compared to non-Ruby CSV readers). I put up a gist with results here if you're interested: https://gist.github.com/gnilrets/611d85d5cb87fa31bb8a
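For reference, the hash-of-arrays approach is roughly this (paths and CSV options are placeholders matching the earlier example, and passing clone: false to the constructor is an assumption):

```ruby
require 'csv'
require 'daru'

# Accumulate each column into its own array while streaming the file row by row.
columns = Hash.new { |h, k| h[k] = [] }

CSV.foreach('myfile.txt', headers: true, col_sep: "\t",
            encoding: "ISO-8859-1:UTF-8") do |row|
  row.each { |name, value| columns[name] << value }
end

# Build the DataFrame from the hash of arrays instead of going through from_csv.
df = Daru::DataFrame.new(columns, clone: false)
```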

I've struggled with getting Ruby to perform with larger data sets (https://github.com/gnilrets/Remi), and I worry that the language just isn't up to the task. Would love to be proven wrong.

gnilrets avatar Nov 12 '15 18:11 gnilrets

@gnilrets can you provide your test CSV (if it is not very private data)? I'm now checking performance here and there during refactoring, and there may be some things that can be improved immediately.

zverok avatar May 28 '16 20:05 zverok

I can't supply the CSV I used in that test. But here's some publicly available data from Medicare (too big to attach directly, but still only a few tens of thousands of records): https://www.medicare.gov/download/DownloaddbInterim.asp

gnilrets avatar May 28 '16 22:05 gnilrets

I got similar relative benchmarks with both the wide and the long datasets. Basically, Daru seems to take about 3-4x as long as just parsing the CSV. My suspicion is that Daru uses CSV#by_col. If we process rows one by one to load a hash of arrays, we can speed up the load by about 2x; see the benchmark sketch below.
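A minimal sketch of the kind of comparison I ran (the file path is a placeholder for one of the Medicare files linked above):

```ruby
require 'benchmark'
require 'csv'
require 'daru'

file = 'medicare.csv' # placeholder: one of the Medicare datasets

Benchmark.bm(20) do |x|
  x.report('CSV.read')           { CSV.read(file, headers: true) }
  x.report('DataFrame.from_csv') { Daru::DataFrame.from_csv(file, headers: true) }
end
```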

gnilrets avatar May 28 '16 22:05 gnilrets

Alternatively, we can create a C extension over libcsv as an nmatrix plugin and use that for loading data into dataframes.

https://github.com/SciRuby/nmatrix/issues/407

v0dro avatar May 29 '16 09:05 v0dro

Yes, but a C extension is always something of a "last resort" (and the JRuby folks will hate it, I suppose), so my first instinct is always to try to profile and optimize the Ruby.

So far I've investigated it to the point of understanding that the CSV library itself performs pretty badly when given the :numeric converter. I'll try to invent something simple-yet-clever around it :) One possible workaround is sketched below.
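For illustration, parsing everything as strings and coercing only the known numeric columns afterwards avoids the per-field converter overhead (file path and column names here are hypothetical):

```ruby
require 'csv'

file = 'myfile.csv' # placeholder path

# converters: :numeric makes Ruby's CSV try Integer() and Float() on every
# field, which is where much of the parse time goes:
#   CSV.read(file, headers: true, converters: :numeric)

# Workaround: read everything as strings, then coerce selected columns.
table = CSV.read(file, headers: true)
numeric_columns = %w[amount count] # hypothetical column names
numeric_columns.each do |name|
  table.each do |row|
    value = row[name]
    row[name] = (Float(value) rescue value)
  end
end
```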

zverok avatar May 29 '16 15:05 zverok

No, we'll keep it MRI-specific. JRuby should use another library for CSV importing (I think jCSV from Rodrigo Botafogo can do the job: https://github.com/rbotafogo/jCSV).

v0dro avatar May 29 '16 18:05 v0dro

#170

v0dro avatar Jun 24 '16 17:06 v0dro