daru
CSV reading performance issue.
I just tried to read a CSV file of size 42.6 MB, consisting of 550241 records. Here are the statistics when I tried it with daru:
# analysis.rb
require 'daru'
df = Daru::DataFrame.from_csv('./IndiaAffectedWaterQualityAreas.csv')
p df.first(2)
$ time ruby analysis.rb
real 0m48.878s
user 0m47.898s
sys 0m0.798s
If I do the same with Python, it shows the following:
# analysis.py
import numpy as np # linear algebra
import pandas as pd
data = pd.read_csv('./IndiaAffectedWaterQualityAreas.csv', encoding = "ISO-8859-1")
print(data.tail(25))
$ time python analysis.py
real 0m1.649s
user 0m1.439s
sys 0m0.195s
Since GitHub doesn't allow uploading a file larger than 10 MB, I couldn't attach it. But you can get the data from https://www.kaggle.com/venkatramakrishnan/india-water-quality-data
Unfortunately, that is for now "how it is". Nothing is broken in the library itself that can easily be optimized. When profiling this case, most of the time is spent in Ruby's CSV parser, in things like converting strings to numbers and so on. There are some plans/efforts to replace Ruby's default CSV library with something faster, but none of them are close to implementation.
I think we should shift to a C-based CSV parser like paratext. See https://github.com/SciRuby/daru/issues/170
What about smarter_csv? A usage sketch follows the feature list below. It is:
- able to process large CSV-files
- able to chunk the input from the CSV file to avoid loading the whole CSV file into memory
- return a Hash for each line of the CSV file, so we can quickly use the results for either creating MongoDB or ActiveRecord entries, or further processing with Resque
- able to pass a block to the process method, so data from the CSV file can be directly processed (e.g. Resque.enqueue )
- allows a more flexible input format, where comments are possible, and col_sep / row_sep can be set to any character sequence, including control characters
- able to re-map CSV "column names" to Hash-keys of your choice (normalization)
- able to ignore "columns" in the input (delete columns)
- able to eliminate nil or empty fields from the result hashes (default)
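For reference, here is a minimal usage sketch of smarter_csv's chunked processing (the file path and column key are assumptions; the gem symbolizes header keys by default):

require 'smarter_csv'

# Process the file in chunks of 500 rows; each chunk is an Array of Hashes,
# one Hash per row, so the whole file never has to sit in memory at once.
SmarterCSV.process('path/to/water.csv', chunk_size: 500) do |chunk|
  chunk.each { |row| puts row[:state_name] } # :state_name is a hypothetical column
end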
Here is a workaround that is ~10x faster than the default :from_csv method. The CSV file in this case is ~65k rows and 40 columns. I expect the improvement to fall off as the CSV file size increases, because this method loads the whole file into memory.
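The workaround code itself isn't reproduced in this thread, but a minimal sketch of such a whole-file approach (skipping converters and using stdlib CSV; the file path is a placeholder) could look like:

require 'daru'
require 'csv'

# Read the entire file into memory in one pass, without any field converters,
# then build the DataFrame directly from the raw rows.
rows = CSV.read('path/to/data.csv') # loads the whole file into memory
df = Daru::DataFrame.rows(rows[1..-1], order: rows[0])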

@athityakumar can you explore whether it is possible to optimize from_csv using @info-rchitect's method?
@v0dro - Sure, I'll try re-creating the benchmarks (also with smarter_csv) and get back in a couple of days.
Hey all. Please find below the benchmark results for importing from CSV, run on @sivagollapalli's input CSV file, which creates a <550242*8> Daru::DataFrame.
@v0dro - As suggested by @zverok, the default options of #from_csv (mainly converters: :numeric) are slowing it down. Would it be better if it weren't set as the default?
- The approach suggested by @info-rchitect does come out to be at least 5-6x faster than the existing #from_csv method. But that's because it's not used with converters (convert to numeric, etc., which is set by default in #from_csv) and the like. It scales back to be very close to the existing #from_csv when used with converters: :numeric, header_converters: :symbol. In this regard, rcsv seems like a better alternative, as it supports column-specific options like converters.
- Though smarter_csv gives lots of user options, it is by itself much slower at parsing than stdlib csv.
- For using smarter_csv with parallel processing of chunks, the chunks are available only after the whole CSV file has been read, and this isn't async yet (see https://github.com/tilo/smarter_csv/issues/66). smarter_csv + Parallel does speed up parsing by 5-6 times (compared to without Parallel), and 1.5-2 times compared to the existing #from_csv. The one shown in the benchmark is with Parallel.
- I think it'd be great to have support for both fastcsv and rcsv in daru (now that daru-io supports partial requires), while still keeping stdlib csv as the default parsing gem. Both are almost equally fast, and 20x faster than the existing #from_csv. 🎉
CSV Importer Comparisons
user system total real
existing #from_csv 99.380000 1.290000 100.670000 (142.255080)
modified #from_csv 17.340000 0.470000 17.810000 ( 40.721766)
smartercsv 65.400000 1.000000 66.400000 (116.842460)
fastcsv 7.660000 0.380000 8.040000 ( 13.105506)
rcsv 5.850000 0.210000 6.060000 ( 7.756945)
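A sketch of how the comparison in the table above could be reproduced with stdlib Benchmark (the gem requires and file path are assumptions):

require 'benchmark'
require 'daru'
require 'fastcsv'
require 'rcsv'

path = 'path/to/water.csv'

Benchmark.bm(20) do |x|
  x.report('existing #from_csv') { Daru::DataFrame.from_csv(path) }
  x.report('rcsv') { Daru::DataFrame.rows Rcsv.parse(File.open(path)) }
  x.report('fastcsv') do
    all = []
    File.open(path) { |f| FastCSV.raw_parse(f) { |row| all << row } }
    Daru::DataFrame.rows all[1..-1], order: all[0]
  end
end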
Also, have a look at this benchmark.
@sivagollapalli - For your current use-case, please try using rcsv / fastcsv as a workaround.
# Using rcsv:
require 'daru'
require 'rcsv'
df = Daru::DataFrame.rows Rcsv.parse(File.open('path/to/water.csv'))

# Using fastcsv:
require 'daru'
require 'fastcsv'
all = []
File.open('path/to/water.csv') { |f| FastCSV.raw_parse(f) { |row| all.push row } }
df = Daru::DataFrame.rows all[1..-1], order: all[0] # all[0] is the header row
Converters are definitely slow in CSV. So we should probably go around them (or around the entire library) for any reasonable speed improvement.
I wonder if :converters needs to be re-examined from a design perspective. Currently :numeric is supported, but that's a bit of a special case IMHO. If you look at issue #353 regarding date column support in CSVs, should date columns be supported by a new :converter, or by a date_columns argument to the from_csv function as mentioned in the report? I feel whatever solution is finally picked should support "converters" in a standard way, with perhaps some, like numeric and date, provided by Daru, but also allowing a user to add to these converters easily. The way to add/extend converters should be standardized and well documented. There may be other cases for converters, like IP addresses, which the user may have.
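Since #from_csv forwards :converters to stdlib CSV, one possible shape for user-extensible converters already exists there. A sketch for the IP address case (the :ipaddr name is hypothetical; returning the field unchanged on failure follows the stdlib convention):

require 'csv'
require 'ipaddr'

# Register a custom field converter; fields that fail to parse are
# returned unchanged, as stdlib's built-in converters do.
CSV::Converters[:ipaddr] = lambda do |field|
  begin
    IPAddr.new(field)
  rescue IPAddr::InvalidAddressError, ArgumentError
    field
  end
end

rows = CSV.parse("host,addr\nweb1,10.0.0.1\n", headers: true, converters: [:ipaddr])
rows.first['addr'].class # => IPAddr
rows.first['host']       # => "web1" (left as a String)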
Regarding the numeric converter, I noticed that it tends to mix types a bit. While reading values such as 10.1, 20, 12.22, 15, 111.2, it ends up typing 20 and 15 as Integer while the rest are Float. Perhaps there needs to be more precise control over the type to avoid subtle bugs, e.g. a novice user who thinks the entire column is Float ends up getting 6 for 20/3, expecting 6.67. Basically, I feel any converter should convert a column to only a mix of one specific type or nil, based on content.
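For illustration, stdlib CSV's :numeric converter (which #from_csv enables by default) produces exactly this mix, and the integer division surprise follows:

require 'csv'

values = CSV.parse("10.1\n20\n12.22\n15\n111.2\n", converters: :numeric).flatten
values.map(&:class) # => [Float, Integer, Float, Integer, Float]

# The subtle bug: Integer division truncates.
20 / 3   # => 6
20.0 / 3 # => 6.666666666666667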
Perhaps having a standardized way to handle converters would help us decouple converters from CSV reading performance concerns. This would also future-proof us a bit in handling any new column types users might need.
@parthm - Daru::DataFrame#from_csv's :converters sets the default as :numeric, and passes it on to stdlib CSV. Hence, I don't think date column support should be clubbed with the :converters option; it should rather be a separate date_columns option.
I partly feel that de-coupling converters from CSV parsing might be better. But we also have to consider that someone who has been using a specific CSV parsing gem (say, rcsv) would feel more comfortable if the corresponding Daru::DataFrame#from_rcsv method's options were passed directly through to rcsv's options (like column-specific converters, which are already implemented in rcsv), rather than through custom converter options.
I understand how automatic conversion could be useful for quick ad hoc data analysis, but as an ETL developer using Daru, I need to maintain tight control of my data types (e.g., I want my job to fail early if I get a CSV file containing a string in a date column; or, I want to make sure that leading zeroes are retained for business id columns). To this end, I have turned off all automatic type conversion for CSV reading. Everything is read in as a string, and I have built custom methods to enforce types.
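A sketch of that pattern, assuming #from_csv forwards :converters to stdlib CSV as discussed above (enforce_date! and the file are hypothetical, not part of daru):

require 'date'
require 'daru'

# Disable all automatic conversion; every field comes back as a String.
df = Daru::DataFrame.from_csv('orders.csv', converters: [])

# Hypothetical helper: coerce a column to Date, failing fast on bad input.
def enforce_date!(df, column, format = '%Y-%m-%d')
  df[column] = df[column].map { |v| Date.strptime(v, format) } # raises ArgumentError on a non-date
end

enforce_date!(df, 'order_date') # column key depends on your header options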
@gnilrets do you expect csv "booleans" to convert to Ruby booleans or stay as strings? Pandas seems to default to converting to booleans while Daru seems to keep them as strings.
I guess if you're using automatic conversion, then yes, they should convert to booleans.
What do you mean by automatic conversion? I just used the standard Daru::DataFrame.from_csv(file) and expected that they would convert, but it seems they don't.
By default Daru uses automatic conversion from the core Ruby CSV reader. Sounds like it might not be doing so for booleans.
It looks like the Ruby CSV reader doesn't have a built-in converter for booleans. Should we handle this in Daru?
https://docs.ruby-lang.org/en/2.1.0/CSV.html#Converters
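For reference, a boolean converter can be registered through the same stdlib extension point. A sketch, where the accepted spellings of true/false are a design decision:

require 'csv'

# Hypothetical :boolean converter; unrecognized fields pass through unchanged.
CSV::Converters[:boolean] = lambda do |field|
  case field
  when /\Atrue\z/i  then true
  when /\Afalse\z/i then false
  else field
  end
end

CSV.parse("true\nFALSE\nmaybe\n", converters: [:boolean]).flatten
# => [true, false, "maybe"]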
@db579 works!