
CSV reading performance issue.

Open sivagollapalli opened this issue 8 years ago • 17 comments

I just tried to read a CSV file of size 42.6 MB, consisting of 550,241 records. Here are the statistics when reading it with daru:

# analysis.rb
require 'daru'
df = Daru::DataFrame.from_csv('./IndiaAffectedWaterQualityAreas.csv')
p df.first(2)

$ time ruby analysis.rb

real	0m48.878s
user	0m47.898s
sys	0m0.798s

If I do the same with Python, it shows the following:

# analysis.py
import numpy as np # linear algebra
import pandas as pd
data = pd.read_csv('./IndiaAffectedWaterQualityAreas.csv', encoding = "ISO-8859-1")
print(data.tail(25))

$ time python analysis.py

real	0m1.649s
user	0m1.439s
sys	0m0.195s

Since GitHub doesn't allow uploading files larger than 10 MB, I couldn't attach it. But you can get the data from https://www.kaggle.com/venkatramakrishnan/india-water-quality-data

sivagollapalli avatar Apr 26 '17 05:04 sivagollapalli

Unfortunately, that is for now "how it is". Nothing is broken in the library itself that could easily be optimized: when profiling this case, most of the time is spent in Ruby's CSV parser, on things like converting strings to numbers. There are some plans/efforts to replace Ruby's default CSV library with something faster, but none of them are close to implementation.

zverok avatar Apr 26 '17 10:04 zverok

I think we should shift to a C-based CSV parser like paratext. See https://github.com/SciRuby/daru/issues/170

v0dro avatar Apr 29 '17 16:04 v0dro

What about smarter_csv? (A usage sketch follows this list.)

  • able to process large CSV files
  • able to chunk the input from the CSV file, to avoid loading the whole CSV file into memory
  • returns a Hash for each line of the CSV file, so we can quickly use the results to create MongoDB or ActiveRecord entries, or for further processing with Resque
  • able to pass a block to the process method, so data from the CSV file can be processed directly (e.g. Resque.enqueue)
  • allows a more flexible input format, where comments are possible, and col_sep, row_sep can be set to any character sequence, including control characters
  • able to re-map CSV "column names" to Hash keys of your choice (normalization)
  • able to ignore "columns" in the input (delete columns)
  • able to eliminate nil or empty fields from the result hashes (default)
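
As a rough idea of that chunked interface, here is a minimal sketch based on smarter_csv's documented process method; the file path and chunk size are placeholders:

require 'smarter_csv'

# Process the file in chunks of 100 rows; each row arrives as a Hash
# keyed by normalized column names. (Sketch; path and options are
# placeholders to adapt to your file.)
SmarterCSV.process('path/to/water.csv', chunk_size: 100) do |chunk|
  chunk.each { |row| p row }
end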

info-rchitect avatar Jun 08 '17 11:06 info-rchitect

Here is a workaround that gives a ~10x speed-up over the default :from_csv method. The CSV file in this case is ~65k rows and 40 columns. I expect the improvement to roll off as the CSV file size increases, because this method loads the whole file into memory.

[screenshot: benchmark comparing the workaround against the default :from_csv]
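
The code in the screenshot isn't reproduced here. A minimal sketch of this kind of workaround, assuming it parses the whole file with stdlib CSV (skipping the default :numeric converters) and builds the frame via Daru::DataFrame.rows, might look like:

require 'daru'
require 'csv'

# Parse the whole file at once with no converters, then build the
# DataFrame from raw rows. Assumes the first row is the header; the
# path is a placeholder.
rows = CSV.parse(File.read('path/to/data.csv'))
df   = Daru::DataFrame.rows(rows[1..-1], order: rows[0])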

info-rchitect avatar Jun 08 '17 13:06 info-rchitect

@athityakumar can you explore if it is possible to optimize from_csv using @info-rchitect's method?

v0dro avatar Jun 13 '17 15:06 v0dro

@v0dro - Sure, I'll try re-creating the benchmarks (also with smarter_csv) and get back in a couple of days.

athityakumar avatar Jun 14 '17 10:06 athityakumar

Hey all. Please find below the benchmark results for importing from CSV, run on @sivagollapalli's input CSV file, which should create a <550242*8> Daru::DataFrame.

@v0dro - As suggested by @zverok, the default options of #from_csv (mainly converters: :numeric) are slowing it down. Would it be better if this weren't set as the default?

  • The approach suggested by @info-rchitect does come out to be at least 5-6x faster than the existing #from_csv method. But that's because it isn't used with converters (convert to numeric, etc., which are set by default in #from_csv). It ends up performing very nearly the same as the existing #from_csv when used with converters: :numeric, header_converters: :symbol.

    In this regard, rcsv seems like a better alternative, as it supports column-specific options like converters (see the sketch after this list).

  • Though smarter_csv gives the user lots of options, it is by itself much slower at parsing than stdlib csv.

  • When using smarter_csv with parallel processing of chunks, the chunks become available only after the whole CSV file has been read, and it isn't async yet (see https://github.com/tilo/smarter_csv/issues/66).

    smarter_csv + Parallel does speed up parsing 5-6x (compared to without Parallel), and 1.5-2x compared to the existing #from_csv. The run shown in the benchmark is with Parallel.

  • I think it'd be great to have support for both fastcsv and rcsv in daru (now that daru-io supports partial requires), while still keeping stdlib csv as the default parsing gem. Both are almost equally fast, and 20x faster than the existing #from_csv. 🎉
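
To illustrate those column-specific options, here is a hedged sketch based on rcsv's documented :columns option; the column name 'Year' is hypothetical:

require 'rcsv'

# Per-column conversion with rcsv: only listed columns are converted,
# the rest stay strings. ('Year' is a hypothetical column name.)
data = File.read('path/to/water.csv')
rows = Rcsv.parse(data, header: :use, columns: { 'Year' => { type: :int } })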

CSV Importer Comparisons
                        user        system     total        real
existing #from_csv      99.380000   1.290000   100.670000   (142.255080)
modified #from_csv      17.340000   0.470000    17.810000   ( 40.721766)
smartercsv              65.400000   1.000000    66.400000   (116.842460)
fastcsv                  7.660000   0.380000     8.040000   ( 13.105506)
rcsv                     5.850000   0.210000     6.060000   (  7.756945)

Also, have a look at this benchmark.

@sivagollapalli - For your current use-case, please try using rcsv / fastcsv as a workaround.

# Using rcsv:
require 'daru'
require 'rcsv'
df = Daru::DataFrame.rows Rcsv.parse(File.open('path/to/water.csv'))

# Using fastcsv:
require 'daru'
require 'fastcsv'
all = []
File.open('path/to/water.csv') { |f| FastCSV.raw_parse(f) { |row| all.push row } }
df = Daru::DataFrame.rows all[1..-1], order: all[0]

athityakumar avatar Jun 15 '17 07:06 athityakumar

Converters are definitely slow in CSV, so we should probably go around them (or around the entire library) for any reasonable speed improvement.

zverok avatar Jun 20 '17 12:06 zverok

I wonder if :converters needs to be re-looked at from a design perspective. Currently :numeric is supported, but that's a bit of a special case IMHO. If you look at issue #353 regarding date column support in CSVs, should date column support be a new converter, or a date_columns argument to the from_csv function, as mentioned in the report? I feel whatever solution is finally picked should support "converters" in a standard way, with perhaps some like numeric and date provided by Daru, while also allowing a user to add converters easily. The way to add/extend converters should be standardized and well documented. There may be other cases for converters, like IP addresses, which the user may have.

Regarding the numeric converter, I noticed that it tends to mix types a bit. While reading values such as 10.1, 20, 12.22, 15, 111.2, it ends up typing 20 and 15 as Integer while the rest are Float. Perhaps there needs to be more precise control over the type to avoid subtle bugs, e.g. a novice user who thinks the entire column is Float ends up getting 6 for 20/3, expecting 6.67. Basically, I feel any converter should convert a column to only a mix of one specific type or nil, based on content.
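
For reference, this behavior of stdlib CSV's :numeric converter can be reproduced directly:

require 'csv'

# Each cell is typed independently by the :numeric converter:
row = CSV.parse_line('10.1,20,12.22,15,111.2', converters: :numeric)
p row.map(&:class) #=> [Float, Integer, Float, Integer, Float]
p row[1] / 3       #=> 6 (integer division), not 6.67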

Perhaps having a standardized way to handle converters would help us decouple converters from CSV reading performance concerns. This would also future-proof us a bit in handling any new column types users might need.

parthm avatar Jun 21 '17 09:06 parthm

@parthm - Daru::DataFrame#from_csv's :converters option defaults to :numeric and is passed on to stdlib CSV. Hence, I don't think date_columns should be clubbed with the :converters option; it should rather be a separate date_columns option.

I partly feel that decoupling converters from CSV parsing might be better. But we also have to consider that someone who has been using a specific CSV parsing gem (say, rcsv) would feel more comfortable if the corresponding Daru::DataFrame#from_rcsv method's options were passed directly on to rcsv's options (like the column-specific converters already implemented in rcsv), rather than custom converter options.

athityakumar avatar Jun 21 '17 10:06 athityakumar

I understand how automatic conversion could be useful for quick ad hoc data analysis, but as an ETL developer using Daru, I need to maintain tight control over my data types (e.g., I want my job to fail early if I get a CSV file containing a string in a date column; or, I want to make sure that leading zeroes are retained for business-id columns). To this end, I have turned off all automatic type conversion for CSV reading. Everything is read in as a string, and I have built custom methods to enforce types.
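
A minimal sketch of that strict approach, assuming from_csv forwards an empty :converters list to stdlib CSV (it passes CSV options through), and with hypothetical column names:

require 'daru'
require 'date'

# Read every field as a String (no converters), then enforce types
# explicitly so bad input fails early. Column names are hypothetical.
df = Daru::DataFrame.from_csv('input.csv', converters: [])
df['amount']  = df['amount'].map  { |v| Float(v) }                     # ArgumentError on non-numeric
df['shipped'] = df['shipped'].map { |v| Date.strptime(v, '%Y-%m-%d') } # fails on malformed dates
# 'id' is never converted, so leading zeroes are retained.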

gnilrets avatar Jun 21 '17 15:06 gnilrets

@gnilrets do you expect CSV "booleans" to convert to Ruby booleans or stay as strings? Pandas seems to default to converting to booleans, while Daru seems to keep them as strings.

baarkerlounger avatar Jul 13 '17 22:07 baarkerlounger

I guess if you're using automatic conversion, then yes, they should convert to booleans.

gnilrets avatar Jul 13 '17 23:07 gnilrets

What do you mean by automatic conversion? I just used the standard Daru::DataFrame.from_csv(file) and expected that they would convert, but it seems they don't.

baarkerlounger avatar Jul 14 '17 10:07 baarkerlounger

By default Daru uses automatic conversion from the core Ruby CSV reader. Sounds like it might not be doing so for booleans.

gnilrets avatar Jul 14 '17 15:07 gnilrets

It looks like the Ruby CSV reader doesn't have a built-in converter for booleans. Should we handle this in Daru?

https://docs.ruby-lang.org/en/2.1.0/CSV.html#Converters
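
If Daru were to handle it, one option would be registering a custom converter with stdlib CSV. A sketch, where :boolean is a name chosen here rather than a built-in converter:

require 'csv'

# Register a user-defined converter; stdlib CSV supports custom entries
# in CSV::Converters. (:boolean is our own name, not built in.)
CSV::Converters[:boolean] = lambda do |field|
  case field
  when 'true', 'TRUE', 'True'    then true
  when 'false', 'FALSE', 'False' then false
  else field
  end
end

p CSV.parse_line('true,false,maybe', converters: [:numeric, :boolean])
#=> [true, false, "maybe"]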

baarkerlounger avatar Jul 14 '17 15:07 baarkerlounger

@db579 works!

v0dro avatar Jul 18 '17 02:07 v0dro