
csv parser, optimized for performance

CSV-file parser


Copyright (C) 2012, Dmitry Kolesnikov

This file is free documentation; unlimited permissions are given to copy, distribute and modify the documentation.

This library is free software; you can redistribute it and/or modify it under the terms of the 3-clause BSD License (the "License"), as published by http://www.opensource.org/licenses/BSD-3-Clause.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!                                                                   !!!
!!!   WARNING                                                         !!!
!!!   The library is not supported.                                   !!!
!!!   Use CSV feature of https://github.com/fogfish/feta              !!!
!!!                                                                   !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Introduction

A simple CSV-file parser based on an event model. The parser generates an
event/callback when a CSV line is parsed. The parser supports both
sequential and parallel parsing. The major goal is the performance of the
intake procedure, with a parsing target of 3 - 4 microseconds per line on
the reference hardware.

                       Acc
                   +--------+
                   |        |
                   V        |
              +---------+   |
----Input---->| Parser  |--------> AccN
      +       +---------+
     Acc0          |
                   V
               Event Line 

The parser takes as input a binary stream, an event handler function and
an initial state/accumulator. The event function is evaluated against the
current accumulator and the parsed line of the csv-file. Note: the
accumulator allows application-specific state to be carried through the
event function calls.
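
The event model described above can be illustrated with a minimal sketch. It is not the library's implementation: the module name csv_sketch is hypothetical, quoting and escaping are ignored, lines are assumed to end with "\n", and fields are delivered in their natural order (the example in the Interface section reverses L before use, so the real library may deliver fields in a different order).

```erlang
-module(csv_sketch).
-export([parse/3]).

%% Fold an event function over the lines of a CSV binary.
%% Fun is called as Fun({line, Fields}, Acc) for every non-empty line
%% and must return the new accumulator. The final accumulator is returned.
%% Assumption: plain comma-separated fields, no quoting, "\n" line ends.
parse(Bin, Fun, Acc0) when is_binary(Bin), is_function(Fun, 2) ->
    Lines = binary:split(Bin, <<"\n">>, [global]),
    lists:foldl(
        fun(<<>>, Acc) ->
               %% skip empty lines, e.g. the one after a trailing newline
               Acc;
           (Line, Acc) ->
               Fields = binary:split(Line, <<",">>, [global]),
               Fun({line, Fields}, Acc)
        end,
        Acc0,
        Lines).
```

For example, csv_sketch:parse(<<"a,1\nb,2\n">>, fun({line, _}, N) -> N + 1 end, 0) counts the lines and returns 2; Acc0 is 0 and AccN is the final count.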

Compile and build

The library source code is available at the git repository:

git clone https://github.com/fogfish/pts.git

Briefly, the shell command `./configure; make; make install' should configure, build, and assemble the distribution package. The following instructions are specific to this package; see the `INSTALL' file for instructions specific to GNU build tools.

The `configure' shell script attempts to guess the dependencies and system configuration required to build the library. The following build-time dependencies exist:

--with-erlang={prefix_to_otp} supplied to `./configure' binds the library to the chosen Erlang runtime; use it if multiple Erlang environments are available on the build machine.

The high-performance version of the library should be built with native targets:

make BUILD=native

Interface

Briefly, the sequence of operations for data parse/intake is as follows; see the src/csv.erl file for the detailed interface specification and/or the example parser at priv/csv_example.erl

%% Define an event function that takes two arguments: the line value and
%% the accumulator. The function shall return a new accumulator state.
%% The structure of the accumulator is application specific; it might
%% vary from an integer to a complex record.
Fun = fun({line, L}, #my_record{count = C} = Acc0) ->
   do_my_intake_to_somewhere(lists:reverse(L)),
   Acc0#my_record{count = C + 1}
end

%%
%% A sequential parse: parses the whole data stream in the client process.
csv:parse(CSV, Fun, #my_record{})

%%
%% A parallel parse splits the CSV into multiple chunks and
%% spawns multiple processes (one process per chunk);
%% the results are aggregated in the client process.
csv:parse(CSV, 20, Fun, #my_record{})
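
The chunk-and-spawn scheme described for the parallel parse can be sketched as follows. This is an illustration under stated assumptions, not the code in src/csv.erl: the module name csv_par_sketch and its helpers are hypothetical, chunks are cut at "\n" boundaries, quoting is ignored, and the sketch returns one accumulator per chunk instead of guessing how the library aggregates them.

```erlang
-module(csv_par_sketch).
-export([parse/4]).

%% Split Bin into at most N chunks at line boundaries, fold every chunk
%% in its own process and return the per-chunk accumulators in chunk order.
%% Assumption: the caller combines the returned accumulators itself.
parse(Bin, N, Fun, Acc0) when is_binary(Bin), is_integer(N), N > 0 ->
    Parent = self(),
    Chunks = chunks(Bin, max(1, byte_size(Bin) div N)),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, fold(Chunk, Fun, Acc0)} end),
                Ref
            end || Chunk <- Chunks],
    %% selective receive on each reference keeps the results in chunk order
    [receive {Ref, Acc} -> Acc end || Ref <- Refs].

%% Cut Bin into pieces of more than Size bytes, each ending just after a
%% newline; the last piece takes whatever remains.
chunks(<<>>, _Size) ->
    [];
chunks(Bin, Size) when byte_size(Bin) =< Size ->
    [Bin];
chunks(Bin, Size) ->
    Scope = {Size, byte_size(Bin) - Size},
    case binary:match(Bin, <<"\n">>, [{scope, Scope}]) of
        {Pos, 1} ->
            {Chunk, Rest} = split_binary(Bin, Pos + 1),
            [Chunk | chunks(Rest, Size)];
        nomatch ->
            [Bin]
    end.

%% Sequential fold over the lines of one chunk (same event contract as
%% in the Interface section: Fun({line, Fields}, Acc) returns a new Acc).
fold(Bin, Fun, Acc0) ->
    lists:foldl(
        fun(<<>>, Acc) -> Acc;
           (Line, Acc) ->
               Fun({line, binary:split(Line, <<",">>, [global])}, Acc)
        end,
        Acc0,
        binary:split(Bin, <<"\n">>, [global])).
```

With a line-counting event function, lists:sum(csv_par_sketch:parse(CSV, 4, fun({line, _}, C) -> C + 1 end, 0)) gives the total number of lines. Returning a list leaves the aggregation step explicit, since the appropriate way to merge accumulators depends on the application.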

Performance

Reference platform:
* MacMini, Lion Server
* 1x Intel Core i7 (2 GHz), 4x cores
* L2 Cache 256KB per core
* L3 Cache 6MB
* Memory 4GB 1333 MHz DDR3
* Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
* erlang R15B + native build of the library

The data set has the following pattern: key, date, time, float numbers and a zz suffix
* key{1..300 000},2012-03-25,23:26:15.543,166.280,...,zz

The number of keys is 300,000, and the number of float fields is 8, 24 or 40 in the reference data. The reference data set is generated by the command

make example

or

perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt

version 0.0.1

E/Parse         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
300K,  8 flds       23.41      91.722       350.000            1.16
300K, 24 flds       50.42     489.303       697.739            2.33
300K, 40 flds       77.43     780.296       946.003            3.15

ET/hash         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
300K,  8 flds       23.41      91.722       384.598            1.28
300K, 24 flds       50.42     489.303       761.414            2.54
300K, 40 flds       77.43     780.296      1047.329            3.49

ET/tuple        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
300K,  8 flds       23.41      91.722       228.306            0.76
300K, 24 flds       50.42     489.303       601.025            2.00
300K, 40 flds       77.43     780.296       984.676            3.28

ETL/ets         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
300K,  8 flds       23.41      91.722      1489.543            4.50
300K, 24 flds       50.42     489.303      2249.689            7.50
300K, 40 flds       77.43     780.296      2519.401            8.39

ETL/pts         Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
300K,  8 flds       23.41      91.722       592.886            1.98
300K, 24 flds       50.42     489.303      1190.745            3.97
300K, 40 flds       77.43     780.296      1734.898            5.78