Implement `parse-dsv`

Open kgryte opened this issue 8 years ago • 17 comments

Delimiter-separated values. @stdlib/utils/parse-dsv.

Note that this package is not simple. The implementation should adhere as best as possible to the various CSV RFCs and pass standard CSV parsing tests.

Ideally, this implementation should be portable and/or share a code base with a streaming variant.
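
One way to picture the shared code base is a single line-level core that both a batch entry point and a streaming wrapper call into. The following is only a rough sketch under simplifying assumptions: `parseLine`, `parseAll`, and `createStreamParser` are hypothetical names, rows are assumed to end in `\n`, and quoting (including newlines inside quoted fields) is ignored entirely.

```javascript
'use strict';

// Hypothetical shared core: split one complete line on a delimiter (no quoting support).
function parseLine( line, delimiter ) {
    return line.split( delimiter );
}

// Batch variant: parse an entire string at once, skipping empty lines.
function parseAll( str, delimiter ) {
    var lines = str.split( '\n' );
    var out = [];
    var i;
    for ( i = 0; i < lines.length; i++ ) {
        if ( lines[ i ] !== '' ) {
            out.push( parseLine( lines[ i ], delimiter ) );
        }
    }
    return out;
}

// Streaming variant: buffer partial lines across chunks and reuse the same core.
function createStreamParser( delimiter, onRow ) {
    var buffer = '';
    return {
        'write': write,
        'end': end
    };

    function write( chunk ) {
        var lines;
        var i;
        buffer += chunk;
        lines = buffer.split( '\n' );

        // The last element is an incomplete line (possibly empty); keep it for the next chunk:
        buffer = lines.pop();
        for ( i = 0; i < lines.length; i++ ) {
            if ( lines[ i ] !== '' ) {
                onRow( parseLine( lines[ i ], delimiter ) );
            }
        }
    }

    function end() {
        // Flush whatever remains once no more chunks will arrive:
        if ( buffer !== '' ) {
            onRow( parseLine( buffer, delimiter ) );
            buffer = '';
        }
    }
}

var rows = [];
var parser = createStreamParser( ',', function onRow( row ) {
    rows.push( row );
});

// Chunk boundaries deliberately fall in the middle of a row:
parser.write( 'a,b\nc,' );
parser.write( 'd\ne,f' );
parser.end();

console.log( rows );
```

Because both variants delegate to the same `parseLine`, any fix to field handling would apply to both, which is the property the issue description asks for.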

kgryte avatar Dec 10 '16 01:12 kgryte

While googling around and deciding on a course of action, I found the csv-spectrum package on npm, which can be used to test an implementation.

Testing the implementation of d3-dsv this way, I ran into some trouble: although the output looked the same, a test for equality failed.

It turns out you need to use tape's deepEqual method in this case, and use the spread operator to create an array without a columns property. For future reference, this is the code I used to test the implementation:

'use strict';

const spectrum = require('csv-spectrum');
const d3DSV = require('d3-dsv');
const tape = require('tape');

tape( 'Test all cases', function test( t ) {

    spectrum( function onData( err, data ) {
        // Fail the test if the fixtures could not be loaded
        if ( err ) {
            t.end( err );
            return;
        }
        for ( const testCase of data ) {
            // Convert the data to a string
            const csvDataString = testCase.csv.toString( 'utf8' );

            // Create a new array without the columns property which breaks the equality test
            const parsed = [ ...d3DSV.csvParse( csvDataString ) ];
            const control = JSON.parse( testCase.json.toString( 'utf8' ) );

            // Test type of parsed objects
            t.equal( typeof parsed, typeof control, 'testing types of sample: ' + testCase.name );

            // Test equality by value
            t.deepEqual( parsed, control, 'testing sample: ' + testCase.name );
        }
        // End the test only after the asynchronous callback has run all assertions
        t.end();
    });
});
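
To illustrate why the spread is needed, here is a small sketch that does not depend on d3-dsv at all. It uses a hand-built array with an extra `columns` property as a stand-in for the parser's return value (the variable names are made up for illustration) and shows that spreading copies only the indexed elements.

```javascript
'use strict';

// Stand-in for a parser result: an array of row objects plus an extra `columns` property
var parsed = [ { 'a': '1', 'b': '2' } ];
parsed.columns = [ 'a', 'b' ];

// What the JSON fixture would decode to: a plain array with no extra properties
var control = [ { 'a': '1', 'b': '2' } ];

// Spreading iterates the array, so only the indexed elements are copied
var copy = [ ...parsed ];

console.log( Object.keys( parsed ) );
// => [ '0', 'columns' ]

console.log( Object.keys( copy ) );
// => [ '0' ]
```

The copy has the same own keys as `control`, so a structural comparison no longer sees the additional property.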

labiej avatar Feb 06 '17 13:02 labiej

@labiej Thanks for looking into this!

kgryte avatar Feb 06 '17 19:02 kgryte

Ref: https://github.com/d3/d3-dsv

kgryte avatar Jul 20 '22 00:07 kgryte

CSV-spectrum: https://github.com/maxogden/csv-spectrum

kgryte avatar Jul 20 '22 00:07 kgryte

Papaparse: https://github.com/mholt/PapaParse

kgryte avatar Jul 22 '22 07:07 kgryte

RFC: https://datatracker.ietf.org/doc/html/rfc4180

kgryte avatar Jul 22 '22 19:07 kgryte

Python built-in CSV API: https://docs.python.org/3/library/csv.html

PEP 305: https://peps.python.org/pep-0305/

kgryte avatar Jul 22 '22 19:07 kgryte

GoLang: https://pkg.go.dev/encoding/csv

kgryte avatar Jul 22 '22 19:07 kgryte

MATLAB: https://www.mathworks.com/help/matlab/ref/csvread.html and https://www.mathworks.com/help/matlab/ref/readmatrix.html

MATLAB's API is interesting insofar as it supports reading only sections of a CSV file.

kgryte avatar Jul 22 '22 19:07 kgryte

Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Pandas has arguably the most complex CSV read API. Some interesting features:

  • Select particular columns.
  • Ability to provide custom converters for particular columns.
  • Comment support.
  • Custom separator support (decimals, thousands, etc).
  • Dialects (as in native Python CSV API)
  • Support for custom date parsing.
  • Support for specifying a particular column as a column of row labels

kgryte avatar Jul 22 '22 20:07 kgryte

Streaming CSV parser for Node: https://github.com/mafintosh/csv-parser; however, the issue tracker suggests some concerns with the implementation.

kgryte avatar Jul 22 '22 20:07 kgryte

R: https://www.rdocumentation.org/packages/qtl2/versions/0.28/topics/read_csv

kgryte avatar Jul 22 '22 20:07 kgryte

Parsing options:

  • delimiter: delimiter to use. For CSV, the value would be ,. For TSV, the value would be a TAB character.
  • thousands: thousands separator. This would allow numbers to be written 1,000,000.
  • decimal: decimal separator. This helps with European data, which formats 3.14 as 3,14.
  • quote: character sequence used to denote the start and end of a quoted item. Quoted items can include the delimiter, in which case the delimiter must not be treated as a field separator.
  • true: array of values to be considered equal to true. E.g., the string 'True' would be converted to true, a boolean.
  • false: array of values to be considered equal to false.
  • quoted: list of columns which may have quoted field values, or false, indicating that no fields should be quoted. (update: not sure of the original intent of this option. However, it may be to provide a fast path for parsing, in which case a value of true could mean that all fields may contain quoted field values.)
  • doublequote: boolean indicating whether to interpret two consecutive quote character sequences inside a quoted field as a single quote character sequence.
  • comment: character sequence indicating that the remainder of a line should not be parsed.
  • escape: character sequence used to escape other characters.
  • columns: list of columns to return. Default is to return all columns.
  • missing: list of strings to recognize as missing values (e.g., NA, NaN, null, etc).
  • whitespace: list of characters to interpret as whitespace.
  • trim: boolean indicating whether to trim leading whitespace in each field value.
  • trimNonNumeric: boolean indicating whether to trim non-numeric characters from a numeric value.
  • transforms: object whose properties are column numbers and whose values are callbacks which should be invoked for the respective column values and which return a transformed value. E.g., a callback which converts a string to a Date object.
  • consecutiveDelimiters: rule specifying how to handle consecutive delimiters: keep, join, error.
  • leadingDelimiters: rule specifying how to handle leading delimiters: keep, ignore, error.
  • trailingDelimiters: rule specifying how to handle trailing delimiters: keep, ignore, error.
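
As a rough illustration of how the `thousands`, `decimal`, `true`, `false`, and `missing` options might interact when converting a single field value, consider the sketch below. `convertField` is a hypothetical helper, not part of any existing package, and returning `null` for missing values is an assumption made only for this example.

```javascript
'use strict';

// Hypothetical helper: convert one already-split field value according to the options.
function convertField( value, opts ) {
    var str;

    // Assumption: missing values are represented as `null`
    if ( opts.missing.indexOf( value ) >= 0 ) {
        return null;
    }
    if ( opts[ 'true' ].indexOf( value ) >= 0 ) {
        return true;
    }
    if ( opts[ 'false' ].indexOf( value ) >= 0 ) {
        return false;
    }
    // Remove thousands separators, then normalize the decimal separator to '.'
    str = value;
    if ( opts.thousands ) {
        str = str.split( opts.thousands ).join( '' );
    }
    if ( opts.decimal !== '.' ) {
        str = str.split( opts.decimal ).join( '.' );
    }
    // Only convert values which look like plain decimal numbers
    if ( /^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$/.test( str ) ) {
        return parseFloat( str );
    }
    // Otherwise, leave the field untouched
    return value;
}

var us = {
    'thousands': ',',
    'decimal': '.',
    'true': [ 'True', 'true' ],
    'false': [ 'False', 'false' ],
    'missing': [ 'NA', 'NaN', 'null' ]
};
var eu = {
    'thousands': '.',
    'decimal': ',',
    'true': [ 'True', 'true' ],
    'false': [ 'False', 'false' ],
    'missing': [ 'NA', 'NaN', 'null' ]
};

console.log( convertField( '1,000,000', us ) );
// => 1000000

console.log( convertField( '3,14', eu ) );
// => 3.14

console.log( convertField( 'True', us ) );
// => true
```

Note that the conversion runs on a field that has already been split out of its line. When the delimiter and the thousands separator are both `,`, a value such as 1,000,000 would presumably need to be quoted in the source data.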

kgryte avatar Jul 23 '22 00:07 kgryte

Prospective API design:

readDSVLine( [options] )

Returns a function for reading a single DSV line according to provided options.

var opts = {
    'delimiter': ',',
    'comment': '#',
    'whitespace': [ ' ' ]
};

var reader = readDSVLine( opts );
// returns <Function>

  • If options is not provided, default values are used.

reader( line )

Parses a single DSV line and returns an array of values.

var reader = readDSVLine( {} );

var line = reader( 'foo,bar,beep,boop' );
// returns [ 'foo', 'bar', 'beep', 'boop' ]

  • If the reader is unable to parse a provided line, the function must return null.
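
To make the proposed behavior concrete, here is a minimal sketch of what such a reader factory could look like. It is not the stdlib implementation. It assumes single-character `delimiter`, `quote`, and `comment` values, supports only the `doublequote` style of escaping, and treats an unterminated quoted field as the "unable to parse" case which returns null.

```javascript
'use strict';

// Minimal sketch of the proposed factory (assumes single-character option values).
function readDSVLine( options ) {
    var opts = options || {};
    var delimiter = opts.delimiter || ',';
    var quote = opts.quote || '"';
    var comment = opts.comment || '';
    var doublequote = ( opts.doublequote !== false );

    return reader;

    function reader( line ) {
        var inQuotes = false;
        var fields = [];
        var field = '';
        var ch;
        var i;

        for ( i = 0; i < line.length; i++ ) {
            ch = line[ i ];
            if ( inQuotes ) {
                if ( ch !== quote ) {
                    // Delimiters and comment characters are literal inside quotes
                    field += ch;
                } else if ( doublequote && line[ i+1 ] === quote ) {
                    // Two consecutive quotes collapse into a single literal quote
                    field += quote;
                    i += 1;
                } else {
                    inQuotes = false;
                }
            } else if ( ch === quote && field === '' ) {
                // A quote only opens a quoted field at the start of the field
                inQuotes = true;
            } else if ( ch === delimiter ) {
                fields.push( field );
                field = '';
            } else if ( comment && ch === comment ) {
                // Ignore the remainder of the line
                break;
            } else {
                field += ch;
            }
        }
        if ( inQuotes ) {
            // Unterminated quoted field: unable to parse
            return null;
        }
        fields.push( field );
        return fields;
    }
}

var reader = readDSVLine({
    'comment': '#'
});

console.log( reader( 'foo,"bar,baz","say ""hi""",boop#note' ) );
// => [ 'foo', 'bar,baz', 'say "hi"', 'boop' ]

console.log( reader( 'foo,"bar' ) );
// => null
```

The sketch is deliberately lenient: characters following a closing quote are appended to the field rather than rejected, which a stricter RFC 4180 parser would likely treat as an error.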

reader.assign( line, out, stride, offset )

Parses a single DSV line and assigns field values to elements in the provided output array.

var reader = readDSVLine( {} );

var out = [ null, null, null, null, null, null, null, null ];

var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
// returns [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]

var bool = ( o === out );
// returns true

  • If unable to parse a line, the method should return null.
  • Users should beware that, if the method returns null, elements in the provided output array could have still been mutated.
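
The stride and offset semantics can be sketched as follows. Both `parse` and `assign` here are hypothetical stand-ins: `parse` is a naive comma split with no quoting support, and this version parses the whole line before writing anything, so it never mutates the output array on failure. An implementation which writes fields as it scans could behave differently, which is what the note above warns about.

```javascript
'use strict';

// Hypothetical stand-in for the line parser: naive split, `null` for non-string input
function parse( line ) {
    if ( typeof line !== 'string' ) {
        return null;
    }
    return line.split( ',' );
}

// Write the i-th field to index `offset + i*stride` of the output array
function assign( line, out, stride, offset ) {
    var fields = parse( line );
    var i;
    if ( fields === null ) {
        return null;
    }
    for ( i = 0; i < fields.length; i++ ) {
        out[ offset + ( i*stride ) ] = fields[ i ];
    }
    return out;
}

var out = [ null, null, null, null, null, null, null, null ];

var o = assign( 'foo,bar,beep,boop', out, 2, 1 );
console.log( o );
// => [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]

// Packing two lines into one flat, row-major array:
var flat = [ null, null, null, null ];
assign( 'a,b', flat, 1, 0 );
assign( 'c,d', flat, 1, 2 );
console.log( flat );
// => [ 'a', 'b', 'c', 'd' ]
```

Writing into a caller-provided array this way avoids allocating a new array per line, which is presumably the motivation for exposing the method alongside the plain reader.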

kgryte avatar Jul 23 '22 00:07 kgryte

For line-by-line reader, proposed package: @stdlib/utils/dsv/base/parse-line.

Once the line-by-line reader is implemented, can consider a "sniff" package and other CSV/DSV abstraction packages.

kgryte avatar Jul 23 '22 04:07 kgryte

Hi @kgryte, is this issue still open? I see that we have implemented an incremental parser here already: @stdlib/utils/dsv/base/parse. Is this open issue now a matter of creating a wrapper around it?

Infinage avatar Jun 03 '23 12:06 Infinage

No, not yet. This issue is blocked until the base implementation is finished, which it is not.

kgryte avatar Jun 03 '23 19:06 kgryte