Implement `parse-dsv`
Delimiter-separated values. Proposed package: `@stdlib/utils/parse-dsv`.
Note that this package is not simple. The implementation should adhere as closely as possible to the various CSV RFCs and pass standard CSV parsing test suites.
Ideally, the implementation should be portable and/or share a code base with a streaming variant.
While googling around and deciding on a course of action, I found the csv-spectrum package on npm, which provides a suite of test cases for exercising an implementation.
Testing the d3-dsv implementation this way, I ran into some trouble: while the output was visibly the same, a test for equality failed. It turns out you need to use tape's `deepEqual` method in this case, and use the spread operator to create an array without the `columns` property which d3-dsv attaches to its results.
For future reference, this is the code I used to test the implementation:
```javascript
'use strict';

const spectrum = require( 'csv-spectrum' );
const d3DSV = require( 'd3-dsv' );
const tape = require( 'tape' );

tape( 'Test all cases', function test( t ) {
    spectrum( function ( err, data ) {
        for ( let testCase of data ) {
            // Convert the data to a string:
            let csvDataString = testCase.csv.toString( 'utf8' );

            // Create a new array without the `columns` property, which breaks the equality test:
            let parsed = [ ...d3DSV.csvParse( csvDataString ) ];
            let control = JSON.parse( testCase.json );

            // Test the type of the parsed objects:
            t.equal( typeof parsed, typeof control, 'testing types of sample: ' + testCase.name );

            // Test equality by value:
            t.deepEqual( parsed, control, 'testing sample: ' + testCase.name );
        }
        t.end();
    });
});
```
@labiej Thanks for looking into this!
Ref: https://github.com/d3/d3-dsv
CSV-spectrum: https://github.com/maxogden/csv-spectrum
Papaparse: https://github.com/mholt/PapaParse
RFC: https://datatracker.ietf.org/doc/html/rfc4180
Python built-in CSV API: https://docs.python.org/3/library/csv.html
PEP 305: https://peps.python.org/pep-0305/
GoLang: https://pkg.go.dev/encoding/csv
MATLAB: https://www.mathworks.com/help/matlab/ref/csvread.html and https://www.mathworks.com/help/matlab/ref/readmatrix.html
MATLAB's API is interesting insofar as it supports reading only sections of a CSV file.
Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Has arguably the most complex CSV read API. Some interesting features.
- Select particular columns.
- Ability to provide custom converters for particular columns.
- Comment support.
- Custom separator support (decimals, thousands, etc).
- Dialects (as in the native Python CSV API).
- Support for custom date parsing.
- Support for specifying a particular column as a column of row labels.
Streaming CSV parser for Node: https://github.com/mafintosh/csv-parser; however, the issue tracker suggests some concerns with the implementation.
R: https://www.rdocumentation.org/packages/qtl2/versions/0.28/topics/read_csv
Parsing options:
- `delimiter`: delimiter to use. For CSV, the value would be `,`. For TSV, the value would be a `TAB` character.
- `thousands`: thousands separator. This would allow numbers to be written `1,000,000`.
- `decimal`: decimal separator. This helps with European data, which formats `3.14` as `3,14`.
- `quote`: character sequence used to denote the start and end of a quoted item. Quoted items can include the delimiter, and the delimiter must be ignored.
- `true`: array of values to be considered equal to `true`. E.g., the string `'True'` would be converted to `true`, a boolean.
- `false`: array of values to be considered equal to `false`.
- `quoted`: list of columns which may have quoted field values, or `false`, indicating that no fields should be quoted. (Update: not sure of the original intent of this option. However, it may be to provide a fast path for parsing, in which case a value of `true` could mean that all fields may contain quoted field values.)
- `doublequote`: boolean indicating whether to interpret two consecutive `quote` character sequences inside a quoted field as a single `quote` character sequence.
- `comment`: character sequence indicating that the remainder of a line should not be parsed.
- `escape`: character sequence used to escape other characters.
- `columns`: list of columns to return. Default is to return all columns.
- `missing`: list of strings to recognize as missing values (e.g., `NA`, `NaN`, `null`, etc.).
- `whitespace`: list of characters to interpret as whitespace.
- `trim`: boolean indicating whether to trim leading whitespace in each field value.
- `trimNonNumeric`: boolean indicating whether to trim non-numeric characters from a numeric value.
- `transforms`: object whose properties are column numbers and whose values are callbacks which should be invoked for the respective column values and which return a transformed value. E.g., a callback which converts a string to a `Date` object.
- `consecutiveDelimiters`: rule specifying how to handle consecutive delimiters: `keep`, `join`, `error`.
- `leadingDelimiters`: rule specifying how to handle leading delimiters: `keep`, `ignore`, `error`.
- `trailingDelimiters`: rule specifying how to handle trailing delimiters: `keep`, `ignore`, `error`.
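To make the per-field options above more concrete, here is a hedged sketch (not the stdlib implementation) of how `missing`, `true`, `false`, and `transforms` might be applied to already-split field values. The `postProcess` helper name is illustrative only:

```javascript
'use strict';

// Hypothetical sketch: apply per-field options to already-split field values.
// Option names mirror the proposal above; the function itself is illustrative.
function postProcess( fields, opts ) {
    return fields.map( function ( v, i ) {
        if ( opts.missing && opts.missing.indexOf( v ) >= 0 ) {
            return null; // normalize missing values
        }
        if ( opts.true && opts.true.indexOf( v ) >= 0 ) {
            return true; // e.g., 'True' => boolean true
        }
        if ( opts.false && opts.false.indexOf( v ) >= 0 ) {
            return false;
        }
        if ( opts.transforms && opts.transforms[ i ] ) {
            return opts.transforms[ i ]( v ); // column-specific converter
        }
        return v;
    });
}

var out = postProcess( [ 'True', 'NA', '42' ], {
    'true': [ 'True' ],
    'false': [ 'False' ],
    'missing': [ 'NA' ],
    'transforms': { '2': Number }
});
// => [ true, null, 42 ]
```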
Prospective API design:
`readDSVLine( [options] )`

Returns a function for reading a single DSV line according to provided options.

```javascript
var opts = {
    'delimiter': ',',
    'comment': '#',
    'whitespace': [ ' ' ]
};

var reader = readDSVLine( opts );
// returns <Function>
```

- If `options` is not provided, default values are used.
`reader( line )`

Parses a single DSV line and returns an array of values.

```javascript
var reader = readDSVLine( {} );

var line = reader( 'foo,bar,beep,boop' );
// returns [ 'foo', 'bar', 'beep', 'boop' ]
```

- If the reader is unable to parse a provided line, the function must return `null`.
`reader.assign( line, out, stride, offset )`

Parses a single DSV line and assigns field values to elements in the provided output array.

```javascript
var reader = readDSVLine( {} );

var out = [ null, null, null, null, null, null, null, null ];

var o = reader.assign( 'foo,bar,beep,boop', out, 2, 1 );
// returns [ null, 'foo', null, 'bar', null, 'beep', null, 'boop' ]

var bool = ( o === out );
// returns true
```

- If unable to parse a line, the method should return `null`.
- Users should beware that, if the method returns `null`, elements in the provided output array could still have been mutated.
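As a concrete illustration of the `quote` and `doublequote` behavior such a reader would need, here is a minimal, non-stdlib sketch of parsing a single line per RFC 4180. The `parseLine` name and signature are assumptions for illustration:

```javascript
'use strict';

// Minimal single-line DSV parser sketch: handles quoted fields and
// doubled quote characters per RFC 4180. Illustrative only.
function parseLine( line, delimiter, quote ) {
    var fields = [];
    var field = '';
    var inQuotes = false;
    var i = 0;
    while ( i < line.length ) {
        var ch = line[ i ];
        if ( inQuotes ) {
            if ( ch === quote ) {
                if ( line[ i+1 ] === quote ) {
                    field += quote; // doubled quote => literal quote character
                    i += 2;
                    continue;
                }
                inQuotes = false; // closing quote
            } else {
                field += ch; // delimiters inside quotes are literal
            }
        } else if ( ch === quote && field === '' ) {
            inQuotes = true; // opening quote at start of field
        } else if ( ch === delimiter ) {
            fields.push( field );
            field = '';
        } else {
            field += ch;
        }
        i += 1;
    }
    fields.push( field );
    return fields;
}

var f = parseLine( 'a,"b,""c""",d', ',', '"' );
// => [ 'a', 'b,"c"', 'd' ]
```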
For the line-by-line reader, proposed package: `@stdlib/utils/dsv/base/parse-line`.
Once the line-by-line reader is implemented, can consider a "sniff" package and other CSV/DSV abstraction packages.
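For context on what a "sniff" package might do, here is a rough sketch of one common heuristic: guessing the delimiter by counting candidate characters outside quoted regions. The `sniffDelimiter` name and heuristic are assumptions, not an existing stdlib API:

```javascript
'use strict';

// Hypothetical delimiter "sniffing" sketch: pick the candidate delimiter
// which occurs most often outside of double-quoted regions.
function sniffDelimiter( line, candidates ) {
    var best = candidates[ 0 ];
    var bestCount = -1;
    for ( var j = 0; j < candidates.length; j++ ) {
        var count = 0;
        var inQuotes = false;
        for ( var i = 0; i < line.length; i++ ) {
            if ( line[ i ] === '"' ) {
                inQuotes = !inQuotes; // toggle quoted state
            } else if ( !inQuotes && line[ i ] === candidates[ j ] ) {
                count += 1;
            }
        }
        if ( count > bestCount ) {
            bestCount = count;
            best = candidates[ j ];
        }
    }
    return best;
}

var d = sniffDelimiter( 'a\t"b,c"\td', [ ',', '\t', ';' ] );
// => '\t' (the comma appears only inside quotes)
```

A real sniffer would likely examine multiple lines and check for consistent field counts, but the idea is the same.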
Hi @kgryte, is this issue still open? I see that we have already implemented an incremental parser here: `@stdlib/utils/dsv/base/parse`. Is this open issue now a matter of creating a wrapper around it?
No, not yet. This issue is blocked until the base implementation is finished, which it is not.