
Implement parse-dsv

rei2hu opened this pull request 5 years ago • 18 comments

Resolves #53.

Checklist

Please ensure the following tasks are completed before submitting this pull request.

  • [x] Read, understood, and followed the contributing guidelines, including the relevant style guides.
  • [x] Read and understood the Code of Conduct.
  • [x] Read and understood the licensing terms.
  • [x] Searched for existing issues and pull requests before submitting this pull request.
  • [x] Filed an issue (or an issue already existed) prior to submitting this pull request.
  • [x] Rebased onto latest develop.
  • [x] Submitted against develop branch.

Description

What is the purpose of this pull request?

This pull request:

  • Implements the parse-dsv function

Related Issues

Does this pull request have any related issues?

This pull request:

  • resolves #53

Questions

Any questions for reviewers of this pull request?

What did I use that I shouldn't be using in this library, and what are the replacements (e.g., buffer/fs/nesting functions, etc.)? I already know that const/let usage and general style are issues.

Any implementation suggestions?

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

The current implementation follows section 2 of RFC 4180 fairly strictly and works by reading a file byte by byte. I also assume that the delimiter will be 1 byte long. I plan on changing it to handle streams, which should be straightforward enough since it already reads a file byte by byte.

Some comments on differences between the current implementation and what the RFC describes:

  1. Each record is located on a separate line, delimited by a line break (CRLF).

The implementation allows for CR, LF, or CRLF for line breaks.

  2. There maybe an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).

I'm ignoring headers for now. Right now the implementation returns a 2d array representing rows/columns.
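
For a rough illustration of that output shape (this is not the PR's implementation, which follows RFC 4180 and handles quoting; it is just a naive split-based sketch):

// Naive sketch only: split on line breaks and a single-character delimiter.
// The actual implementation reads byte by byte and handles quoted fields.
function naiveParse( str, delimiter ) {
    return str
        .split( /\r\n|\r|\n/ )
        .filter( function onLine( line ) { return line.length > 0; } )
        .map( function onLine( line ) { return line.split( delimiter ); } );
}

console.log( naiveParse( 'a,b,c\r\n1,2,3\r\n', ',' ) );
// => [ [ 'a', 'b', 'c' ], [ '1', '2', '3' ] ]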


@stdlib-js/reviewers

rei2hu · Nov 05 '18 01:11

@rei2hu Whoa! This is awwwwwwwesssommmme! This has definitely been high on the wish list for some time! Will take a deeper look this evening!

kgryte · Nov 05 '18 04:11

I've pushed the tests I've been using (I realize I'll have to rewrite their structure too eventually).

Six of them are failing, and I have brief comments explaining why each one fails, except for regex.csv, which seems to be far outside the bounds of the RFC.

rei2hu · Nov 05 '18 04:11

@rei2hu I suppose initial feedback would be whether the file reading can be separated from parsing DSV. My thinking here is whether we might be able to create a parsing engine which "incrementally" parses provided data. Borrowing a bit of inspiration from @stdlib/stats/incr/*, the idea would be something like

function createParser( options ) {
    // Internal state configuration...

    return incrparse;

    function incrparse( byte ) {
        // Parse byte and update internal state...
    }

    // Various helper functions for parsing provided data...
}

My sense is that an incremental engine of this sort could then be used in multiple contexts (e.g., streams, reading a file, etc.), where wrapper code would read/pull data and then pass it along to the incremental parsing engine.
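
For instance, wrapper code for a stream context might look something like the following (just a sketch: it assumes the hypothetical createParser factory above and uses Node's stream.Writable as one possible consumer):

var Writable = require( 'stream' ).Writable;

// Hypothetical: wrap the incremental parsing engine in a writable "sink"...
var incrparse = createParser( {} ); // `createParser` as sketched above

var sink = new Writable({
    'write': function write( chunk, encoding, next ) {
        var i;
        // Feed each received byte to the incremental parser...
        for ( i = 0; i < chunk.length; i++ ) {
            incrparse( chunk[ i ] );
        }
        next();
    }
});

// e.g., fs.createReadStream( 'data.csv' ).pipe( sink );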

Just a thought.

kgryte · Nov 05 '18 04:11

@rei2hu Considering an incremental parser, rather than a byte-by-byte parser, it may be better to have an incremental parser which accepts arbitrarily sized chunks (per update).

kgryte · Nov 05 '18 05:11

Alright, I think I have something working for that kind of implementation, but instead of taking in raw bytes it takes in characters. This lets it pass the utf8 test without doing anything fancy. I also added a quick if statement to handle the test where the delimiter is ; instead of , so it passes that test now as well.

rei2hu · Nov 05 '18 07:11

@rei2hu Couple questions:

  1. Have you considered adding support for inferring the separator from provided data?
  2. What is the strategy for dealing with headers?
  3. Should we make the incrparse function an event emitter? Namely, such that we emit events upon parsing headers and each row?

kgryte · Nov 06 '18 03:11

  1. Not really, although if it's wanted I have a few ideas on how to do it, though it would be very computationally expensive. For a "perfect" solution, the implementation would have to find a substring that exists in every row and splits each row into an equal number of parts, while also accounting for quoted sections and other details.

    There are various ways to lower the cost, although accuracy might take a hit:

    • limiting the length of the assumed delimiter
    • ignoring things like quotes
    • just getting rough counts and picking whatever is best
  2. If a header option is provided, the first row will contain column data. Then, instead of a 2d array, a 1d array of objects will be returned. A few test cases should already have this format (for example, escaped_quotes.json); I just transform them into 2d arrays so they pass. The implementation itself should be straightforward: keep a separate variable for header data and set it once row === 1, which means that the first row, containing the header data, has already been parsed.

  3. I thought about making it return something so that it could be used like

var stream_parser = dsv_parser();
readable_stream.pipe(stream_parser);

stream_parser.on('event', (x) => {
    // do something
});

I think the current implementation is better, though, because you could write to any writable stream and just put incrparse inside the callback.

I also had another idea where you could extract only part of the dsv data.

dsv_parser().start_row(10).end_row(20).start_column(5).end_column(10).read()

I think making it emit events would also work. I also thought about having no internal record of what has been read so far and relying on the user to store results in his/her own structures, like

parser.on('textdata', (d, row, column) => {
    mystruct[row][column] += d;
});
parser.on('delimiter', (d, row) => {
    mystruct[row].push('');
});

which could be light on memory. Perhaps we can go with that and have a version that keeps a record, which just wraps the emitter version.

rei2hu · Nov 06 '18 04:11

Re: 1. One way to address computational cost is to make such cost tunable. For reference, Python has the builtin csv.Sniffer class. For us, it may be good enough to, say, "sniff" the first ten rows/records by default. We can do this by maintaining an initial pool. Once the pool has been filled or null is provided, we then analyze the data, split it into rows, and then perform an analysis similar to what you suggest, based on tabulating character frequencies, etc. We can make this configurable by allowing the number of analyzed rows to be specified. If a user wants "perfect" inference, s/he can specify an enormous number of rows, thus leading to all CSV data being pooled in memory before analysis. Otherwise, if a user is okay with a quick look because the data is likely to be highly regular and well-formatted, then maybe s/he is willing to analyze only the first 1 or 2 rows after the header. By making it configurable, hopefully a user can make a judgement as to the perf/accuracy tradeoff.

Re: 2. Yeah, I am not certain what is preferred here. It is not immediately obvious that we need to format a row as an object if a header row is present, or, alternatively, if, say, we allow the column headers to be provided as a configuration option. In both cases, we could just emit a header row (array) and then subsequently emit each row also as an array. If someone wanted key-value pairs, they could do a transformation. I feel like this would be a bit more efficient, even if slightly more inconvenient for those preferring key-value pairs.
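
For example, a consumer wanting key-value pairs could perform the transformation along these lines (a sketch which assumes a parser instance emitting a "row" event and that the first emitted row is the header):

// Hypothetical consumer-side transformation from row arrays to key-value pairs...
var records = [];
var header;

parser.on( 'row', function onRow( row ) {
    var o;
    var i;
    if ( !header ) {
        // First emitted row is the header...
        header = row;
        return;
    }
    o = {};
    for ( i = 0; i < header.length; i++ ) {
        o[ header[ i ] ] = row[ i ];
    }
    records.push( o );
});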

Re: 3. Agreed. I like a consumption agnostic CSV parser, as is now. That I can wrap it with a stream interface or manually feed it data, etc, I find to be a net positive. We can readily create specialized wrappers once the incremental parser is feature complete, so I see no need to create a single "super" package.

As to your idea of parsing regions of CSV data, I think this is better left to other tools within an analysis pipeline (e.g., head, tail, etc). They don't need to be combined into a single "super" package. And even if we wanted to expose such functionality, this seems perfectly suited for another utility package which leverages what you have written thus far: a general purpose small core which can be deployed in numerous contexts.

As to event emitters and state, I think the only way for us to allow the incremental parser to be a "transform", rather than a "sink", is by using event emission (or providing callbacks, but event emission is a cleaner API with marginally more overhead due to delegation). The "sink" approach is problematic due to its need to keep all of the parsed data in memory before returning the final output. The "transform" approach is not particularly amenable to something like @stdlib/stats/incr/* because incomplete data chunks may mean that the incremental parser does not return a row per invocation. If chunk size is random, so, too, should parsing an entire row be random. This leads naturally to an event emission API.

In this case, I think you are correct: may be best to have the incremental parser maintain as little output state as possible and, if, say, a concatenation of all parsed rows is desired, then someone can combine the incremental parser with a "sink", or as you show in your last example, a receiver who transforms the data into a desired data structure. Granted, for specialized use cases, this is not likely to be the most performant (we could write more optimized versions which assume specific input data sizes and particular output formats, thus eliminating the event emission overhead); however, it is likely to be the most performant option on average across the various use cases. In sum, I think your recommendation is sensible.

kgryte · Nov 06 '18 09:11

I changed the implementation to a class so it could inherit from EventEmitter. Right now it will only emit rows, i.e., pd.on('row', (data, rowNumber) => ...). I tried to follow the style of things in @stdlib/plot/components/svg, which is where I found EventEmitter used.

I left in dsvState for now just because the tests use it, which reminded me that I'm not sure how I would end up structuring the tests if the implementation ends up discarding it. I think they will have to rely on an additional event for end of stream and a callback in order to work.

I also decided to differentiate between something like "","","" and ,,, where the former will be ['','',''] and the latter will be [null,null,null].
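
As a usage sketch (the DsvParser class, the push-style update method, and the event names here are only illustrative of what is being discussed, not a final API):

var parser = new DsvParser({ 'delimiter': ',' });

parser.on( 'row', function onRow( data, rowNumber ) {
    console.log( rowNumber, data );
});

// Chunks can be pushed incrementally and need not align with row boundaries...
parser.push( '"a","b",""\n,' );
parser.push( ',\n' );
parser.push( null ); // signal end of input

// Logs (row numbering illustrative):
// => 0 [ 'a', 'b', '' ]
// => 1 [ null, null, null ]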

Also here is my first attempt at guessing the delimiter. I already know several cases where it could be wrong.

module.exports = (input, newline) => {
        input = input.split(newline || '\n'); // split on the provided newline sequence, defaulting to LF
        const TARGET_NUM = input.length;
        // target is something that appears y times in TARGET_NUM rows
        input = input.map(row => {
                let counts = {}; // counts of each character
                for (const char of row) {
                        counts[char] = (counts[char] || 0) + 1;
                }
                return counts;
                // counts of each char in each row
                // array of objects
        });
        // how many rows x appeared y times in
        input = input.reduce((acc, rowData) => {
                // for each row
                for (const char in rowData) {
                        const count = rowData[char];
                        // determine how many rows this char appeared count times in
                        const meta = input.filter(e => e[char] === count).length;
                        if (!acc[char]) acc[char] = {max: -1};
                        acc[char][count] = meta;
                        if (acc[char].max < meta) {
                                acc[char].max = meta;
                                acc[char].count = count;
                        }
                }
                return acc;
        }, {});
        let vals = Object.entries(input).sort((a, b) => {
                let max = b[1].max - a[1].max;
                if (max === 0) { // both occur y times in the same number of rows
                        // whichever is more common per row wins;
                        // should probably also check which character it is
                        return b[1].count - a[1].count;

                }
                return max;
        });
        console.log(vals.slice(0, 10)); // top 10 choices.
        return vals[0][0]; // return thing with highest max and highest count
}

rei2hu · Nov 06 '18 19:11

Current ideas I have for the sniffer/delimiter guesser:

  1. It's enabled through an option that can be set when passing options to the DsvParser and will only take effect if options.delimiter isn't set.

  2. The DsvParser won't emit any rows until a delimiter can be determined. This will require a minimum number of rows to be passed in, along with some other conditions. Once determined, the DsvParser will emit an event called delimiter with the delimiter and an array of the rows that have been read so far, i.e., parser.on('delimiter', function( delimiter, rows ) {})

  3. Another option on DsvParser for the user to pass in a set of delimiters to pick from. By default it can include commas, semicolons, and other common delimiters. Maybe also a blacklist.

  4. There will be an additional method on DsvParser, something along the lines of forceDelimiterGuess, which will force the delimiter event to be emitted, along with the rows, based on the information so far.

Other things:

Currently, the length of the guessed delimiter is kept at 1.

rei2hu · Nov 07 '18 22:11

Re: 2. A "delimiter" event is probably not necessary. I am not sure how that event would be used in userland. Maybe you have use cases in mind? Another approach would be that, once a delimiter is determined, the resolved rows are simply emitted one after the other using the "row" (?) event.

Re: 3. I think an allowed list could work. A disallowed list is probably less useful/necessary. In this case, the delimiter option could accept either a string or an array of strings (and, more generally, a string/Buffer/Uint8Array and arrays of those types, in the spirit of Node.js fs methods which accept string/Buffer/Uint8Array input args for paths, etc).

Re: 4. I am not sure this proposed method is necessary. I am just trying to think when someone would want to do this (i.e., only determine a DSV file delimiter without actually parsing the CSV into rows). I will, however, backtrack a bit. I can see the value of a "delimiter" event, but only for emitting the resolved delimiter(s). However, the event data should not include row data. Rather, I would argue that rows should always be emitted under a single event which, I believe, is "row". If we support a "delimiter" event upon resolving a delimiter (i.e., only when a specific delimiter is not specified), then a user could, in principle, "force" a delimiter guess by supplying limited DSV data and varying the number of lines buffered before attempting to guess a delimiter. In short, if a user wants to only extract a delimiter guess, they should be able to do so, but I do not believe this is a common enough use case to warrant a dedicated public method.

Re: delimiter guess length. That makes sense. Otherwise, it will become prohibitively expensive to determine.
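
Regarding the allowed list in Re: 3, a quick sketch of how the delimiter option might accept either a single value or a list of candidates (the constructor and option names here are illustrative, not final):

// Fixed delimiter: no sniffing performed...
var p1 = new DsvParser({ 'delimiter': ';' });

// Allowed candidates: the sniffer picks from this list...
var p2 = new DsvParser({ 'delimiter': [ ',', ';', '\t', '|' ] });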

kgryte · Nov 07 '18 23:11

When picking the delimiter, I imagined that there would be at least 2 conditions that must be fulfilled:

  1. A minimum row count
  2. All of the internal ratings for the delimiter guesses must fall below a certain threshold except for 1 of them. This is the one that will be picked as the delimiter (this isn't implemented yet, and the blacklist/whitelist should also help with this).

The second condition means it's possible for the delimiter to never be picked if the input is structured in a certain way, so I thought it would be a good idea to add a method for the user to use in case this happens.

As for the delimiter event, I thought I would try parsing the data internally once a delimiter gets picked, and only commit to it if the current data parses successfully, as another safeguard against picking an incorrect delimiter. I would just reuse what already exists, so I wouldn't be able to avoid emitting row and error events. With the delimiter event, someone could avoid these spurious emits by adding the listeners only once the delimiter has been picked.

var parsed;
p.on('delimiter', (delimiter, parsedSoFar) => {
    parsed = parsedSoFar;
    p.on('row', (row, rowCount) => { });
    p.on('end', (row, rowCount) => { });
});

I agree these occurrences will probably be very rare, so I'm probably overengineering.

rei2hu · Nov 08 '18 00:11

Do you have a "hard to guess" example? I would be curious to see what that might look like.

Re: events. The safeguard of analyzing the buffered data makes sense (i.e., once a delimiter has been guessed, then attempt to apply it when parsing). My understanding is that a "delimiter" event should only be emitted once. Once it has been emitted, any rows which are subsequently parsed (including those which were buffered during the initial guessing stage) would then be emitted via a "row" event.

Re: forced guess. Another way of dealing with this is that, if a single delimiter cannot be guessed, we just emit an "error" event saying that we cannot parse the DSV and, in the error message, request that a user either independently verify that file OR increase the number of rows analyzed before attempting to resolve a guess. If the parser is "ended" and a delimiter cannot be unambiguously determined, then we can also "error" stating that we could not determine the delimiter.

kgryte · Nov 08 '18 00:11

Yes, the delimiter event will only be emitted once and then everything will be normal after that.

For the forced guess, the way I imagined the sniffer would work is that, if the user has the option set when constructing DsvParser, anytime they call DsvParser.push(buffer), the data is redirected to and held by the sniffer itself. The sniffer will start processing once enough rows of data have been put in and will check the rows every time the DsvParser.push method is called, until a valid delimiter is found, which is when the delimiter event will be emitted. Before then, any row or error events will be from running a potentially valid delimiter. If the parser is ended before then via DsvParser.push(null), then an error can be thrown, but I'm not sure about before that, because it just means the sniffer needs more data.

The basic idea behind the current algorithm is to find the character that occurs x times in the largest number of rows. For example:

a,b,c,d,e,f,g
h,i,j,k,l,m,n

Here we can see that , appears 6 times in 2 of 2 rows (a rating of 1), while each letter appears once in 1 of 2 rows and 0 times in the other (a rating of 0.5), so , will be picked as the delimiter. If the data is modified so that k is replaced with c, as below, then a problem occurs:

a,b,c,d,e,f,g
h,i,j,c,l,m,n

Now c has a rating of 1, but so does ,. And it's plausible that c is the actual delimiter, in the sense that the data can be properly parsed if c is the delimiter. This can be avoided with the blacklist/whitelist idea, although there could be other problems when this is extended to other symbols. Another problem is dealing with quoted fields, because they can contain the delimiter or newlines, which can mess with the rankings.

So as long as another non-delimiter character appears the same number of times in every new row added, there is no way to tell which character is the delimiter.

I'm not sure how likely this is to occur in actual dsv data, and with more rows of data to use, it will most likely not pose a problem.

rei2hu · Nov 08 '18 01:11

Thanks for the explanation. My personal opinion is that no row events should be emitted until after a delimiter is determined (i.e., when X number of lines have been buffered and processed). I would avoid emitting row events based on potential delimiters.

Regarding the possibility that X number of lines have been queued and yet we are still unable to determine a best-guess delimiter, I think that emitting an error event is fine. As I see it, this is a user "error" for providing data and specifying an inappropriate parser configuration. I would like to avoid getting into the game of making the number of lines to analyze for a delimiter guess a "soft" limit. My preference would be to punt the responsibility to the user to specify a sufficient number of lines. If we cannot satisfy this hard limit, then the user needs to check his/her data and readjust. I see such error handling as a feature, rather than a bug, allowing us to flag and fail quickly. In the soft-limit case, suppose the second example is repeated for 1000s of lines; we could potentially never reach a resolution if we adopt a soft-limit policy, constantly trying to find the right number of lines such that we can "unambiguously" declare a delimiter.

Ah...and I should clarify something. I would not make the number of lines which must be queued before delimiter resolution a minimum, which is then adjusted based on subsequently pushed data. I would make it a hard limit. A user specifies opts = { 'watermark': 10 }. Once we have 10 lines, we analyze the data and attempt to determine a delimiter. If we fail, we emit an error and stop processing. Otherwise, we emit the delimiter guess and then emit the parsed rows. After which, we then emit a row whenever we have been provided sufficient data to resolve a new row.
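
Putting that together, the consumer-facing flow might look roughly like the following (again just a sketch, with illustrative option and event names):

var parser = new DsvParser({ 'watermark': 10 }); // analyze once 10 lines have been buffered

parser.on( 'delimiter', function onDelimiter( d ) {
    // Emitted once, after the buffered lines have been analyzed...
});
parser.on( 'row', function onRow( row, rowNumber ) {
    // The buffered rows are emitted here right after the delimiter event,
    // and subsequent rows are emitted as soon as enough data has been pushed...
});
parser.on( 'error', function onError( err ) {
    // Emitted if no delimiter can be determined from the buffered lines...
});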

kgryte · Nov 08 '18 01:11

I understand now. With a hard upper limit like you described I see why there's no need for a lot of things I was proposing.

rei2hu · Nov 08 '18 03:11

@rei2hu Is this ready for "formal" review or still a WIP? No pressure. Just want to make sure that you are not blocked and have not been waiting for PR feedback.

kgryte · Nov 23 '18 23:11

I haven't started on the other files typically found in each package (benchmarks, docs, examples), so I think it's still WIP.

For the committed files so far, I plan to clean up the comments in the main file and rework some things in the tests (wording and how the last test suite for delimiter guessing calculates things). The other existing files are complete, although that still leaves the README, package.json, and the fixtures, I believe.

rei2hu · Nov 24 '18 00:11

Closing this in favor of https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/utils/dsv/base/parse.

kgryte · Sep 15 '22 21:09