Implement CSV splitter, CSV scanner

dgoldenberg1234 opened this issue on Apr 05 '16 • 3 comments

We see two types of use-cases out there.

A CSV splitter is a common one: split CSV such that:

  • the first line is (optionally) the header
  • any empty lines are skipped
  • each CSV row is split into key-value (attribute-value) pairs, and the result is output as a Document with these attributes
  • the delimiter must be configurable and defaulted to a comma
  • a cell may span multiple lines, i.e. it may have embedded newline(s); such a cell is surrounded with quotes in CSV
  • rows may be jagged: a) if a row has fewer cells than the header specifies, cells are assigned to columns left to right and any missing values on the right are treated as null; b) if a row has more cells than the header specifies, that should be treated as an error (log and keep going). A minimal sketch of this mapping follows the list.
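
For illustration, here's a minimal sketch of the jagged-row mapping in Java, with a plain Map standing in for a Document's attributes (class and method names are invented for this sketch, not existing JesterJ API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

// Illustrative only: maps one CSV row onto header columns per the rules above.
public class JaggedRowMapper {
    private static final Logger LOG = Logger.getLogger(JaggedRowMapper.class.getName());

    /** Returns the attribute map for a row, or null if the row is invalid. */
    public static Map<String, String> mapRow(String[] header, String[] row) {
        if (row.length > header.length) {
            // more cells than header columns: an error -- log and keep going
            LOG.warning("Row has " + row.length + " cells but header has "
                    + header.length + " columns; skipping row");
            return null;
        }
        Map<String, String> attrs = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            // fewer cells than columns: assign left to right, missing values are null
            attrs.put(header[i], i < row.length ? row[i] : null);
        }
        return attrs;
    }
}
```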

The other use case is that one has large CSV file(s) that one wants to treat as "data source(s)"; so we'll want to have a scanner that reads them and does the same thing as the CSV splitter processor.

Additionally, users often want to provide a list (filter) of the columns that are of interest. The scanner and splitter shall honor this list. It could be a list of 0-based column indices, for example.

The easiest approach is to use OpenCSV for the split functionality.
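
For example, a minimal sketch of the split loop against OpenCSV (assuming OpenCSV 5.x; printing stands in for emitting a Document, and the delimiter and column filter are the configurable bits described above):

```java
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import java.io.FileReader;
import java.util.Set;

public class CsvSplitSketch {
    public static void main(String[] args) throws Exception {
        char delimiter = ',';                          // configurable, defaulted to a comma
        Set<Integer> columnsOfInterest = Set.of(0, 2); // optional 0-based column filter
        try (CSVReader reader = new CSVReaderBuilder(new FileReader("input.csv"))
                .withCSVParser(new CSVParserBuilder().withSeparator(delimiter).build())
                .build()) {
            String[] header = reader.readNext();       // first line as the (optional) header
            if (header == null) return;                // empty file
            String[] row;
            // readNext() already handles quoted cells with embedded newlines
            while ((row = reader.readNext()) != null) {
                if (row.length == 1 && row[0].isEmpty()) continue; // skip empty lines
                for (int i = 0; i < Math.min(row.length, header.length); i++) {
                    if (!columnsOfInterest.contains(i)) continue;  // honor the filter
                    System.out.println(header[i] + " = " + row[i]);
                }
            }
        }
    }
}
```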

dgoldenberg1234 • Apr 05 '16 23:04

It seems like there are two use cases here. One is scanning very large files, or possibly files being actively appended; the other is handling a blob of bytes that can reasonably be interpreted as CSV.

Thus it seems like this ticket might be broken into a LineByLine scanner with the following options (a possible configuration sketch follows the list):

  • line ending char(s)
  • tail (if true continue to try to read from the file)
  • quote matching enabled (lines must have an even number of non-escaped quotes)
    • quote char or quote pattern
    • quote escape char or quote escape pattern
  • Valid line start pattern (example: log lines must start with something that matches a date, or be considered part of the previous line)
  • Lines per Document (chunk the file by lines instead of one doc per line)
  • Duplicate first N lines on every Document
  • Ignore the last N lines of the document (incompatible with tail=true)
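
To make that concrete, a sketch of what such a configuration might look like (all names are invented for illustration; none of this is existing JesterJ API):

```java
import java.util.regex.Pattern;

// Hypothetical configuration holder for the proposed LineByLine scanner.
public class LineByLineScannerConfig {
    String lineEnding = "\n";        // line ending char(s)
    boolean tail = false;            // if true, keep trying to read as the file grows
    boolean quoteMatching = false;   // lines must have an even number of non-escaped quotes
    char quoteChar = '"';
    char quoteEscapeChar = '\\';
    Pattern validLineStart = null;   // e.g. a date pattern for log lines
    int linesPerDocument = 1;        // chunk size; 1 = one doc per line
    int duplicateFirstNLines = 0;    // repeat the first N lines on every Document
    int ignoreLastNLines = 0;        // incompatible with tail = true
}
```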

The documents picked up by the LineByLine file scanner could then be sent to a step with a processor that parses the bytes as CSV and maps columns onto a document for each row. It should have at least these options (sketched after the list):

  • Header Option
    • MEMORIZE - use the first row it ever encounters as field labels (for all future processing),
    • FIRST_ROW - use the first row of each document's byte[] as labels (for the case where the whole file was read, or where the scanner has duplicated the header row onto each chunk of the document)
    • STATIC - just accept a list of positional labels for the case of a previously determined CSV format.
  • List of fields (only used with STATIC option)
  • Character encoding
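
As a sketch of how the header options might dispatch (invented names; assuming the MEMORIZE state is retained across documents by the processor instance):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

enum HeaderOption { MEMORIZE, FIRST_ROW, STATIC }

// Hypothetical processor configuration, illustrative only.
class CsvProcessorConfig {
    HeaderOption headerOption = HeaderOption.FIRST_ROW;
    List<String> staticFields = List.of();          // only used with STATIC
    Charset encoding = StandardCharsets.UTF_8;
    private String[] memorized;                     // retained across documents

    String[] labelsFor(String[] firstRow) {
        switch (headerOption) {
            case MEMORIZE:
                if (memorized == null) memorized = firstRow; // first row ever seen
                return memorized;
            case FIRST_ROW:
                return firstRow;                    // per-document header
            default:
                return staticFields.toArray(new String[0]);  // preconfigured labels
        }
    }
}
```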

So perhaps split this into two tickets?

fsparv • Apr 06 '16 22:04

I feel like the LineByLine scanner (and probably a matching processor) is a separate issue.

It's possible that the CSV scanner and CSV processor could inherit some of that logic. But I also feel that CSV processing is easiest to implement by relying on OpenCSV, so we don't have to reinvent CSV parsing.

There's the general use-case of read-line-by-line (scanner and processor). Split XML, Split JSON, Split CSV all have their specifics.

Orthogonal to that is how the input to these is defined. My thinking was that scanners read specific files, while processors read the passed-in byte[] as a full CSV, XML, or JSON source - or, it seems, they may just be getting a single line / chunk of text interpretable as CSV, XML, or JSON.

All of these cases can be addressed by the tickets already filed, plus probably 2 more tickets: one for the LineByLine scanner, and one (or more) for interpreting a line (or a segment) as CSV, XML, JSON.

Thoughts?

dgoldenberg1234 • Apr 07 '16 00:04

I'm thinking that LineByLine can be a Split Text processor, with LineSplitCount (how many lines per segment), options for the header, and whether to remove newlines.
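
A minimal sketch of that chunking, assuming the lines have already been read and one Document is emitted per segment (names illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Split Text / LineSplitCount idea above.
public class SplitTextSketch {
    public static List<String> split(List<String> lines, int lineSplitCount,
                                     boolean removeNewlines) {
        List<String> segments = new ArrayList<>();
        String joiner = removeNewlines ? " " : "\n";
        for (int i = 0; i < lines.size(); i += lineSplitCount) {
            List<String> chunk =
                    lines.subList(i, Math.min(i + lineSplitCount, lines.size()));
            segments.add(String.join(joiner, chunk)); // one Document per segment
        }
        return segments;
    }
}
```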

Additionally we can have ExtractWithRegex or FilterWithRegex so we don't have the matching logic right in the splitters. Perhaps filtering is a separate operation...

I like the "tail" functionality.

Quote matching seems quite specific to CSV and may be best left to the CSV Splitter...

dgoldenberg1234 • Apr 07 '16 01:04

Generalizing this, the specific case of CSV (or JSON or TSV, or whatever) can be handled by a document processor (for JSON and some other cases we may want different splitting logic, but one JSON object per line is certainly a possible format too). Document ids will add #Ln to the normal file URL, where n is the line number; things like a CSV processor can then key off the line number to find headers (or not, as desired), etc.
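
For example (illustrative only, assuming the #Ln fragment convention described above):

```java
// Deriving and parsing the per-line document id scheme sketched above.
public class LineDocIds {
    public static void main(String[] args) {
        String fileUrl = "file:///data/input.csv";  // the normal file URL
        int line = 42;
        String docId = fileUrl + "#L" + line;       // file:///data/input.csv#L42
        // A CSV processor can key off the line number, e.g. treat #L1 as the header:
        int n = Integer.parseInt(docId.substring(docId.lastIndexOf("#L") + 2));
        System.out.println(docId + " header? " + (n == 1));
    }
}
```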

nsoft • Apr 08 '23 16:04