
Support for CSV-formatted strings

Open kenorb opened this issue 7 years ago • 37 comments

Examples that I would expect to work:

$ echo '"1,2,3"' | jq -R "@csv"
jq: error (at <stdin>:1): string ("\"1,2,3\"") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@csv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be tsv-formatted, only array

This works fine:

$ echo '[1,2,3]' | jq -r "@csv"
1,2,3
$ echo '[1,2,3]' | jq -r "@tsv"
1	2	3

However, the idea is to produce CSV from a string, not from an array.

My version:

$ jq --version
jq-1.5

kenorb avatar Apr 16 '18 21:04 kenorb

  1. Please read the jq documentation for @csv, the input of which must be a flat array.

  2. Consider this variant of your first try:

    echo '"1,2,3"' | jq -r 'split(",") | @csv' "1","2","3"

  3. For your reference, usage questions are best asked at stackoverflow.com with the jq tag: https://stackoverflow.com/questions/tagged/jq

pkoppstein avatar Apr 16 '18 21:04 pkoppstein

Does it mean the feature won't be implemented?

kenorb avatar Apr 16 '18 21:04 kenorb

Does it mean the feature won't be implemented?

I'm not sure what "feature" you mean, but as illustrated, it's trivially easy to convert a JSON string with embedded commas into valid CSV. If you want the numbers in the JSON string to be recognized as such, you could modify the filter to: split(",") | map(tonumber? // .) | @csv.
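
For example, applied to the original input, that variant should then emit plain, unquoted CSV:

$ echo '"1,2,3"' | jq -r 'split(",") | map(tonumber? // .) | @csv'
1,2,3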

In any case, to the best of my knowledge, there is no outstanding "enhancement request" under consideration for further changes to either @csv or @tsv.

pkoppstein avatar Apr 17 '18 02:04 pkoppstein

This doesn't work as expected for me, since split also splits on commas inside quoted fields:

$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@tsv'
1	"2a	2b"	3
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@csv'
"1","""2a","2b""","3"
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@text'
["1","\"2a","2b\"","3"]

So I know it's more than just splitting on commas. I've already looked at scan()/capture() as a potential approach, but the regex would be too complex.

I'm aware this workaround works:

$ echo '[1,"2a,2b",3]' | jq -r "@tsv"
1	2a,2b	3

but then the input is not proper CSV as expected, since the brackets need to be added, so the input requires additional external processing. So I believe processing could work the same way for strings, i.e. by supporting input without the [ and ] brackets, like:

$ echo '"1,\"2a,2b\",3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,\"2a,2b\...) cannot be tsv-formatted, only array

If I understand these filters (such as @csv, @tsv) correctly, they expect input in a specific format, and the limitation is that 'the input must be an array'. Other filters seem to accept plain input, but these specific ones do not support string input, so the feature request is about supporting strings as described, though I may misunderstand what 'input' means here.

I understand jq is not a dedicated converter, but I think it would be a good feature to support CSV input in string form as well. Related: GH-645.

Another use case: Converting CSV to TSV. A lot of CLI tools can do this simple conversion, so why not make jq better.

kenorb avatar Apr 17 '18 10:04 kenorb

  1. Currently, the best way to get jq to read a CSV file is to use EITHER a CSV-to-TSV or CSV-to-JSON program that can pipe into jq, OR a wrapper such as cq. For further details, see the jq FAQ: Can jq process CSV or TSV files?

  2. There has been some discussion about adding pre-filters, but given the backlog, such additions may be a long time in coming to fruition, especially given the predisposition of the jq maintainers to avoid new command-line options and bloat. That is, jq seeks to dovetail with other tools where available, which is the case with CSV, especially since jq can handle TSV input natively (a minimal example of that pattern is sketched below).
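
For reference, a minimal sketch of that native TSV-reading pattern (this assumes the fields contain no literal tabs, which is the case for @tsv output; the sample data here is made up):

$ printf 'a\tb c\td\n1\t2\t3\n' | jq -Rc 'split("\t")'
["a","b c","d"]
["1","2","3"]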

pkoppstein avatar Apr 17 '18 15:04 pkoppstein

Just having a use case for this, going to note the results of my brief research:

  • jq alone cannot really be used; many Stack Overflow examples use split(",") which doesn't really work if the data contains commas inside quoted fields, see e.g. https://github.com/stedolan/jq/issues/1650#issuecomment-381942184.
  • cq as suggested in the FAQ is relatively hard to get: it's part of jqt, which seems to be a small experimental project that doesn't even have binary releases on GitHub. The author states "jqt is developed under the Fedora Linux distribution, and a lot of portability issues are expected at this stage of development.".
  • Googling "csv to json command line" yields a couple of npm packages and quick'n'dirty scripts around the web. Unlike with jq itself, one has to try several options, evaluate them, etc.

So I don't think the CSV scenario currently has as good a solution as, e.g., YAML (which is very well covered by yq).

borekb avatar Nov 28 '18 09:11 borekb

I'm unclear as to what you're asking for.

Is it that you want to be able to read CSV/TSV into arrays, or to read CSV/TSV and pass it through?

nicowilliams avatar Dec 14 '18 16:12 nicowilliams

I do think it'd be nice if jq supported more input formats than raw and JSON. Ideally I'd like jq to support CSV/TSV as inputs (with and without value quoting), and maybe even YAML and XML. But if this isn't what you're asking for then never mind.

nicowilliams avatar Dec 14 '18 16:12 nicowilliams

@nicowilliams Yes, process CSV with jq, or as you say, generally more input formats.

borekb avatar Dec 14 '18 16:12 borekb

OK, thanks. We would definitely not use @ syntax for that -- that syntax has to be exclusively about output because otherwise we'd have confusing behavior, I think.

We'd have fromcsv and fromtsv builtins. In fact, you can code these in jq right now:

$ jq -cn '[range(5)]|@csv|split(",")'
["0","1","2","3","4"]

but that doesn't take into account quoting of string values. To handle that would require something more like a reduction-based parser:

    reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
        (.result|length|. -1) as $idx |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end) | .result;

That needs a bit more work, but it does work:

$ jq -Rcn '
def fromcsv:
    reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
        (.result|length|. -1) as $idx |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end) | .result; "\"a,b\",c,d e f,\"g,h i,j\""|fromcsv'
["a,b","c","d e f","g,h i,j"]
$ 

nicowilliams avatar Dec 14 '18 18:12 nicowilliams

That needs a bit more work, but it does work

Yes, it works well, except perhaps for the handling of embedded newlines. It would at least be desirable for fromcsv to be able to act as the inverse of @csv.

Consider therefore:

$ jq -rn '["a\nb","c"] | @csv'
"a
b","c"

In my view, the above is correct, but that is largely immaterial, because whether it's correct or not, we see that there's an invertibility issue:

$ jq -rn '["a\nb","c"] | @csv' | jq -Rcf fromcsv.jq
["a"]
["b\"","c"]

pkoppstein avatar Dec 15 '18 05:12 pkoppstein

@pkoppstein Yes, that's one of the things I meant by "it needs a bit more work".

nicowilliams avatar Dec 15 '18 06:12 nicowilliams

@nicowilliams - Yes, I guessed as much, but there is so much confusion/disagreement/inconsistency about this particular point, I thought it would be worthwhile illustrating how well-behaved @csv actually is.

pkoppstein avatar Dec 15 '18 06:12 pkoppstein

@kenorb, @pkoppstein Here you go:

$ jq -cn '
def fromcsv:
    def initialize: {result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
    foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (initialize;
        (.result|length|. -1) as $idx |
        if .ready then initialize else . end |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif $char == "\n" then
            .ready = true
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end;
        if .ready then .result else empty end); "\"a,\nb\",c,d e f,\"g,h i,j\"\n\"a,z\"\n"|fromcsv'
["a,\nb","c","d e f","g,h i,j"]
["a,z"]
$ 

The foreach adds a newline to the input if it doesn't already have it, then iterates over every character in the input. The update expression does all the parsing. The extract expression extracts complete outputs.
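
(For readers less familiar with the three-argument form, here is a tiny, unrelated sketch of foreach's INIT; UPDATE; EXTRACT structure, a running sum that emits each intermediate state:)

$ jq -nc '[foreach (1,2,3) as $x (0; . + $x; .)]'
[1,3,6]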

You can check that fromcsv consumes the outputs of @csv. Mind you, I've still not thought it entirely through, so there may well be a bug in there.

This pattern can be extended to parse things like XML too if you like (but, wow, that might be a lot of work!).

nicowilliams avatar Dec 17 '18 05:12 nicowilliams

Here's troublesome.csv, for which the invocation:

jq -rR 'include "fromcsv"; fromcsv|@csv'

should (I think we're agreed) be idempotent:

"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab","	","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"

With the most recent version, the "split" is lost entirely.

pkoppstein avatar Dec 17 '18 10:12 pkoppstein

@pkoppstein It works fine if you add -s to the jq command-line! Can you figure out why? :)
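
(For reference, the -s variant would be invoked roughly like this; with -R -s the whole file reaches fromcsv as a single string, so the raw newline inside the quoted "split line" field is handled by the parser instead of splitting the record across separate raw inputs:)

jq -rRs 'include "fromcsv"; fromcsv|@csv' troublesome.csv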

Anyways, that reminds me we probably want a fromcsv/1 as well:

def fromcsv_initialize_helper:
    {result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
def fromcsv_update_helper($char):
    (.result|length|. -1) as $idx |
    if .ready then fromcsv_initialize_helper else . end |
    if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
        .result[$idx] += $char
    elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
        .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
    elif .nextisbackslashquoted then
        .result[$idx] += $char
    elif .inquoted then
        .result[$idx] += $char
    elif $char == "," then
        .result += [""]
    elif $char == "\n" then
        .ready = true
    elif .result[$idx]|length == 0 and $char == "\"" then
        .inquoted = true
    else
        .result[$idx] += $char
    end;
def fromcsv(in):
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        fromcsv_initialize_helper;
        fromcsv_update_helper($char);
        if .ready then .result else empty end);
def fromcsv:
    foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        fromcsv_initialize_helper;
        fromcsv_update_helper($char);
        if .ready then .result else empty end);

And here's how to use it:

$ jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.//empty,inputs) | @csv' <<EOF
> "C1","C2","C3","C4","C5"
> "1"," ","5,9,13,17"
> "esctab","\t","10","14","18"
> "tab"," ","11","15","19"
> "4","8","12","16","20"
> "split
> line"
> "foo\\nbar"
> EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\nbar"
$ 
$ cat > /tmp/in <<EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab","	","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"
EOF
$ diff -u /tmp/in <(jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.,inputs) | @csv' < /tmp/in)
$ echo $?
0
$ 

Did you catch that? I passed .//empty,inputs to fromcsv/1. That allows me not to worry about whether -n was used on the command-line. I.e., if I forgot to add -n this works because the first input, the one read by jq itself, gets prepended to the stream from inputs, and if I did remember to add -n then the first input is null and then the //empty causes it to be ignored.
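
The same trick can be seen in isolation with plain numeric inputs (a quick sketch, independent of fromcsv):

$ echo '1 2 3' | jq -c '[. // empty, inputs]'
[1,2,3]
$ echo '1 2 3' | jq -nc '[. // empty, inputs]'
[1,2,3]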

Please beat up on this a bit more, then we can include this in the next release.

nicowilliams avatar Dec 17 '18 17:12 nicowilliams

I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.

fadado avatar Dec 17 '18 19:12 fadado

@nicowilliams wrote:

It works fine if you add -s to the jq command-line! Can you figure out why? :)

Yes, but if one uses -s in conjunction with this second version, it will fail with ordinary CSV, such as:

C1,C2,C3,C4,C5
1,"""5,9,13,17"""
esctab,\t,10,14,18
tab,	,11,15,19
4,8,12,16,20

Anyway, yes, having a stream-oriented version is the way to go! Fantastic! Alas, version 3 also fails with "ordinary CSV" as above....

pkoppstein avatar Dec 17 '18 19:12 pkoppstein

@fadado wrote:

I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.

Alas, using / is a little faster but requires way more memory:

$ /usr/bin/time -lp jq 'tostring|explode[]|[.]|implode| empty' jeopardy.json
user        31.83
sys          0.56
 984571904  maximum resident set size
    288772  page reclaims
     15149  involuntary context switches


$ /usr/bin/time -lp jq 'tostring | (./"")[] | empty' jeopardy.json
user        25.06
sys          2.97
4708450304  maximum resident set size
   1578353  page reclaims
     14215  involuntary context switches

pkoppstein avatar Dec 17 '18 19:12 pkoppstein

We do need an explodes that streams... Anyways, as to @pkoppstein's "ordinary CSV" (quotes in original)... well, it ain't very ordinary. I specifically thought of and decided to ignore improper quoting, mostly because a) it was more work than I wanted to do, b) I can't quite figure out the correct way to parse something like this "a"b",c, or worse: a"b,c or similar. One option is to preserve quotes in the middle without treating them specially, so that both "a"b",c and a"b,c get parsed as ["a\"b","c"]. This requires adding some state and state update logic:

def fromcsv(in):
    def initialize:
        {result:[""],
         ready:false,
         inquoted:false,
         nextisbackslashquoted:false,
         firstwasquote:false,
         lastwasquote:false};
    def update($char):
        (.result|length|. -1) as $idx
      | if .ready then initialize else . end
      | if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false            # should remember that next $char must be comma (or no next $char cause EOL)
          | .lastwasquote = true
        elif (.inquoted|not) and (.nextisbackslashquoted|not) and $idx > 0 and
             .lastwasquote and $char != "," and $char != "\n" then
            .result[$idx] += $char
          | .inquoted = .firstwasquote   # should remember that next $char must be comma (or no next $char cause EOL)
          | .lastwasquote = false
        elif .nextisbackslashquoted then
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted then
            .result[$idx] += $char
          | .lastwasquote = false
        elif $char == "," then
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif $char == "\n" then
            .ready = true
        elif (.result[$idx]|length) == 0 and $char == "\"" then
            .firstwasquote = true
          | .inquoted = true
        else
            .result[$idx] += $char
        end;
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        initialize;
        update($char);
        if .ready then .result else empty end);
def fromcsv: fromcsv(.);

nicowilliams avatar Dec 17 '18 20:12 nicowilliams

@nicowilliams - Yes, I also thought of the name explodes :-)

By "ordinary CSV" I mainly meant:

(a) there are no embedded raw newlines; (b) fields need not always be quoted; (c) values with any interior double-quotation marks must be quoted and the embedded quotes must be doubled.

I did not mean that fromcsv should handle pseudo-CSV, i.e. CSV that does not conform to any of the major "standards".

In particular, I agree that fromcsv should not be expected to tolerate improper quoting, such as the examples you give. (By contrast, the CSV value """5,9,13,17""" encodes the string "5,9,13,17".)

pkoppstein avatar Dec 17 '18 22:12 pkoppstein

@pkoppstein Oh? There is a standard for CSV? Do you have a link to a specification?

RFC 4180 is not a standard, FYI. There are a few W3C documents on the subject, such as https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/ and https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/ -- I'm not sure as to the status of those two.

Here's a blog post about this mess: https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/

Anyways, feel free to improve my fromcsv/1. I'm not going to spend any more time on it.

nicowilliams avatar Dec 17 '18 23:12 nicowilliams

@nicowilliams - I was careful to put "standard" in quotation marks. (Maybe I should have written """standard""" :-)

FWIW, it's been my view that CSV is such a mess, and jq's existing support for TSV and @csv is so good (especially in conjunction with third-party tools), that providing a full CSV parser should be viewed as a very low-priority goal, certainly not worthy of someone with your talents!

pkoppstein avatar Dec 17 '18 23:12 pkoppstein

@pkoppstein Ah yes, I missed that.

nicowilliams avatar Dec 17 '18 23:12 nicowilliams

Oh, so in CSV escaping is only of double quotes, and only by doubling them?

Aight, one more refinement. Honestly, this code does not look pretty to me. There has got to be a better way to write this...

def fromcsv(in; $allow_backslash_escaping):
    def initialize:
        {result:[""],
         ready:false,
         inquoted:false,
         nextisbackslashquoted:false,
         firstwasquote:false,
         lastwasquote:false};
    def update($char):
        (.result|length|. -1) as $idx
      | if .ready then initialize else . end
      | if $char != "\"" and $char != "," and
           $char != "\n" and $char != "\\" then # " close double quotes for vim
            # Not a special char, just add it now
            .result[$idx] += $char
          | .lastwasquote = false
        elif (.nextisbackslashquoted|not) and $allow_backslash_escaping and
             $char == "\\" then # " close double quotes for vim
            # Backslash escaping next $char
            .nextisbackslashquoted = true
          | .lastwasquote = true
        elif .lastwasquote and $char == "\"" then
            # double-quote-quoted double-quote
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted and (.lastwasquote|not) and $char == "\"" then
            # double-quote either ending a value or escaping the next
            # $char that should be a double-quote; next char will tell
            # us which it is
            .lastwasquote = true
        elif .inquoted and .lastwasquote and $char == "," then
            # comma-ending a field
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif .inquoted and .lastwasquote and $char == "\n" then
            # newline-ending a field
            .ready = true
        elif .inquoted or .nextisbackslashquoted then
            .result[$idx] += $char
          | .lastwasquote = false
          | .nextisbackslashquoted = false
        elif $char == "," then
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif $char == "\n" then
            .ready = true
        elif (.result[$idx]|length) == 0 and $char == "\"" then
            .firstwasquote = true
          | .lastwasquote = false
          | .inquoted = true
        else
            .result[$idx] += $char
        end;
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        initialize;
        update($char);
        if .ready then .result else empty end);
def fromcsv(stream): fromcsv(stream; false);
def fromcsv: fromcsv(.);

nicowilliams avatar Dec 18 '18 00:12 nicowilliams

[@nicowilliams - This is not a request or suggestion that you spend more time on this!]

The latest version works splendidly except when non-quoted fields are encountered. E.g. for the "ordinary" CSV presented above, the result has additional empty or null fields at the beginning of the follow-on records:

"C1","C2","C3","C4","C5"
"",,,,"1","""5,9,13,17"""
"",,,,,"esctab","\t","10","14","18"
"",,,,,,,,,"tab","	","11","15","19"
"",,,,,,,,,,,,,"4","8","12","16","20"

pkoppstein avatar Dec 18 '18 05:12 pkoppstein

Honestly, this code does not look pretty to me. There has got to be a better way to write this...

Yes: mimic the ABNF grammar in https://tools.ietf.org/html/rfc4180:

# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE =  %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

def fromcsv: # . is a record
    def csv($str; $len):
        # combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
        def plus(scanner): scanner | recurse(scanner); # Kleene+
        def optional(scanner): first(scanner , .);
        # scanners: position -> position
        def char(test): select(. < $len and ($str[.:.+1] | test)) | .+1; 
        def many($alphabet): select(. < $len) | last(plus(char(inside($alphabet)))) // empty;
        def CR: char(. == "\r");
        def LF: char(. == "\n");
        def COMMA: char(. == ",");
        def DQUOTE: char(. == "\"");
        def DQUOTE2: DQUOTE | DQUOTE;
        def TEXTDATA: char(.=="\t" or .>=" " and .!="," and .!="\"" and .!="\u007F");
        def SPACE: optional(many(" \t"));
        def non_escaped: recurse(TEXTDATA);
        def escaped: DQUOTE | recurse(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)) | DQUOTE;
        # Parse fields and records
        def field: . as $i | first((escaped|[true,$i,.]) , (non_escaped|[false,$i,.]));
        def record:
            def r:
                field as [$e,$i,$j]
                | if $e then $str[$i+1:$j-1] else $str[$i:$j] end
                , ($j | SPACE | COMMA | SPACE | r);
            0|r
        ;
        # Collect record fields
        [ record ]
    ;
    rtrimstr("\r\n") as $str | csv($str; $str|length)
;

def fromcsv(stream): # stream of records
    stream | fromcsv
;

Considerations:

  • This is an example of jq's capabilities for implementing backtracking parsers!
  • Input is assumed to be compliant with the RFC record production.
  • As a bonus I accept optional space around commas.
  • Space at the beginning/end of records is not accepted.
  • The management of records with embedded \n or \r inside quoted fields is the fromcsv caller's responsibility.
  • I ignored any efficiency considerations.

Happy hacking!

fadado avatar Dec 22 '18 17:12 fadado

@fadado Nice! Reminds me of how one does parsing in Icon. In Icon there's a "scanning context" that consists of a string and start and end offsets into the string, and parsing functions modify the current scanning context as they go, undoing that modification on backtracking. The current scanning context is akin to a dynamic global in a LISP, so that you can push new scanning contexts.

EDIT: What @fadado is doing is to set up things so that $str and $len and . define the current scanning context, with . as the current offset into $str. The parse finishes when . == $len. Backtracking works because if you end up outputting an earlier value of . then you've rewound, so an alternative can be used.
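
(To make the "position -> position" idea concrete, here is a toy scanner in the same style; the name char and the literal "abc" are purely illustrative, not part of either module:)

$ jq -n 'def char($s; test): select(. < ($s|length) and ($s[.:.+1] | test)) | . + 1;
         0 | char("abc"; . == "a") | char("abc"; . == "b")'
2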

@fadado's code isn't quite right though, as I'm sure some might have noticed. I think making it perfect will require changing . to be a bit more complex than just an offset into $str...

nicowilliams avatar Dec 22 '18 22:12 nicowilliams

@fadado Here's an improved version of yours:

# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE =  %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

def fromcsv(stream): # . is a record
    # We keep track of the string and its length via the $str and $len
    # argument bindings of the csv/2 local function below.
    #
    # We keep "mutable" state as the input to the csv/2 local function
    # and all its local functions.
    #
    # This is very similar to how string parsing is done in Icon, which
    # has a "string scanning context" that consists of a string and an
    # offset into that string.
    #
    # Backtracking is accomplished by using `empty` and ultimately
    # causing a different `.` to be output (or none at all).
    #
    def init($str): # Setup / update parse state:
        {o:0,             # offset of current/next char
         u:"",            # accumulator of current field's content
         r:[],            # accumulator of current record's fields
         start:-1,        # start of current field (including dquote)
         end:-1,          # end   of current field
         wantmore:false,  # true if we need more input for split dquoted-string
         s:$str,          # copy of $str used for wantmore processing
        };

    # Setup string scanning context and parse it:
    def csv($str; $len):
        def fieldstart: .start=.o;
        def fieldend: .end=.o;
        def update_escaped: (.u = .u + $str[.ddquote:.o - 1]) | (.ddquote = .o);
        def finish_escaped:  .u = .u + $str[.ddquote:.end];
        def extract_non_escaped: (.u=$str[.start:.end]);
        def wantmore: .wantmore = true;
        # Combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
        def plus(scanner): scanner | recurse(scanner); # Kleene+
        def optional(scanner): first(scanner , .);
        # Scanners: current offset -> new offset
        def char(test): select(.o < $len and ($str[.o:.o+1] | test)) | (.o+=1);
        def many(test): select(.o < $len) | last(plus(char(test))) // empty;
        # Terminals
        def CR: char(. == "\r");
        def LF: char(. == "\n");
        def COMMA: char(. == ",");
        def DQUOTE: char(. == "\"");
        def DQUOTE2:
            DQUOTE | DQUOTE
          | update_escaped;
        def TEXTDATA: many((.=="\t" or .>=" ") and . != "," and . != "\"" and . != "\u007F");
        # Rules
        def escaped:      # ABNF: escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
            fieldstart
          | DQUOTE
          | (.ddquote=.o)
          | last(plus(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)))
          | fieldend
          | ((DQUOTE | finish_escaped) // wantmore);
        def non_escaped:  # ABNF: non-escaped = *TEXTDATA
            fieldstart
          | TEXTDATA
          | fieldend
          | extract_non_escaped;
        # Parse fields and records
        def field:        # ABNF: field = (escaped / non-escaped)
            first(escaped, non_escaped);
        #def record: field | last(plus(COMMA | field));
        def record:       # ABNF: record = field *(COMMA field)
            # This one is complicated because we need to handle
            # .wantmore == true.  It can probably be simplified though.
            def r:
                if .wantmore then .wantmore = false else .  end
              | field
              | if .wantmore then .
                else
                    . as $i
                  | .r += [.u]
                  | . as $i
                  | ((.start=-1 | .end=-1 | COMMA | r) // $i)
                end;
            r;
        # Collect record fields
        record ;
    foreach stream as $str (
        init("");
        if .wantmore then       # Append new input to prev and restart parse
            .s += $str
          | .o = .start
        else init($str) end     # Reset state for new parse
      | csv(.s; .s|length);     # Parse
        select(.wantmore|not).r # Output complete parses
    );

def fromcsv: fromcsv(.);
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")'
["a\n\"b\"","c"]
["a\n\"b\"","c"]
["a","b c","d"]
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")|@csv'
"a
""b""","c"
"a
""b""","c"
"a","b c","d"
$ 

nicowilliams avatar Dec 24 '18 01:12 nicowilliams

@fadado - Thank you so much for your inspiration.

Over at https://gist.github.com/pkoppstein/bbbbdf7489c8c515680beb1c75fa59f2 I've posted a parser that is similar to @fadado's but with an emphasis on efficiency (see "Performance" below), some error reporting (record number and context), and the "proper" handling of various edge cases. The parser is inspired by published PEG grammars, and can largely be viewed as a PEG parser.

The main filters are fromcsv and fromcsvfile: the former handles raw NEWLINEs within quoted fields on a per-string basis. If a CSV file has quoted fields with embedded raw NEWLINEs, then for it to be processed properly, it must either be "slurped" or fed to fromcsvfile.

Some other noteworthy points:

  1. The parser uses jq's support for regular expressions, for speed and simplicity.

  2. Various complexities of CSV are handled, including:

    • CRLF and CR are allowed as end-of-record markers;
    • empty fields can be represented by adjacent double quotes, e.g. a,"","""",d => ["a","","","d"]
    • lines such as a,,,b and ,,, are handled
    • a, ,b is interpreted as a," ",b
    • quoted fields can contain raw newlines
  3. Spaces around quoted fields are ignored.

Examples:

Invocation: jq -nRc 'include "fromcsvfile"; fromcsvfile' <<< TEXT

  1. Input need not be rectangular

$'a,b\nc' => ["a","b"] ["c"]

  2. Null fields can be specified in several ways

$'a,"",b,,c\n,,' #=> ["a","","b","","c"] ["","",""]

  3. Empty lines are ignored

$'a,b\n\nc,d' #=> ["a","b"] ["c","d"]

  4. Edge cases

a => ["a"]

$'a\n' => ["a"]

Performance:

There is a sample CSV file at http://support.spatialkey.com/spatialkey-sample-csv-data/ called Sacramentorealestatetransactions.csv It uses '\r' as the line delimiter so I ran tr '\r' '\n' to convert it for testing purposes here.

IN=Sacramentorealestatetransactions.csv

/usr/bin/time -lp jq -nRc 'include "fromcsvfile"; fromcsvfile' $IN > OUT 
user         0.77
sys          0.00
   2473984  maximum resident set size
/usr/bin/time -lp jq -Rc 'include "fromcsvfile"; fromcsv'  $IN > OUT
user         0.73
sys          0.00
   2322432  maximum resident set size
# @nicowilliams' parser as posted earlier in this thread
/usr/bin/time -lp jq -nRc 'include "nw-fromcsv"; fromcsv(inputs)' $IN > OUT
user         1.33
sys          0.00
   2379776  maximum resident set size

pkoppstein avatar Dec 24 '18 04:12 pkoppstein