jq
Support for CSV-formatted strings
Examples that I would expect to work:
$ echo '"1,2,3"' | jq -R "@csv"
jq: error (at <stdin>:1): string ("\"1,2,3\"") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@csv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be tsv-formatted, only array
This works fine:
$ echo '[1,2,3]' | jq -r "@csv"
1,2,3
$ echo '[1,2,3]' | jq -r "@tsv"
1 2 3
However, the idea is to produce CSV from a string, not from an array.
My version:
$ jq --version
jq-1.5
-
Please read the jq documentation for @csv, the input of which must be a flat array.
-
Consider this variant of your first try:
echo '"1,2,3"' | jq -r 'split(",") | @csv' "1","2","3"
-
For your reference, usage questions are best asked at stackoverflow.com with the jq tag: https://stackoverflow.com/questions/tagged/jq
Does it mean the feature won't be implemented?
Does it mean the feature won't be implemented?
I'm not sure what "feature" you mean, but as illustrated, it's trivially easy to convert a JSON string with embedded commas into valid CSV, and if you want the numbers in the JSON string to be recognized as such, you could modify the filter to: split(",") | map(tonumber? // .) | @csv.
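For instance, a quick check of that filter against the original input:
$ echo '"1,2,3"' | jq -r 'split(",") | map(tonumber? // .) | @csv'
1,2,3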
In any case, to the best of my knowledge, there is no outstanding "enhancement request" under consideration for further changes to either @csv or @tsv.
This doesn't work as expected for me, because split also splits on commas inside quotes:
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@tsv'
1 "2a 2b" 3
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@csv'
"1","""2a","2b""","3"
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@text'
["1","\"2a","2b\"","3"]
So I know it's more than just splitting on commas. I've already looked at scan()/capture() as potential tools, but the regex would be too complex.
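For illustration, a partial scan()-based sketch: the regex below happens to handle this particular input, but it copes with neither doubled quotes nor empty fields, which is exactly why the regex route gets complex:
$ echo '1,"2a,2b",3' | jq -Rc '[scan("\"[^\"]*\"|[^,]+")]'
["1","\"2a,2b\"","3"]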
I'm aware this workaround works:
$ echo '[1,"2a,2b",3]' | jq -r "@tsv"
1 2a,2b 3
but then the input is not proper CSV, since the brackets have to be added first, so the input requires additional external processing. So I believe string processing could work the same way, supporting strings without the [ and ] brackets, like:
$ echo '"1,\"2a,2b\",3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,\"2a,2b\...) cannot be tsv-formatted, only array
If I understand correctly, filters such as @csv and @tsv expect input in a specific format, and the limitation is that 'the input must be an array'. Other filters seem to accept plain input, but these particular ones do not accept strings, so this feature request is about supporting strings as described, though I may be misunderstanding what 'input' means here.
I understand jq is not a proper converter, but I think it would be a good feature to support CSV input in string form as well. Related: GH-645.
Another use case: converting CSV to TSV. A lot of CLI tools can do this simple conversion, so why not make jq better.
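For quote-free CSV this is already a one-liner; a minimal sketch of the use case, not a general solution:
$ echo '1,2,3' | jq -Rr 'split(",") | @tsv'
1	2	3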
-
Currently, the best way to get jq to read a CSV file is to use either a CSV-to-TSV or CSV-to-JSON program whose output can be piped into jq, or perhaps to use a wrapper such as cq. For further details, see the jq FAQ: Can jq process CSV or TSV files?
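To sketch the second half of such a pipeline: once a converter has produced TSV, jq can consume it with -R and split, for example:
$ printf 'a\tb\nc\td\n' | jq -Rc 'split("\t")'
["a","b"]
["c","d"]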
-
There has been some discussion about adding pre-filters, but given the backlog, such additions may be a long time in coming to fruition, especially given the predisposition of the jq maintainers to avoid new command-line options and bloat. That is, jq seeks to dovetail with other tools where available, which is the case with CSV, especially since jq can handle TSV input natively.
Just having a use case for this, going to note the results of my brief research:
- jq alone cannot really be used: many Stack Overflow examples use split(","), which doesn't really work if the data contains escaped commas, see e.g. https://github.com/stedolan/jq/issues/1650#issuecomment-381942184.
- cq as suggested in the FAQ is relatively hard to get to: it's part of jqt, which seems to be a small experimental project that doesn't even have binary releases on GitHub. The author states "jqt is developed under the Fedora Linux distribution, and a lot of portability issues are expected at this stage of development."
- Googling "csv to json command line" yields a couple of npm packages and quick'n'dirty scripts around the web. Unlike with jq itself, one has to try several options, evaluate them, etc.
So I don't think that CSV scenario has, currently, a similarly good solution as e.g. YAML (which is very-well covered by yq).
I'm unclear as to what you're asking for.
Is it that you want to be able to read CSV/TSV into arrays, or to read CSV/TSV and pass it through?
I do think it'd be nice if jq supported more input formats than raw and JSON. Ideally I'd like jq to support CSV/TSV as inputs (with and without value quoting), and maybe even YAML and XML. But if this isn't what you're asking for then never mind.
@nicowilliams Yes, process CSV with jq, or as you say, generally more input formats.
OK, thanks. We would definitely not use @ syntax for that -- that syntax has to be exclusively about output because otherwise we'd have confusing behavior, I think.
We'd have fromcsv and fromtsv builtins. In fact, you can code these in jq right now:
$ jq -cn '[range(5)]|@csv|split(",")'
["0","1","2","3","4"]
but that doesn't take into account quoting of string values. To handle that would require something more like a reduction-based parser:
reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
(.result|length|. -1) as $idx |
if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
.result[$idx] += $char
elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
.inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
elif .nextisbackslashquoted then
.result[$idx] += $char
elif .inquoted then
.result[$idx] += $char
elif $char == "," then
.result += [""]
elif .result[$idx]|length == 0 and $char == "\"" then
.inquoted = true
else
.result[$idx] += $char
end) | .result;
That needs a bit more work, but it does work:
$ jq -Rcn '
def fromcsv:
reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
(.result|length|. -1) as $idx |
if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
.result[$idx] += $char
elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
.inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
elif .nextisbackslashquoted then
.result[$idx] += $char
elif .inquoted then
.result[$idx] += $char
elif $char == "," then
.result += [""]
elif .result[$idx]|length == 0 and $char == "\"" then
.inquoted = true
else
.result[$idx] += $char
end) | .result; "\"a,b\",c,d e f,\"g,h i,j\""|fromcsv'
["a,b","c","d e f","g,h i,j"]
$
That needs a bit more work, but it does work
Yes, it works well, except perhaps for the handling of embedded carriage-returns. It would at least be desirable for fromcsv to be able to act as the inverse of @csv.
Consider therefore:
$ jq -rn '["a\nb","c"] | @csv'
"a
b","c"
In my view, the above is correct, but that is largely immaterial, because whether it's correct or not, we see that there's an invertibility issue:
$ jq -rn '["a\nb","c"] | @csv' | jq -Rcf fromcsv.jq
["a"]
["b\"","c"]
@pkoppstein Yes, that's one of the things I meant by "it needs a bit more work".
@nicowilliams - Yes, I guessed as much, but there is so much confusion/disagreement/inconsistency about this particular point, I thought it would be worthwhile illustrating how well-behaved @csv actually is.
@kenorb, @pkoppstein Here you go:
$ jq -cn '
def fromcsv:
def initialize: {result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (initialize;
(.result|length|. -1) as $idx |
if .ready then initialize else . end |
if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
.result[$idx] += $char
elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
.inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
elif .nextisbackslashquoted then
.result[$idx] += $char
elif .inquoted then
.result[$idx] += $char
elif $char == "," then
.result += [""]
elif $char == "\n" then
.ready = true
elif .result[$idx]|length == 0 and $char == "\"" then
.inquoted = true
else
.result[$idx] += $char
end;
if .ready then .result else empty end); "\"a,\nb\",c,d e f,\"g,h i,j\"\n\"a,z\"\n"|fromcsv'
["a,\nb","c","d e f","g,h i,j"]
["a,z"]
$
The foreach adds a newline to the input if it doesn't already have it, then iterates over every character in the input. The update expression does all the parsing. The extract expression extracts complete outputs.
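As an aside, the general shape of that pattern is foreach STREAM as $x (INIT; UPDATE; EXTRACT), where the extract expression can emit nothing (via select or empty) to suppress incomplete states. A toy illustration, unrelated to CSV:
$ jq -nc 'foreach (1,2,3,4) as $n (0; . + $n; select(. % 2 == 0))'
6
10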
You can check that fromcsv consumes the outputs of @csv. Mind you, I've still not thought it entirely through, so there may well be a bug in there.
This pattern can be extended to parse things like XML too if you like (but, wow, that might be a lot of work!).
Here's troublesome.csv, for which the invocation:
jq -rR 'include "fromcsv"; fromcsv|@csv'
should (I think we're agreed) be idempotent:
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"
With the most recent version, the "split" is lost entirely.
@pkoppstein It works fine if you add -s to the jq command-line! Can you figure out why? :)
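(Presumably: with -R but without -s, jq hands the filter one line at a time, so the quoted field spanning two lines never reaches the parser in one piece, and the unterminated quote swallows the rest of that line; with -s the whole file arrives as a single string, e.g.:
$ jq -srR 'include "fromcsv"; fromcsv | @csv' troublesome.csv
)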
Anyways, that reminds me we probably want a fromcsv/1 as well:
def fromcsv_initialize_helper:
{result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
def fromcsv_update_helper($char):
(.result|length|. -1) as $idx |
if .ready then fromcsv_initialize_helper else . end |
if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
.result[$idx] += $char
elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
.inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
elif .nextisbackslashquoted then
.result[$idx] += $char
elif .inquoted then
.result[$idx] += $char
elif $char == "," then
.result += [""]
elif $char == "\n" then
.ready = true
elif .result[$idx]|length == 0 and $char == "\"" then
.inquoted = true
else
.result[$idx] += $char
end;
def fromcsv(in):
foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
fromcsv_initialize_helper;
fromcsv_update_helper($char);
if .ready then .result else empty end);
def fromcsv:
foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
fromcsv_initialize_helper;
fromcsv_update_helper($char);
if .ready then .result else empty end);
And here's how to use it:
$ jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.//empty,inputs) | @csv' <<EOF
> "C1","C2","C3","C4","C5"
> "1"," ","5,9,13,17"
> "esctab","\t","10","14","18"
> "tab"," ","11","15","19"
> "4","8","12","16","20"
> "split
> line"
> "foo\\nbar"
> EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\nbar"
$
$ cat > /tmp/in <<EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"
EOF
$ diff -u /tmp/in <(jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.,inputs) | @csv' < /tmp/in)
$ echo $?
0
$
Did you catch that? I passed .//empty,inputs to fromcsv/1. That allows me not to worry about whether -n was used on the command-line. I.e., if I forgot to add -n this works because the first input, the one read by jq itself, gets prepended to the stream from inputs, and if I did remember to add -n then the first input is null and then the //empty causes it to be ignored.
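A minimal demonstration of the idiom on its own, with and without -n:
$ printf 'a,b\nc,d\n' | jq -Rr 'first(.//empty, inputs)'
a,b
$ printf 'a,b\nc,d\n' | jq -nRr 'first(.//empty, inputs)'
a,b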
Please beat up on this a bit more, then we can include this in the next release.
I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.
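For reference, the two spellings produce the same stream:
$ jq -cn '"abc" | [explode[]|[.]|implode]'
["a","b","c"]
$ jq -cn '"abc" | [(./"")[]]'
["a","b","c"]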
@nicowilliams wrote:
It works fine if you add -s to the jq command-line! Can you figure out why? :)
Yes, but if one uses -s in conjunction with this second version, it will fail with ordinary CSV, such as:
C1,C2,C3,C4,C5
1,"""5,9,13,17"""
esctab,\t,10,14,18
tab, ,11,15,19
4,8,12,16,20
Anyway, yes, having a stream-oriented version is the way to go! Fantastic! Alas, version 3 also fails with "ordinary CSV" as above....
@fadado wrote:
I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.
Alas, using / is a little faster but requires way more memory:
$ /usr/bin/time -lp jq 'tostring|explode[]|[.]|implode| empty' jeopardy.json
user 31.83
sys 0.56
984571904 maximum resident set size
288772 page reclaims
15149 involuntary context switches
$ /usr/bin/time -lp jq 'tostring | (./"")[] | empty' jeopardy.json
user 25.06
sys 2.97
4708450304 maximum resident set size
1578353 page reclaims
14215 involuntary context switches
We do need an explodes that streams... Anyways, as to @pkoppstein's "ordinary CSV" (quotes in original)... well, it ain't very ordinary. I specifically thought of and decided to ignore improper quoting, mostly because a) it was more work than I wanted to do, b) I can't quite figure out the correct way to parse something like this "a"b",c, or worse: a"b,c or similar. One option is to preserve quotes in the middle without treating them specially, so that both "a"b",c and a"b,c get parsed as ["a\"b","c"]. This requires adding some state and state update logic:
def fromcsv(in):
def initialize:
{result:[""],
ready:false,
inquoted:false,
nextisbackslashquoted:false,
firstwasquote:false,
lastwasquote:false};
def update($char):
(.result|length|. -1) as $idx
| if .ready then initialize else . end
| if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
.result[$idx] += $char
| .lastwasquote = false
elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
.inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
| .lastwasquote = true
elif (.inquoted|not) and (.nextisbackslashquoted|not) and $idx > 0 and
.lastwasquote and $char != "," and $char != "\n" then
.result[$idx] += $char
| .inquoted = .firstwasquote # should remember that next $char must be comma (or no next $char cause EOL)
| .lastwasquote = false
elif .nextisbackslashquoted then
.result[$idx] += $char
| .lastwasquote = false
elif .inquoted then
.result[$idx] += $char
| .lastwasquote = false
elif $char == "," then
.result += [""]
| .inquoted = false
| .lastwasquote = false
| .firstwasquote = false
elif $char == "\n" then
.ready = true
elif (.result[$idx]|length) == 0 and $char == "\"" then
.firstwasquote = true
| .inquoted = true
else
.result[$idx] += $char
end;
foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
initialize;
update($char);
if .ready then .result else empty end);
def fromcsv: fromcsv(.);
@nicowilliams - Yes, I also thought of the name explodes :-)
By "ordinary CSV" I mainly meant:
(a) there are no embedded raw newlines; (b) fields need not always be quoted; (c) values with any interior double-quotation marks must be quoted and the embedded quotes must be doubled.
I did not mean that fromcsv should handle pseudo-CSV, i.e. CSV that does not conform to any of the major "standards".
In particular, I agree that fromcsv should not be expected to tolerate improper quoting, such as the examples you give. (By contrast, the CSV value """5,9,13,17""" encodes the string "5,9,13,17".)
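For instance, @csv already applies rule (c) on output:
$ jq -rn '["say \"hi\"","b"] | @csv'
"say ""hi""","b"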
@pkoppstein Oh? There is a standard for CSV? Do you have a link to a specification?
RFC 4180 is not a standard, FYI. There's a few W3C documents on the subject, such as https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/ and https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/ -- I'm not sure as to the status of those two.
Here's a blog post about this mess: https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/
Anyways, feel free to improve my fromcsv/1. I'm not going to spend any more time on it.
@nicowilliams - I was careful to put "standard" in quotation marks. (Maybe I should have written """standard""" :-)
FWIW, it's been my view that CSV is such a mess, and jq's existing support for TSV and @csv is so good (especially in conjunction with third-party tools), that providing a full CSV parser should be viewed as a very low-priority goal, certainly not worthy of someone with your talents!
@pkoppstein Ah yes, I missed that.
Oh, so in CSV escaping is only of double quotes, and only by doubling them?
Aight, one more refinement. Honestly, this code does not look pretty to me. There has got to be a better way to write this...
def fromcsv(in; $allow_backslash_escaping):
def initialize:
{result:[""],
ready:false,
inquoted:false,
nextisbackslashquoted:false,
firstwasquote:false,
lastwasquote:false};
def update($char):
(.result|length|. -1) as $idx
| if .ready then initialize else . end
| if $char != "\"" and $char != "," and
$char != "\n" and $char != "\\" then # " close double quotes for vim
# Not a special char, just add it now
.result[$idx] += $char
| .lastwasquote = false
elif (.nextisbackslashquoted|not) and $allow_backslash_escaping and
$char == "\\" then # " close double quotes for vim
# Backslash escaping next $char
.nextisbackslashquoted = true
| .lastwasquote = true
elif .lastwasquote and $char == "\"" then
# double-quote-quoted double-quote
.result[$idx] += $char
| .lastwasquote = false
elif .inquoted and (.lastwasquote|not) and $char == "\"" then
# double-quote either ending a value or escaping the next
# $char that should be a double-quote; next char will tell
# us which it is
.lastwasquote = true
elif .inquoted and .lastwasquote and $char == "," then
# comma-ending a field
.result += [""]
| .inquoted = false
| .lastwasquote = false
| .firstwasquote = false
elif .inquoted and .lastwasquote and $char == "\n" then
# newline-ending a field
.ready = true
elif .inquoted or .nextisbackslashquoted then
.result[$idx] += $char
| .lastwasquote = false
| .nextisbackslashquoted = false
elif $char == "," then
.result += [""]
| .inquoted = false
| .lastwasquote = false
| .firstwasquote = false
elif $char == "\n" then
.ready = true
elif (.result[$idx]|length) == 0 and $char == "\"" then
.firstwasquote = true
| .lastwasquote = false
| .inquoted = true
else
.result[$idx] += $char
end;
foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
initialize;
update($char);
if .ready then .result else empty end);
def fromcsv(stream): fromcsv(stream; false);
def fromcsv: fromcsv(.);
[@nicowilliams - This is not a request or suggestion that you spend more time on this!]
The latest version works splendidly except when non-quoted fields are encountered. E.g. for the "ordinary" CSV presented above, the result has additional empty or null fields at the beginning of the follow-on records:
"C1","C2","C3","C4","C5"
"",,,,"1","""5,9,13,17"""
"",,,,,"esctab","\t","10","14","18"
"",,,,,,,,,"tab"," ","11","15","19"
"",,,,,,,,,,,,,"4","8","12","16","20"
Honestly, this code does not look pretty to me. There has got to be a better way to write this...
Yes: mimic the ABNF in https://tools.ietf.org/html/rfc4180:
# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE = %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
def fromcsv: # . is a record
def csv($str; $len):
# combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
def plus(scanner): scanner | recurse(scanner); # Kleene+
def optional(scanner): first(scanner , .);
# scanners: position -> position
def char(test): select(. < $len and ($str[.:.+1] | test)) | .+1;
def many($alphabet): select(. < $len) | last(plus(char(inside($alphabet)))) // empty;
def CR: char(. == "\r");
def LF: char(. == "\n");
def COMMA: char(. == ",");
def DQUOTE: char(. == "\"");
def DQUOTE2: DQUOTE | DQUOTE;
def TEXTDATA: char(.=="\t" or .>=" " and .!="," and .!="\"" and .!="\u007F");
def SPACE: optional(many(" \t"));
def non_escaped: recurse(TEXTDATA);
def escaped: DQUOTE | recurse(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)) | DQUOTE;
# Parse fields and records
def field: . as $i | first((escaped|[true,$i,.]) , (non_escaped|[false,$i,.]));
def record:
def r:
field as [$e,$i,$j]
| if $e then $str[$i+1:$j-1] else $str[$i:$j] end
, ($j | SPACE | COMMA | SPACE | r);
0|r
;
# Collect record fields
[ record ]
;
rtrimstr("\r\n") as $str | csv($str; $str|length)
;
def fromcsv(stream): # stream of records
stream | fromcsv
;
Considerations:
- This is an example of jq's capability to implement backtracking parsers!
- Input is assumed to be compliant with the RFC record production.
- As a bonus I accept optional space around commas.
- Space at the beginning/end of records is not accepted.
- The management of records with embedded \n or \r inside quoted fields is the fromcsv caller's responsibility.
- I ignored any efficiency considerations.
Happy hacking!
@fadado Nice! Reminds me of how one does parsing in Icon. In Icon there's a "scanning context" that consists of a string and start and end offsets into the string, and parsing functions basically modify the current scanning context as they go, on backtracking undoing their side-effect of modifying the current scanning context. The current scanning context is akin to a dynamic global in a LISP, so that you can push new scanning contexts.
EDIT: What @fadado is doing is to set things up so that $str and $len and . define the current scanning context, with . as the current offset into $str. The parse finishes when . == $len. Backtracking works because if you end up outputting an earlier value of . then you've rewound so an alternative can be used.
@fadado's code isn't quite right though, as I'm sure some might have noticed. I think making it perfect will require changing . to be a bit more complex than just an offset into $str...
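To make the scanning-context idea concrete, here is a stripped-down sketch in the same style (the names demo and letter are just for this illustration): . is the current offset, a scanner maps an offset to the next offset, and first(...) commits to the first alternative that succeeds:
$ jq -n '
  def demo($str):
    # A scanner: advance by one character if it passes the test, else backtrack
    def char(test): select(. < ($str|length) and ($str[.:.+1] | test)) | . + 1;
    def letter: char(. >= "a" and . <= "z");
    # Try two letters first, fall back to one
    first((letter | letter) , letter);
  0 | demo("ab,c")'
2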
@fadado Here's an improved version of yours:
# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE = %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
def fromcsv(stream): # . is a record
# We keep track of the string and its length via the $str and $len
# argument bindings of the csv/2 local function below.
#
# We keep "mutable" state as the input to the csv/2 local function
# and all its local functions.
#
# This is very similar to how string parsing is done in Icon, which
# has a "string scanning context" that consists of a string and an
# offset into that string.
#
# Backtracking is accomplished by using `empty` and ultimately
# causing a different `.` to be output (or none at all).
#
def init($str): # Setup / update parse state:
{o:0, # offset of current/next char
u:"", # accumulator of current field's content
r:[], # accumulator of current record's fields
start:-1, # start of current field (including dquote)
end:-1, # end of current field
wantmore:false, # true if we need more input for split dquoted-string
s:$str, # copy of $str used for wantmore processing
};
# Setup string scanning context and parse it:
def csv($str; $len):
def fieldstart: .start=.o;
def fieldend: .end=.o;
def update_escaped: (.u = .u + $str[.ddquote:.o - 1]) | (.ddquote = .o);
def finish_escaped: .u = .u + $str[.ddquote:.end];
def extract_non_escaped: (.u=$str[.start:.end]);
def wantmore: .wantmore = true;
# Combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
def plus(scanner): scanner | recurse(scanner); # Kleene+
def optional(scanner): first(scanner , .);
# Scanners: current offset -> new offset
def char(test): select(.o < $len and ($str[.o:.o+1] | test)) | (.o+=1);
def many(test): select(.o < $len) | last(plus(char(test))) // empty;
# Terminals
def CR: char(. == "\r");
def LF: char(. == "\n");
def COMMA: char(. == ",");
def DQUOTE: char(. == "\"");
def DQUOTE2:
DQUOTE | DQUOTE
| update_escaped;
def TEXTDATA: many((.=="\t" or .>=" ") and . != "," and . != "\"" and . != "\u007F");
# Rules
def escaped: # ABNF: escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
fieldstart
| DQUOTE
| (.ddquote=.o)
| last(plus(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)))
| fieldend
| ((DQUOTE | finish_escaped) // wantmore);
def non_escaped: # ABNF: non-escaped = *TEXTDATA
fieldstart
| TEXTDATA
| fieldend
| extract_non_escaped;
# Parse fields and records
def field: # ABNF: field = (escaped / non-escaped)
first(escaped, non_escaped);
#def record: field | last(plus(COMMA | field));
def record: # ABNF: record = field *(COMMA field)
# This one is complicated because we need to handle
# .wantmore == true. It can probably be simplified though.
def r:
if .wantmore then .wantmore = false else . end
| field
| if .wantmore then .
else
. as $i
| .r += [.u]
| . as $i
| ((.start=-1 | .end=-1 | COMMA | r) // $i)
end;
r;
# Collect record fields
record ;
foreach stream as $str (
init("");
if .wantmore then # Append new input to prev and restart parse
.s += $str
| .o = .start
else init($str) end # Reset state for new parse
| csv(.s; .s|length); # Parse
select(.wantmore|not).r # Output complete parses
);
def fromcsv: fromcsv(.);
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")'
["a\n\"b\"","c"]
["a\n\"b\"","c"]
["a","b c","d"]
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")|@csv'
"a
""b""","c"
"a
""b""","c"
"a","b c","d"
$
@fadado - Thank you so much for your inspiration.
Over at https://gist.github.com/pkoppstein/bbbbdf7489c8c515680beb1c75fa59f2 I've posted a parser that is similar to @fadado's but with an emphasis on efficiency (see "Performance" below), some error reporting (record number and context), and the "proper" handling of various edge cases. The parser is inspired by published PEG grammars, and can largely be viewed as a PEG parser.
The main filters are fromcsv and fromcsvfile: the former handles raw NEWLINEs within quoted fields only on a per-string basis. If a CSV file has quoted fields with embedded raw NEWLINEs, then for it to be processed properly, it must either be "slurped" or fed to fromcsvfile.
Some other noteworthy points:
- The parser uses jq's support for regular expressions, for speed and simplicity.
- Various complexities of CSV are handled, including:
  - CRLF and CR are allowed as end-of-record markers;
  - empty fields can be represented by adjacent double quotes, e.g. a,"","""",d => ["a","","","d"]
  - lines such as a,,,b and ,,, are handled
  - a, ,b is interpreted as a," ",b
  - quoted fields can contain raw newlines
- Spaces around quoted fields are ignored.
Examples:
Invocation: jq -nRc 'include "fromcsvfile"; fromcsvfile' <<< TEXT
- Input need not be rectangular:
  $'a,b\nc' #=> ["a","b"] ["c"]
- Null fields can be specified in several ways:
  $'a,"",b,,c\n,,' #=> ["a","","b","","c"] ["","",""]
- Empty lines are ignored:
  $'a,b\n\nc,d' #=> ["a","b"] ["c","d"]
- Edge cases:
  a #=> ["a"]
  $'a\n' #=> ["a"]
Performance:
There is a sample CSV file at http://support.spatialkey.com/spatialkey-sample-csv-data/
called Sacramentorealestatetransactions.csv
It uses '\r' as the line delimiter so I ran tr '\r' '\n' to convert it for testing purposes here.
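That is, something along the lines of:
$ tr '\r' '\n' < Sacramentorealestatetransactions.csv > tmp.csv && mv tmp.csv Sacramentorealestatetransactions.csv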
IN=Sacramentorealestatetransactions.csv
/usr/bin/time -lp jq -nRc 'include "fromcsvfile"; fromcsvfile' $IN > OUT
user 0.77
sys 0.00
2473984 maximum resident set size
/usr/bin/time -lp jq -Rc 'include "fromcsvfile"; fromcsv' $IN > OUT
user 0.73
sys 0.00
2322432 maximum resident set size
# @nicowilliams' parser as posted earlier in this thread
/usr/bin/time -lp jq -nRc 'include "nw-fromcsv"; fromcsv(inputs)' $IN > OUT
user 1.33
sys 0.00
2379776 maximum resident set size