Debian Control Format (dcf)
Suppose we have a file desc.dcf with records separated by empty lines and continued lines starting with whitespace.
Package: zoo
Version: 1.8-12
Date: 2023-04-11
Title: S3 Infrastructure for Regular and Irregular Time Series (Z's
  Ordered Observations)

Package: sqldf
Version: 0.4-11
Date: 2017-06-23
Title: Manipulate R Data Frames Using SQL
Author: G. Grothendieck <[email protected]>
There are two problems:
- we can have multiline fields with continuation lines denoted by initial whitespace
- the records can have different fields
R has read.dcf/write.dcf to handle this type of file but I wondered if it were possible in Miller.
I have made some partial progress. If we remove the continuation lines, the command below seems to work, but that leaves the first problem above unsolved, and if we use --ocsv in place of --ojson we hit the second problem too.
grep -Ev "^\s+\S" desc.dcf | mlr --ixtab --ojson --ips=":" cat
The desired result is CSV output that preserves multiline fields and includes every field that appears in any record.
Hi, running
sed ':a;N;$!ba;s/\n \+/~/g' input.txt | mlr --ips ": " --ixtab --ocsv put '$Title=gsub($Title,"~","\n")'
you get a CSV with multiline fields
Package,Version,Date,Title
zoo,1.8-12,2023-04-11,"S3 Infrastructure for Regular and Irregular Time Series (Z's
Ordered Observations)"
sqldf,0.4-11,2017-06-23,Manipulate R Data Frames Using SQL,G. Grothendieck <[email protected]>
Thanks. The Author field header is missing, and in general there can be many fields that did not appear in previous records. It is also not generic, because one must know in advance that the Title field is multiline, and the actual data has more fields. The sed is a bit complex and I was hoping for a straightforward mlr solution, but this is certainly closer than what I had.
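For comparison, the generic behaviour asked for here (any field may be multiline, and the CSV header is the union of all field names) can be sketched outside Miller in a few lines of Python. This is only an illustration under stated assumptions, not a Miller feature: `parse_dcf` and `to_csv` are hypothetical names, the sample is the data from the question with the Author value shortened, continuation lines are joined with a newline, and a repeated field name simply keeps its last value.

```python
import csv
import io


def parse_dcf(text):
    """Parse DCF text into a list of dicts (a repeated field keeps its last value)."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: end of the current record.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0] in " \t" and last_key is not None:
            # Continuation line: append to the previous field's value.
            current[last_key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


def to_csv(records):
    """Render records as CSV; the header is the union of all field names."""
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Sample from the question; the Author value is shortened for illustration.
SAMPLE = """\
Package: zoo
Version: 1.8-12
Date: 2023-04-11
Title: S3 Infrastructure for Regular and Irregular Time Series (Z's
  Ordered Observations)

Package: sqldf
Version: 0.4-11
Date: 2017-06-23
Title: Manipulate R Data Frames Using SQL
Author: G. Grothendieck
"""

print(to_csv(parse_dcf(SAMPLE)), end="")
```

Collecting the header in a first pass over all records is what avoids the missing Author column; a streaming tool would have to either buffer the records or be told the field names up front.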
One other feature that read.dcf supports is that it allows multiple instances of the same field name in a record. If the all= argument of read.dcf is TRUE then all instances are included so that the column becomes a list of character vectors; otherwise, only the last instance is used in records where they occur multiple times.
A: a
B: 1
B: 2

A: b
B: 3

A: c
B: 4
B: 5
B: 6
I assume that, for all=TRUE, the above would result in this CSV file.
A,B
a,"1,2"
b,3
c,"4,5,6"
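To make the assumed all=TRUE behaviour concrete, here is a minimal Python sketch that collects every instance of a repeated field and joins the values with commas when writing CSV. It is an illustration only: `parse_dcf_all` is a hypothetical name, continuation lines are left out for brevity, and the comma-joined cell is just the representation assumed above, not something read.dcf or Miller produces.

```python
import csv
import io


def parse_dcf_all(text):
    """Parse DCF text; each field maps to the list of all its values."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # Blank line: end of the current record.
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(":")
        # Repeated names accumulate instead of overwriting.
        current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


SAMPLE = "A: a\nB: 1\nB: 2\n\nA: b\nB: 3\n\nA: c\nB: 4\nB: 5\nB: 6\n"

records = parse_dcf_all(SAMPLE)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["A", "B"])
for rec in records:
    # Join duplicates with commas; the csv module quotes such cells.
    writer.writerow([",".join(rec.get(name, [])) for name in ["A", "B"]])
print(out.getvalue(), end="")
```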
@ggrothendieck this is awesome!
XTAB format isn't aware of continuation lines (although this is a nice feature).
One idea is to add a DCF format, which would be much like XTAB (and could reuse much of the same code) but would (a) default to a pair separator (PS) of ": " and (b) respect continuation lines.
@johnkerl, that would certainly be desirable. As far as I know, no other command-line utilities support dcf, so Miller would be the only one. Being able to both read and write dcf would be nice, since R can read and write that format without needing any add-on packages.
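On the writing side, the format is simple enough that a rough sketch fits in a few lines of Python. The function name `write_dcf` and the one-space indent are assumptions for illustration, the sample record is abbreviated, and the sketch does not handle empty lines inside a multiline value.

```python
def write_dcf(records, indent=" "):
    """Render a list of dicts as DCF text with indented continuation lines."""
    blocks = []
    for rec in records:
        lines = []
        for key, value in rec.items():
            first, *rest = str(value).split("\n")
            lines.append(f"{key}: {first}")
            # Every further line of the value becomes a continuation line.
            lines.extend(indent + cont for cont in rest)
        blocks.append("\n".join(lines))
    # Records are separated by one empty line.
    return "\n\n".join(blocks) + "\n"


# Abbreviated, illustrative records (not the full DESCRIPTION data).
records = [
    {"Package": "zoo", "Title": "S3 Infrastructure (Z's\nOrdered Observations)"},
    {"Package": "sqldf", "Author": "G. Grothendieck"},
]
print(write_dcf(records), end="")
```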
One additional feature not shown by the previous example is that dcf format can have duplicate names in a record. Different names can have different numbers of duplicates in the same record, and different records do not necessarily have the same sets of names. For example:
a: 123
b: 6
c: 61
b: 7
c: 62

a: 456
b: 16
b: 17
In R, the read.dcf function has an all= argument. If it is FALSE, only the last of each set of duplicates is used; if it is TRUE, all duplicates are used. In the all=TRUE case, duplicate names result in an R list column, that is, a list of vectors (which do not have to have the same lengths).
It is an open question how best to represent this. In CSV it could take any of the forms below. Ideally the user could choose the representation, or the tool would choose one and make it possible to transform from one to another.
a,b,c
123,"6,7","61,62"
456,"16,17",
or
a,b.1,b.2,c.1,c.2
123,6,7,61,62
456,16,17,,
or
a,b,c
123,6,61
123,7,62
456,16,
456,17,
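The three CSV shapes above can all be derived from the same in-memory structure, which is one way to see that they are interconvertible. The Python sketch below starts from the example already parsed into lists of values per name; the function names `joined`, `wide` and `long_form` are hypothetical, and recycling a single value across rows in the long form is an assumption read off the third table.

```python
import csv
import io

# The example above, with every name mapped to the list of its values.
RECORDS = [
    {"a": ["123"], "b": ["6", "7"], "c": ["61", "62"]},
    {"a": ["456"], "b": ["16", "17"]},
]
NAMES = ["a", "b", "c"]


def joined(records, names):
    """One row per record; duplicates joined with commas in one cell."""
    rows = [[",".join(rec.get(n, [])) for n in names] for rec in records]
    return [list(names)] + rows


def wide(records, names):
    """One row per record; duplicates spread over columns name.1, name.2, ..."""
    width = {n: max(len(rec.get(n, [])) for rec in records) for n in names}
    header = []
    for n in names:
        header += [n] if width[n] <= 1 else [f"{n}.{i + 1}" for i in range(width[n])]
    rows = []
    for rec in records:
        row = []
        for n in names:
            vals = rec.get(n, [])
            # Pad with empty cells up to the widest record for this name.
            row += vals + [""] * (max(width[n], 1) - len(vals))
        rows.append(row)
    return [header] + rows


def long_form(records, names):
    """One row per duplicate index; single values are repeated on each row."""
    rows = []
    for rec in records:
        depth = max(len(rec.get(n, [])) for n in names)
        for i in range(depth):
            row = []
            for n in names:
                vals = rec.get(n, [])
                if len(vals) == 1:
                    row.append(vals[0])
                else:
                    row.append(vals[i] if i < len(vals) else "")
            rows.append(row)
    return [list(names)] + rows


def render(table):
    """Format a list of rows as CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(table)
    return out.getvalue()


for shape in (joined, wide, long_form):
    print(render(shape(RECORDS, NAMES)))
```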
In JSON the duplicates could be represented as a vector of duplicates for each name since normally JSON does not have duplicate names.
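As a small illustration of the JSON option, the Python sketch below serializes the same example with an array of values per name. The helper name `to_json` and its `always_list` switch are hypothetical; whether single values should also be wrapped in an array is a design choice the thread leaves open, so both variants are shown.

```python
import json

# The example above, with every name mapped to the list of its values.
RECORDS = [
    {"a": ["123"], "b": ["6", "7"], "c": ["61", "62"]},
    {"a": ["456"], "b": ["16", "17"]},
]


def to_json(records, always_list=True):
    """Serialize records; optionally unwrap names that have a single value."""
    if always_list:
        return json.dumps(records)
    unwrapped = [
        {key: (vals[0] if len(vals) == 1 else vals) for key, vals in rec.items()}
        for rec in records
    ]
    return json.dumps(unwrapped)


print(to_json(RECORDS))
print(to_json(RECORDS, always_list=False))
```

Always using arrays gives every record the same value type per name, which is easier to process downstream; unwrapping single values is closer to what a reader of plain key-value data might expect.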