
Debian Control Format (dcf)

Open ggrothendieck opened this issue 7 months ago • 6 comments

Suppose we have a file desc.dcf with records separated by empty lines and continued lines starting with whitespace.

Package: zoo
Version: 1.8-12
Date: 2023-04-11
Title: S3 Infrastructure for Regular and Irregular Time Series (Z's
        Ordered Observations)

Package: sqldf
Version: 0.4-11
Date: 2017-06-23
Title: Manipulate R Data Frames Using SQL
Author: G. Grothendieck <[email protected]>

There are two problems:

  • we can have multiline fields with continuation lines denoted by initial whitespace
  • the records can have different fields

R has read.dcf/write.dcf to handle this type of file but I wondered if it were possible in Miller.

I have made some partial progress. If we remove the continuation lines, the command below seems to work, but that leaves the first problem above unsolved, and if we use --ocsv in place of --ojson we hit the second problem too.

grep -Ev "^\s+\S" desc.dcf | mlr --ixtab --ojson --ips=":" cat

The desired result is CSV output that keeps the multiline fields and includes every field.

ggrothendieck, May 10 '25 14:05
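
For reference, the two problems described above (continuation lines, and records with different fields) can be sketched in a few lines of Python. This is only an illustration of the parsing rules as stated in the question, not Miller or R code. The names parse_dcf and to_csv are hypothetical, and the Author value in the sample is shortened to the name only.

```python
import csv
import io

def parse_dcf(text):
    """Parse DCF text into a list of dicts, one per record.

    Blank lines separate records; a line that starts with whitespace
    continues the value of the previous field.
    """
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():                      # blank line ends a record
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key is not None:
            # continuation line: append to the previous field's value
            current[last_key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records

def to_csv(records):
    """Write records as CSV; the header is the union of all field names."""
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)                     # missing fields become ""
    return out.getvalue()

# Sample taken from the question (Author shortened to the name only).
SAMPLE = """\
Package: zoo
Version: 1.8-12
Date: 2023-04-11
Title: S3 Infrastructure for Regular and Irregular Time Series (Z's
        Ordered Observations)

Package: sqldf
Version: 0.4-11
Date: 2017-06-23
Title: Manipulate R Data Frames Using SQL
Author: G. Grothendieck
"""

if __name__ == "__main__":
    print(to_csv(parse_dcf(SAMPLE)), end="")
```

The header is collected in first-seen order across all records, so a field that first appears in a later record (Author here) still gets its own column, and earlier records get an empty cell for it.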

Hi, running

sed ':a;N;$!ba;s/\n \+/~/g' input.txt | mlr --ips ": " --ixtab --ocsv  put '$Title=gsub($Title,"~","\n")'

you get a CSV with multiline fields

Package,Version,Date,Title
zoo,1.8-12,2023-04-11,"S3 Infrastructure for Regular and Irregular Time Series (Z's
Ordered Observations)"
sqldf,0.4-11,2017-06-23,Manipulate R Data Frames Using SQL,G. Grothendieck <[email protected]>

aborruso, May 10 '25 16:05

Thanks. The Author field header is missing, and in general there can be many fields that did not appear in previous records. It is also not generic, because one must know in advance that the Title field was multiline, and the actual data has more fields. The sed is a bit complex and I was hoping for a straightforward mlr solution, but this is certainly closer than what I had.

ggrothendieck, May 10 '25 17:05

One other feature that read.dcf supports is that it allows multiple instances of the same field name in a record. If the all= argument of read.dcf is TRUE then all instances are included so that the column becomes a list of character vectors; otherwise, only the last instance is used in records where they occur multiple times.

A: a
B: 1
B: 2

A: b
B: 3

A: c
B: 4
B: 5
B: 6

I assume that with all=TRUE the above would result in this CSV file.

A,B
a,"1,2"
b,3
c,"4,5,6"

ggrothendieck, May 10 '25 17:05
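
The two behaviours described for the all= argument can be mimicked with a small Python sketch. It is not R's implementation: parse_dcf_multi, collapse, and to_csv are hypothetical names, continuation lines are left out to keep it short, and joining duplicates with a comma is just the representation assumed in the comment above.

```python
import csv
import io

def parse_dcf_multi(text):
    """Parse DCF text; each record maps a field name to ALL of its values."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():                      # blank line ends a record
            if current:
                records.append(current)
            current = {}
        else:
            key, _, value = line.partition(":")
            current.setdefault(key.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records

def collapse(records, keep_all=True, sep=","):
    """keep_all=True joins duplicates with sep; False keeps only the last."""
    return [
        {k: (sep.join(v) if keep_all else v[-1]) for k, v in rec.items()}
        for rec in records
    ]

def to_csv(rows, fields):
    """Write rows as CSV text with the given header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# The example from the comment above.
SAMPLE = """\
A: a
B: 1
B: 2

A: b
B: 3

A: c
B: 4
B: 5
B: 6
"""

if __name__ == "__main__":
    recs = parse_dcf_multi(SAMPLE)
    print(to_csv(collapse(recs, keep_all=True), ["A", "B"]), end="")
    print(to_csv(collapse(recs, keep_all=False), ["A", "B"]), end="")
```

With keep_all=True the joined values contain commas, so the CSV writer quotes them; with keep_all=False each record keeps only its last B.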

@ggrothendieck this is awesome!

XTAB format isn't aware of continuation lines (although this is a nice feature).

One idea is making a DCF format, which is much like XTAB (and could reuse much of the same code) but (a) defaults to a PS of ": ", and (b) respects continuation lines.

johnkerl, May 10 '25 19:05

@johnkerl, That would certainly be desirable. As far as I know there are no other command line utilities that support dcf so Miller would be the only one. Both the ability to read and to write dcf would be nice as R can read and write that format and can do so without needing any addon packages.

ggrothendieck, May 10 '25 21:05
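
For the writing direction, a rough Python sketch of what a DCF writer would need to do: emit "name: value" pairs, indent the extra lines of multiline values as continuation lines, and separate records with a blank line. The name write_dcf and the eight-space indent are assumptions for illustration, and values containing empty lines are not handled.

```python
def write_dcf(records, indent="        "):
    """Serialize a list of dicts to DCF text.

    Each field becomes "name: value"; extra lines of a multiline value
    are written as continuation lines starting with `indent`.
    Records are separated by one blank line.
    """
    blocks = []
    for rec in records:
        lines = []
        for key, value in rec.items():
            first, *rest = str(value).split("\n")
            lines.append(f"{key}: {first}")
            lines.extend(indent + cont for cont in rest)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

if __name__ == "__main__":
    records = [
        {"Package": "zoo",
         "Title": "S3 Infrastructure for Regular and Irregular Time Series (Z's\n"
                  "Ordered Observations)"},
        {"Package": "sqldf", "Version": "0.4-11"},
    ]
    print(write_dcf(records), end="")
```

Records may carry different sets of fields; each record simply writes the fields it has.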

One additional feature not shown by the previous example is that dcf format can have duplicate names in a record. Different names can have different numbers of duplicates in the same record and different records do not necessarily have the same sets of names. For example

a: 123
b: 6
c: 61
b: 7
c: 62

a: 456
b: 16
b: 17

In R there is an all= argument to the read.dcf function. If it is FALSE then only the last of each set of duplicates is used, whereas if it is TRUE then all duplicates are used. In the all=TRUE case duplicate names result in an R list column, which is a list of vectors (which do not have to have the same lengths).

There is some question about how to represent this in CSV. Any of the following would be possible. Ideally the user could choose the representation; alternatively Miller would pick one, as long as it were possible to transform from one to another.

a,b,c
123,"6,7","61,62"
456,"16,17",

or

a,b.1,b.2,c.1,c.2
123,6,7,61,62
456,16,17,,

or

a,b,c
123,6,61
123,7,62
456,16,
456,17,

In JSON the duplicates could be represented as a vector of duplicates for each name since normally JSON does not have duplicate names.

ggrothendieck, May 15 '25 15:05
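
To make the three CSV layouts and the JSON form concrete, here is a Python sketch that derives each of them from the same parsed data. It starts from a hand-written list-of-values structure for the example above rather than parsing the text, and the function names (joined, wide, long_rows, as_json) are hypothetical. It shows only that the representations can be converted from one common form, not how Miller would do it.

```python
import json

# Parsed form of the example above: each name maps to all of its values.
RECORDS = [
    {"a": ["123"], "b": ["6", "7"], "c": ["61", "62"]},
    {"a": ["456"], "b": ["16", "17"]},
]

def field_names(records):
    """Union of field names in first-seen order."""
    names = []
    for rec in records:
        for name in rec:
            if name not in names:
                names.append(name)
    return names

def joined(records, sep=","):
    """One column per name; duplicates joined into a single cell."""
    names = field_names(records)
    return [{n: sep.join(rec.get(n, [])) for n in names} for rec in records]

def wide(records):
    """One column per duplicate (b.1, b.2, ...) when a name repeats."""
    names = field_names(records)
    width = {n: max(len(rec.get(n, [])) for rec in records) for n in names}
    rows = []
    for rec in records:
        row = {}
        for n in names:
            vals = rec.get(n, [])
            if width[n] == 1:
                row[n] = vals[0] if vals else ""
            else:
                for i in range(width[n]):
                    row[f"{n}.{i + 1}"] = vals[i] if i < len(vals) else ""
        rows.append(row)
    return rows

def long_rows(records):
    """One row per duplicate; single-valued fields repeat on every row."""
    names = field_names(records)
    rows = []
    for rec in records:
        depth = max(len(v) for v in rec.values())
        for i in range(depth):
            row = {}
            for n in names:
                vals = rec.get(n, [])
                if len(vals) == 1:
                    row[n] = vals[0]
                else:
                    row[n] = vals[i] if i < len(vals) else ""
            rows.append(row)
    return rows

def as_json(records):
    """JSON form: every name maps to an array of its values."""
    return json.dumps(records)

if __name__ == "__main__":
    for convert in (joined, wide, long_rows):
        print(convert.__name__, convert(RECORDS))
    print(as_json(RECORDS))
```

Keeping every value as a list internally is what makes the conversions straightforward: each CSV layout is a different flattening of the same structure, and the JSON form is the structure itself.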