miller icon indicating copy to clipboard operation
miller copied to clipboard

Track IFS with file-format flags at CLI parse (was: file format flags are not overriding mlrrc)

Open holmescharles opened this issue 2 years ago • 8 comments

I am using miller 6.8.0 on OSX. Here is my mlrrc:

pprint

When I run the following command, I get the following error:

❯ mlr --csv cat << EOF
a,b c
1,2
3,4
EOF

mlr: mlr: CSV header/data length mismatch 2 != 1 at filename (stdin) row 2.

As you can see, miller is defaulting to pprint even though I passed "--csv". It is interpreting spaces as the delimiter, not the commas.

Next, I change my mlrrc to:

#pprint

Now I can run the previous command with no issue:

❯ mlr --csv cat << EOF
a,b c
1,2
3,4
EOF

a,b c
1,2
3,4

According to the documentation, miller is supposed to override any file formats in the mlrrc when options are passed in the terminal, but this is not happening.

holmescharles avatar Jul 14 '23 12:07 holmescharles

If I run

mlr --csv cat << EOF
a,b c
1,2
3,4
EOF

I have

a,b c
1,2
3,4

But I'm using mlr 6.7.0

aborruso avatar Jul 14 '23 14:07 aborruso

What is your mlrrc?

holmescharles avatar Jul 14 '23 14:07 holmescharles

And I have the same using 6.8.

What is your mlrrc?

I do not use it

aborruso avatar Jul 14 '23 14:07 aborruso

I do not use it

Ok, then that explains why you don't have the same error. Now try setting your mlrrc to match mine.

holmescharles avatar Jul 14 '23 14:07 holmescharles

Ok, then that explains why you don't have the same error. Now try setting your mlrrc to match mine.

the same error

aborruso avatar Jul 14 '23 15:07 aborruso

Hi @holmescharles and thanks for the issue!

I looked into this and it's not quite what any of us thought.

And unfortunately I think you may not like the answer! :) And neither do I ... 😉

When pprint is in the .mlrrc file and --csv is on the command line, the input-file format is correctly found to be PPRINT from the .mlrrc, and then is correctly changed to be CSV. That isn't broken.

But what happens is --pprint, when first encountered, sets IFS from not-specified-yet, to one or more spaces. Then when --csv is encountered, the IFS is already set, and so doesn't get changed to be comma.

This isn't a .mlrrc issue as it first appeared -- rather, you get the same result with

$ mlr --pprint --csv cat << EOF
a,b c
1,2
3,4
EOF
mlr: mlr: CSV header/data length mismatch 2 != 1 at filename (stdin) row 2.

So: a fix could be that --csv should always re-write IFS after --pprint has set it.

This is a bigger issue and it goes back all the way to Miller 1 and involves the fact that these flags are overlapping. So --pprint and --csv set the input file format and IFS; and so on. And from the beginning I wanted people to do either

mlr --ifs ';' --csv ...

or

mlr --csv --ifs ';' ...

-- and Miller has always had that flexibility, and I don't want to take it away. I think it would break a lot of things for a lot of people, to change it now.

So ...... I think the best that can be done at present is this suggestion:

mlr --pprint --csv --ifs comma cat << EOF
a,b c
1,2
3,4
EOF

with the explicit --ifs comma.

I wish I had a better answer, but, at this point in time, and given the long-standing usage pattern above, I don't see a better option.

Later thought: I thought there may be a way to accommodate this but I'm still unsure. Namely:

  • Currently, if IFS is not set, --pprint sets input file format to PPRINT and IFS to one or more spaces.
  • Currently, if IFS is not set, --csv sets input file format to CSV and IFS to comma.
  • The flag for if IFS was set or not is here: https://github.com/johnkerl/miller/blob/6.8.0/internal/pkg/cli/option_types.go#L48-L54
  • What I could do is change it from "if IFS set" to "is IFS set for the last-specified input format"
  • This would still need handling for mlr --ifs ';' --csv ... since there is no last-specified input format when --ifs comes first on the command line ... 🤔

johnkerl avatar Aug 19 '23 18:08 johnkerl

Maybe I'm thinking about this incorrectly, but could you keep track of the give format flag and wait until the end to use that to set things like IFS etc? That way, if the user specifies something else, it waits until the last minute to assign unassigned variables? I'm not sure how things are implemented, but this could just be tracked in some new variable? Then, if you specify --pprint before --csv, it will completely ignore pprint.

Alternatively, pprint is just csv with "contiguous whitespace" as delimiters. For my own curiosity, how would you specify contiguous whitespace as IFS?

holmescharles avatar Aug 21 '23 16:08 holmescharles

Maybe I'm thinking about this incorrectly, but could you keep track of the give format flag and wait until the end to use that to set things like IFS etc?

Yes, you & I are talking about the same thing -- what you just wrote, and my bullet list above.

So --csv --ifs semicolon and --ifs semicolon --csv would do the same, but, if --pprint (or --xtab or whatever) appears after that, it would "reset".

pprint is just csv with "contiguous whitespace" as delimiters

Not quite. --ipprint is the same as --icsvlite --ifs space --repifs. And --opprint is not the same as --o anything else. This is because there's significant code reuse between the csv-lite and pprint readers, whereas the pprint writer has very much its own logic.

johnkerl avatar Aug 21 '23 17:08 johnkerl