
Auto-format based on input-file extension

Open maxigit opened this issue 2 years ago • 3 comments

I've just discovered this tool which looks great. I often have to mess with csv files and this tool seems awesome.

However, I was wondering if there is an option (or a way) to guess the input format based on the file extension. For example, mydata.csv could default to --icsv, mydata.tsv to --itsv, etc.
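As a rough illustration of the idea, here is a minimal Python sketch of an extension-to-flag lookup. The table and the function name `guess_input_flag` are hypothetical and not part of Miller; only the flags themselves (--icsv, --itsv, --ijson) are existing Miller options.

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> Miller input-format flag.
EXTENSION_TO_FLAG = {
    ".csv": "--icsv",
    ".tsv": "--itsv",
    ".json": "--ijson",
}


def guess_input_flag(filename, default=None):
    """Return the input flag suggested by the file extension, or `default`."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_FLAG.get(suffix, default)
```

An unknown or missing extension returns `default`, so the caller can fall back to whatever format was requested explicitly.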

maxigit avatar Jan 31 '23 11:01 maxigit

@maxigit this could be done -- with two caveats.

One is the amount of work -- not saying this can't be done, but it makes things a little less trivial -- at present in the implementation code there's a type-specific reader (like RecordReaderCSV or RecordReaderJSON) instantiated outside the main loop over input files. Then that reader operates over all files. With the change you propose (which is a very good one, and one I've considered before), it would need a bit of a refactor to have the reader object instantiated once per file. Also note that currently if one input file is CSV, they all must be -- with your feature, we could do mlr ... foo.csv bar.json with various file types handled all in one run.
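To make the shape of that refactor concrete, here is a small Python sketch (not Miller's actual Go code) in which the reader is chosen inside the loop over input files rather than once outside it. The names `read_csv`, `read_json`, `READERS`, and `read_records` are made up for illustration.

```python
import csv
import json
from pathlib import Path


def read_csv(path):
    """Yield one dict per CSV data row, keyed by the header line."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle)


def read_json(path):
    """Yield records from a JSON file holding one object or a list of objects."""
    with open(path) as handle:
        data = json.load(handle)
    yield from (data if isinstance(data, list) else [data])


# Hypothetical registry: file extension -> reader function.
READERS = {".csv": read_csv, ".json": read_json}


def read_records(paths):
    """Stream records from all files, picking a reader per file."""
    for path in paths:
        # The reader is selected here, once per file, so mixed formats
        # can share one run. An unknown extension raises KeyError.
        reader = READERS[Path(path).suffix.lower()]
        yield from reader(path)
```

With this structure, something like `read_records(["foo.csv", "bar.json"])` would yield the records of both files in order.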

The other caveat is reading from standard input: e.g. mlr ... < somefile.csv or ... | mlr ... -- here there is no filename extension provided to the executable. I have a little worry that some users might feel slighted that autodetect doesn't work here. This problem is manageable with suitable caveats to the user.

In any case -- both problems are solvable, and this is a great idea :)

johnkerl avatar Feb 04 '23 04:02 johnkerl

The refactoring issue is at the same time a very good reason and a very bad one, but I understand that time is always the issue. However, being able to have input files with different formats seems like a good enhancement. I didn't realize that all input files had to share one format (I thought you could insert a --i... for each file, as is the case with join). I thought that a quick workaround could have been to just "transform" the command line so that foo.csv becomes --icsv foo.csv.

I understand the standard input argument, but I don't think it is an issue in practice. One option is to do nothing (current behavior); another is to try to guess the format by parsing the first line (gives good results 99% of the time) or even try different readers and see which one gives the best result. I'm sure this can wait and could be another feature, "guess format on content".
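A first-line guess could look something like the following Python sketch. The heuristic and the name `guess_format_from_first_line` are purely illustrative assumptions, not anything Miller implements, and the rules are deliberately naive.

```python
def guess_format_from_first_line(text):
    """Guess 'json', 'tsv', or 'csv' from the first non-blank line, else None."""
    first_line = next((line for line in text.splitlines() if line.strip()), "")
    stripped = first_line.lstrip()
    # JSON documents start with an object or an array.
    if stripped.startswith(("{", "[")):
        return "json"
    # Check tab before comma: TSV fields may legitimately contain commas.
    if "\t" in first_line:
        return "tsv"
    if "," in first_line:
        return "csv"
    return None
```

Returning `None` when nothing matches leaves room for the "do nothing" fallback, i.e. the current behavior.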

maxigit avatar Feb 04 '23 12:02 maxigit

[...] try to guess the format by parsing the first line (gives good results 99% of the time) or even try different readers and see which one gives the best result. I'm sure this can wait and could be another feature, "guess format on content".

For some other work in this area that I'm familiar with, see for example:

  • Tablib loops through its installed file formats one at a time, and runs a .detect() method on them to see if they want to load the file, string, or binary blob provided. This is basically what you suggested, and it works pretty well in most cases. (source code)
    • The order of evaluation matters here. Common formats need to be first in the list, in order to minimize the risk of false detection or slow performance.
  • csv.Sniffer in the Python standard library has some heuristics to detect CSV file format, and also the presence of headers. (source code).
    • Only handles csv/tsv. This is not perfect, and the delimiter detection is too aggressive unless you give it a restricted set of possibilities (which is what Tablib does), but the code has been in the Python standard library forever, and should be pretty battle tested by now.
  • lesspipe.sh runs file -L -s -b --mime on the file rather than depending on the file extension, if possible. (source code)
    • Using file magic allows it to autodetect e.g. CSV or JSON even if the file extension is *.txt, for instance, or there's no extension at all.
    • Note that file is not 100% accurate, so this is not entirely reliable.
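The first two approaches can be combined in a short Python sketch: an ordered list of detectors tried one at a time, with csv.Sniffer restricted to a single candidate delimiter per detector. This only illustrates the pattern and is not Tablib's actual code; `detect_json`, `detect_delimited`, `DETECTORS`, and `detect_format` are hypothetical names, while `csv.Sniffer().sniff(sample, delimiters=...)` is the standard-library call.

```python
import csv
import json


def detect_json(sample):
    """Return True if the sample parses as JSON."""
    try:
        json.loads(sample)
        return True
    except ValueError:
        return False


def detect_delimited(sample, delimiter):
    """Return True if csv.Sniffer accepts `delimiter` for this sample."""
    try:
        # Restricting `delimiters` keeps the sniffer from picking odd characters.
        dialect = csv.Sniffer().sniff(sample, delimiters=delimiter)
    except csv.Error:
        return False
    return dialect.delimiter == delimiter


# Order matters: the first detector that accepts the sample wins.
DETECTORS = [
    ("json", detect_json),
    ("csv", lambda sample: detect_delimited(sample, ",")),
    ("tsv", lambda sample: detect_delimited(sample, "\t")),
]


def detect_format(sample):
    """Return the name of the first matching format, or None."""
    for name, detector in DETECTORS:
        if detector(sample):
            return name
    return None
```

JSON is tried first here because JSON text usually contains commas and could otherwise be mistaken for CSV; a different ordering may suit other mixes of input.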

harkabeeparolus avatar Jan 25 '24 06:01 harkabeeparolus