miller
Run a join, after renaming a field and uncompressing one file
Hi,
I have two files. One is 01.csv:
key
7777303178
772326718D
The other is a zip (02.zip), and it contains:
KEY,field_01
7777303178,a
772326718D,b
589893,c
I want to join these, but first I have to uncompress the second file, and I need to rename the key field of the first to do the join: the right name is uppercase (KEY).
I would like to do all of this in one Miller command. Is that possible?
If I run
mlr --prepipe-gunzip --csv join -j KEY -f ./02.zip then rename key,KEY ./01.csv
I get
gzip: stdin: not in gzip format
mlr: Header/data length mismatch (1 != 2) at file "./02.zip" line 2.
What's my error?
Thank you
gzip and zip are different formats which do different things.
gzip is for compressing single files. tar is for collecting multiple files into one. Doing both (.tar.gz or .tgz) is a common idiom. zip is like .tgz conceptually, except the algorithm is different. The key point is that zip not only compresses, but compresses a collection of files, not just the bytes of one file.
The 02.zip file is an archive containing a length-1 list of files, the element of which is 02.csv.
If you try gunzip < 02.zip (which is what you're asking Miller to do with 02.zip) then you'll see the unknown-format error message.
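A quick way to check which format a file really uses, regardless of its extension, is to look at its leading bytes: gzip streams begin with the bytes 1f 8b, while zip archives begin with PK\x03\x04 (50 4b 03 04). The following is only an illustrative sketch; the function name detect_format is hypothetical and is not part of Miller or gzip.

```shell
# Hypothetical helper: print "gzip", "zip", or "other" based on the
# first four bytes (the "magic number") of the file given as $1.
detect_format() {
  # od prints the bytes as hex; tr strips the spaces and the newline.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    1f8b*)    echo gzip ;;   # gzip streams start with 1f 8b
    504b0304) echo zip ;;    # zip local file header: "PK" 03 04
    *)        echo other ;;
  esac
}
```

For example, `detect_format ./02.zip` should print zip for a regular zip archive, and `detect_format ./01.csv` should print other for a plain CSV file.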
@johnkerl if I run gunzip < 02.zip I get no error:
KEY,field_01
7777303178,a
772326718D,b
589893,c
Then Miller should open it. Am I wrong?
Thank you

What platform are you on? Can you do uname -a and gzip --version and send those please?
Hi @johnkerl, I use Debian via Windows Subsystem for Linux 2, which has a full Linux kernel.
➜ cat /etc/issue
Debian GNU/Linux bullseye/sid
➜ gzip --version
gzip 1.10
Copyright (C) 2018 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License <https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
➜ uname -a
Linux DESKTOP-7NVNDNF 4.19.104-microsoft-standard #1 SMP Wed Feb 19 06:37:35 UTC 2020 x86_64 GNU/Linux
Thank you
And I have no problem running mlr --csv --prepipe-gunzip filter '${field_01}=="a"' 02.csv.gz or mlr --csv --prepipe-gunzip filter '${field_01}=="a"' 02.zip. I always get:
KEY,field_01
7777303178,a
But I'm not able to use it in that join.
@johnkerl do you think it does not work because I use a "strange" OS?
Thank you
Sorry for the delay.
You definitely had different symptoms than me, with gunzip < 02.zip working for you and not for me. Maybe I'm the one with the "strange" OS (macOS); I don't know.
I think what you want is not
mlr --prepipe-gunzip --csv join -j KEY -f ./02.csv.gz then rename key,KEY ./01.csv
but rather
mlr --csv join --prepipe-gunzip -j KEY -f ./02.csv.gz then rename key,KEY ./01.csv
since the gzipped file is the join-file (02.csv.gz) and not the main-file (01.csv) -- so the join subcommand, not the main mlr command, should have --prepipe-gunzip.
... but there is still something else going on.
To separate out the platform-difference issues I created 02.csv and 02.csv.gz.
This works:
$ mlr --csv join -l KEY -r key -j key -f ./02.csv then rename key,KEY ./01.csv
KEY,field_01
7777303178,a
772326718D,b
but this produces no output:
$ mlr --csv join --prepipe-gunzip -l KEY -r key -j key -f ./02.csv.gz then rename key,KEY ./01.csv
(I will need to dig into this more later. Sorry again for the delay.)
but this produces no output:
$ mlr --csv join --prepipe-gunzip -l KEY -r key -j key -f ./02.csv.gz then rename key,KEY ./01.csv
The same for me, thank you very much!!
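Until the behaviour of --prepipe-gunzip inside join is understood, one possible workaround (not a single Miller command, and not something proposed in the thread) is to decompress the join file outside Miller and then run the join form reported above as working on plain CSV. The sketch below assumes gzip, unzip, and mktemp are installed and that a .zip archive holds a single CSV member; the function names decompress_to_stdout and join_with_decompressed are hypothetical.

```shell
# Hypothetical helper: write the decompressed contents of $1 to stdout,
# picking the tool from the file extension.
decompress_to_stdout() {
  case "$1" in
    *.gz)  gzip -dc "$1" ;;
    *.zip) unzip -p "$1" ;;   # assumes the archive holds one CSV member
    *)     cat "$1" ;;        # already plain text
  esac
}

# Hypothetical wrapper: decompress the join file ($1) to a temporary
# plain CSV, then join it with the main file ($2) using the command
# that the thread reports as working on uncompressed input.
join_with_decompressed() {
  join_file=$1
  main_file=$2
  tmp=$(mktemp) || return 1
  decompress_to_stdout "$join_file" > "$tmp" || { rm -f "$tmp"; return 1; }
  mlr --csv join -l KEY -r key -j key -f "$tmp" then rename key,KEY "$main_file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

It could be called as `join_with_decompressed ./02.csv.gz ./01.csv` or `join_with_decompressed ./02.zip ./01.csv`. Keeping decompression in a separate step also avoids depending on how a particular gzip build treats zip input.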