
Run a join, after renaming a field and uncompressing one file

Open aborruso opened this issue 5 years ago • 11 comments

Hi, I have two files. One is 01.csv:

key
7777303178
772326718D

The other is a zip (02.zip), and it contains:

KEY,field_01
7777303178,a
772326718D,b
589893,c

I want to join these, but first I must uncompress the second file and rename the key field of the first one, because the correct name is uppercase (KEY). I would like to do all of this in one Miller command. Is that possible?

If I run

mlr --prepipe-gunzip --csv join -j KEY -f ./02.zip then rename key,KEY ./01.csv

I get

gzip: stdin: not in gzip format
mlr: Header/data length mismatch (1 != 2) at file "./02.zip" line 2.

What's my error?

Thank you

aborruso avatar Sep 28 '20 20:09 aborruso

gzip and zip are different formats which do different things.

gzip compresses a single file. tar collects multiple files into one archive. Doing both (.tar.gz or .tgz) is a common idiom. zip is conceptually like .tgz, except that the file format is different. The key point is that zip not only compresses, but compresses a collection of files, not just the bytes of one file.

The 02.zip file is an archive containing a length-1 list of files, the element of which is 02.csv.

If you try gunzip < 02.zip (which is what you're asking Miller to do with 02.zip) then you'll see the unknown-format error message.
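The difference between the two formats is visible in the first bytes of a file. The following is a minimal sketch, not part of Miller or gzip: `detect_format` is a hypothetical helper, and `fake.zip` holds only the 4-byte zip signature rather than a real archive. It assumes `head -c`, `od`, and `gzip` are available.

```shell
#!/bin/sh
# Identify a file's format from its leading "magic" bytes.
# detect_format is a hypothetical helper used only for illustration.
detect_format() {
    # First two bytes as lowercase hex, e.g. "1f8b" or "504b".
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        1f8b) echo gzip ;;     # gzip stream
        504b) echo zip ;;      # "PK", start of a zip archive
        *)    echo unknown ;;
    esac
}

workdir=$(mktemp -d)

# A plain CSV and a real gzip stream made from it.
printf 'KEY,field_01\n7777303178,a\n' > "$workdir/02.csv"
gzip -c "$workdir/02.csv" > "$workdir/02.csv.gz"

# Only the zip local-file-header signature: enough for this check,
# but NOT a valid archive.
printf 'PK\003\004' > "$workdir/fake.zip"

detect_format "$workdir/02.csv.gz"   # prints: gzip
detect_format "$workdir/fake.zip"    # prints: zip
detect_format "$workdir/02.csv"      # prints: unknown
```

Running the same check on 02.zip before passing it to `--prepipe-gunzip` shows which decompressor the file actually calls for.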

johnkerl avatar Sep 29 '20 01:09 johnkerl

@johnkerl if I run gunzip < 02.zip I get no error:

KEY,field_01
7777303178,a
772326718D,b
589893,c

So Miller should be able to open it. Am I wrong?

Thank you


aborruso avatar Sep 29 '20 06:09 aborruso

Moreover, I get the same error when using a real gzip file (02.csv.gz):

gzip: stdin: not in gzip format

aborruso avatar Sep 29 '20 10:09 aborruso

What platform are you on? Can you run uname -a and gzip --version and send the output, please?

johnkerl avatar Sep 29 '20 13:09 johnkerl

Hi @johnkerl, I use Debian via Windows Subsystem for Linux 2, which runs a full Linux kernel.

➜ cat /etc/issue
Debian GNU/Linux bullseye/sid
➜  gzip --version
gzip 1.10
Copyright (C) 2018 Free Software Foundation, Inc.
Copyright (C) 1993 Jean-loup Gailly.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
➜ uname -a
Linux DESKTOP-7NVNDNF 4.19.104-microsoft-standard #1 SMP Wed Feb 19 06:37:35 UTC 2020 x86_64 GNU/Linux

Thank you

aborruso avatar Sep 29 '20 14:09 aborruso

And I have no problem running mlr --csv --prepipe-gunzip filter '${field_01}=="a"' 02.csv.gz or mlr --csv --prepipe-gunzip filter '${field_01}=="a"' 02.zip. In both cases I get:

KEY,field_01
7777303178,a

But I'm not able to use it in that join.

aborruso avatar Sep 29 '20 14:09 aborruso

@johnkerl do you think it does not work because I use a "strange" OS?

Thank you

aborruso avatar Oct 02 '20 09:10 aborruso

Sorry for the delay.

You definitely had different symptoms than I did, with gunzip < 02.zip working for you and not for me. Maybe I'm the one with the "strange" OS (macOS); I don't know.

I think what you want is not

mlr --prepipe-gunzip --csv join -j KEY -f ./02.csv.gz then rename key,KEY ./01.csv

but rather

mlr --csv join --prepipe-gunzip -j KEY -f ./02.csv.gz then rename key,KEY ./01.csv

since the gzipped file is the join-file (02.csv.gz) and not the main-file (01.csv) -- so the join subcommand, not the main mlr command, should have --prepipe-gunzip.

johnkerl avatar Oct 02 '20 14:10 johnkerl

... but there is still something else going on.

To separate out the platform-difference issues I created 02.csv and 02.csv.gz.

This works:

$ mlr --csv join  -l KEY -r key -j key -f ./02.csv then rename key,KEY ./01.csv
KEY,field_01
7777303178,a
772326718D,b

but this produces no output:

$ mlr --csv join --prepipe-gunzip -l KEY -r key -j key -f ./02.csv.gz then rename key,KEY ./01.csv
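Until the cause is understood, one possible workaround is to decompress the join file outside Miller and then use the uncompressed form of the join shown above. This is only a sketch and it gives up the single-command goal. The temporary directory and the sample data (copied from the files in this thread) are assumptions for illustration, and the join step is skipped if mlr is not on the PATH.

```shell
#!/bin/sh
# Workaround sketch: decompress first, then join plain CSV files.
workdir=$(mktemp -d)

# Sample inputs mirroring the files described in this thread.
printf 'key\n7777303178\n772326718D\n' > "$workdir/01.csv"
printf 'KEY,field_01\n7777303178,a\n772326718D,b\n589893,c\n' \
    | gzip -c > "$workdir/02.csv.gz"

# Step 1: decompress the join file to a plain CSV.
gunzip -c "$workdir/02.csv.gz" > "$workdir/02.csv"

# Step 2: run the join in the form that works on uncompressed input.
if command -v mlr >/dev/null 2>&1; then
    mlr --csv join -l KEY -r key -j key -f "$workdir/02.csv" \
        then rename key,KEY "$workdir/01.csv"
else
    echo "mlr not found; skipping the join step" >&2
fi
```

For a real zip archive the first step would need a zip-aware tool instead of gunzip, since only some gzip builds accept zip input, as the differing results in this thread show.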

johnkerl avatar Oct 02 '20 14:10 johnkerl

(I will need to dig into this more later. Sorry again for the delay.)

johnkerl avatar Oct 02 '20 14:10 johnkerl

but this produces no output:

$ mlr --csv join --prepipe-gunzip -l KEY -r key -j key -f ./02.csv.gz then rename key,KEY ./01.csv

The same for me, thank you very much!!

aborruso avatar Oct 02 '20 14:10 aborruso