miller icon indicating copy to clipboard operation
miller copied to clipboard

Documentation on joins. How do I not omit results non-matches?

Open alexhallam opened this issue 4 years ago • 9 comments

Hi. I have been trying to get mlr to do various joins. I feel like the documentation can provide some examples as a helper to users. Here is a reproducible example of what I am trying to accomplish. I am not sure how to get the results I am looking for.

Data

cat <<EOF >data/x.csv
a,b,c
a,t,1
b,u,2
c,v,3
EOF
cat data/x.csv | tv
cat <<EOF >data/y.csv
e,f,g
a,t,3
b,u,2
d,w,1
EOF

Left join

Try:

mlr --icsv --ocsv join -j a -r e -f data/x.csv data/y.csv

Expected

a,b,c,e,f,g
a,t,1,a,t,3
b,u,2,b,u,2
c,v,3,,,

Actual

a,b,c,f,g
a,t,1,t,3
b,u,2,u,2

Right join

Try:

mlr --icsv --ocsv join -j e -r a -f data/y.csv data/x.csv

Expected

e,f,g,a,b,c
a,t,3,a,t,1
b,u,2,b,u,2
d,w,1,,,

Actual

e,f,g,b,c
a,t,3,t,1
b,u,2,u,2

Inner join

Use what I did for left join

Full Join

idk

alexhallam avatar Sep 07 '21 22:09 alexhallam

Hi @alexhallam ! The way Miller handles non-matches is with the --ul and --ur flags to join.

That said, the semantics of how Miller does joins aren't well-mapped to some standard SQL-database terminology. :(

The mlr join verb could be enhanced to do what you suggest -- I suspect you're not the only person who would benefit from it. :)

johnkerl avatar Sep 08 '21 02:09 johnkerl

I think if there were some more examples of the different kinds of joins in the documentation then that would help a lot.

alexhallam avatar Sep 08 '21 03:09 alexhallam

It seems that when I use --ul or --ur the output is not in a single csv. Is this what I should expect?

alexhallam avatar Sep 08 '21 03:09 alexhallam

Sticking with the above example I am doing this:

mlr --icsv --ocsv join -j a -r e --ul --ur -f data/x.csv data/y.csv

getting this:

a,b,c,f,g
a,t,1,t,3
b,u,2,u,2

e,f,g
d,w,1

a,b,c
c,v,3

but expecting this:

a,b,c,e,f,g
a,t,1,a,t,3
b,u,2,b,u,2
c,v,3,,,

I am curious if there is functionality to do a left-join and get a single csv output.

alexhallam avatar Sep 08 '21 04:09 alexhallam

I am curious if there is functionality to do a left-join and get a single csv output.

You must add unsparsify verb.

Running

mlr --icsv --ocsv join -j a -r e --ul --ur -f x.csv then unsparsify y.csv

you will have

a,b,c,f,g,e
a,t,1,t,3,
b,u,2,u,2,
,,,w,1,d
c,v,3,,,

You have it also in the FAQ https://miller.readthedocs.io/en/latest/faq.html#how-to-rectangularize-after-joins-with-unpaired

In my opinion, when you have CSV output, the unsparsify task should be applied by default, because that kind of output is wrong for a CSV.

If the output format is something like JSONlines, is not a problem, because it doesn't have to be a regular grid made up of xxx rows by yyy columns

{ "a": "a", "b": "t", "c": 1, "f": "t", "g": 3 }
{ "a": "b", "b": "u", "c": 2, "f": "u", "g": 2 }
{ "e": "d", "f": "w", "g": 1 }
{ "a": "c", "b": "v", "c": 3

aborruso avatar Sep 08 '21 06:09 aborruso

Thanks for the reply @aborruso. This is what I was looking for. I now see it in the link. I just did not have the word 'unpaired' come to mind to search for it.

I also think your statement makes sense

In my opinion, when you have CSV output, the unsparsify task should be applied by default, because that kind of output is wrong for a CSV.

Also, just wanted to say thanks @johnkerl for miller. Such a great package!

alexhallam avatar Sep 09 '21 14:09 alexhallam

Thanks @alexhallam & @aborruso. Indeed, unpaired is Miller jargon & the docs need to be clearer about how this maps back and forth with more database-standard terminology like left join and right join etc.

johnkerl avatar Sep 09 '21 15:09 johnkerl

There is a cmd tool trdsql, aiming at sql query for csv data, keep to database-standard.

amitbha avatar Oct 09 '21 10:10 amitbha

See also https://github.com/johnkerl/miller/issues/275

johnkerl avatar Feb 24 '22 04:02 johnkerl