miller
miller copied to clipboard
Documentation on joins. How do I not omit results non-matches?
Hi. I have been trying to get mlr to do various joins. I feel like the documentation can provide some examples as a helper to users. Here is a reproducible example of what I am trying to accomplish. I am not sure how to get the results I am looking for.
Data
cat <<EOF >data/x.csv
a,b,c
a,t,1
b,u,2
c,v,3
EOF
cat data/x.csv | tv
cat <<EOF >data/y.csv
e,f,g
a,t,3
b,u,2
d,w,1
EOF
Left join
Try:
mlr --icsv --ocsv join -j a -r e -f data/x.csv data/y.csv
Expected
a,b,c,e,f,g
a,t,1,a,t,3
b,u,2,b,u,2
c,v,3,,,
Actual
a,b,c,f,g
a,t,1,t,3
b,u,2,u,2
Right join
Try:
mlr --icsv --ocsv join -j e -r a -f data/y.csv data/x.csv
Expected
e,f,g,a,b,c
a,t,3,a,t,1
b,u,2,b,u,2
d,w,1,,,
Actual
e,f,g,b,c
a,t,3,t,1
b,u,2,u,2
Inner join
Use what I did for left join
Full Join
idk
Hi @alexhallam ! The way Miller handles non-matches is with the --ul and --ur flags to join.
That said, the semantics of how Miller does joins aren't well-mapped to some standard SQL-database terminology. :(
The mlr join verb could be enhanced to do what you suggest -- I suspect you're not the only person who would benefit from it. :)
I think if there were some more examples of the different kinds of joins in the documentation then that would help a lot.
It seems that when I use --ul or --ur the output is not in a single csv. Is this what I should expect?
Sticking with the above example I am doing this:
mlr --icsv --ocsv join -j a -r e --ul --ur -f data/x.csv data/y.csv
getting this:
a,b,c,f,g
a,t,1,t,3
b,u,2,u,2
e,f,g
d,w,1
a,b,c
c,v,3
but expecting this:
a,b,c,e,f,g
a,t,1,a,t,3
b,u,2,b,u,2
c,v,3,,,
I am curious if there is functionality to do a left-join and get a single csv output.
I am curious if there is functionality to do a left-join and get a single csv output.
You must add unsparsify verb.
Running
mlr --icsv --ocsv join -j a -r e --ul --ur -f x.csv then unsparsify y.csv
you will have
a,b,c,f,g,e
a,t,1,t,3,
b,u,2,u,2,
,,,w,1,d
c,v,3,,,
You have it also in the FAQ https://miller.readthedocs.io/en/latest/faq.html#how-to-rectangularize-after-joins-with-unpaired
In my opinion, when you have CSV output, the unsparsify task should be applied by default, because that kind of output is wrong for a CSV.
If the output format is something like JSONlines, is not a problem, because it doesn't have to be a regular grid made up of xxx rows by yyy columns
{ "a": "a", "b": "t", "c": 1, "f": "t", "g": 3 }
{ "a": "b", "b": "u", "c": 2, "f": "u", "g": 2 }
{ "e": "d", "f": "w", "g": 1 }
{ "a": "c", "b": "v", "c": 3
Thanks for the reply @aborruso. This is what I was looking for. I now see it in the link. I just did not have the word 'unpaired' come to mind to search for it.
I also think your statement makes sense
In my opinion, when you have CSV output, the unsparsify task should be applied by default, because that kind of output is wrong for a CSV.
Also, just wanted to say thanks @johnkerl for miller. Such a great package!
Thanks @alexhallam & @aborruso. Indeed, unpaired is Miller jargon & the docs need to be clearer about how this maps back and forth with more database-standard terminology like left join and right join etc.
There is a cmd tool trdsql, aiming at sql query for csv data, keep to database-standard.
See also https://github.com/johnkerl/miller/issues/275