Question related to unsparsify
Originally posted by @aborruso in https://github.com/johnkerl/miller/issues/1418#issuecomment-1940962653:
@johnkerl I have an unsparsify-related question.
I have this input CSV:
id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20
If I run
mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv
I get this error:
mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.
The reshape itself is wrong, because I should cut both Gender and id. But if I change the format to --c2m, I get no error, probably because the unsparsify command is not forced.
So it is probably okay to have that error, but for a non-expert user the message does not help in finding the solution. What do you think? I'm not suggesting any improvement, because I don't have one at the moment.
Thank you
@aborruso most definitely this cannot be accommodated within CSV output format:
$ mlr --c2j cut -x -f Gender then reshape -s Category,Amount input.csv
[
{
"id": 1,
"Year": 2019,
"Neighbourhood_name": "Emilstorp",
"0-5 years": 15
},
{
"id": 2,
"Year": 2019,
"Neighbourhood_name": "Emilstorp",
"6-15 years": 25
},
{
"id": 3,
"Year": 2021,
"Neighbourhood_name": "Emilstorp",
"0-5 years": 20
}
]
This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔
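To illustrate the constraint described above, here is a minimal Python sketch of a streaming CSV writer. It is not Miller's actual implementation: the names `write_csv_streaming` and `SchemaChangeError` are hypothetical, and the rule is deliberately simplified so that any change in the key list counts as an error. The records are the ones shown in the JSON output above.

```python
import csv
import io


class SchemaChangeError(Exception):
    """Hypothetical error type for this sketch; not a Miller API."""


def write_csv_streaming(records, out):
    """Write records one at a time, fixing the header from the first record.

    Simplifying assumption: any later record whose key list differs from
    the header is rejected, because the header line was already written.
    """
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for record in records:
        keys = list(record.keys())
        if header is None:
            header = keys
            writer.writerow(header)
        elif keys != header:
            first = ",".join(header)
            current = ",".join(keys)
            raise SchemaChangeError(
                'first keys "{}"; current keys "{}"'.format(first, current)
            )
        writer.writerow(record.values())


# Records as produced by: cut -x -f Gender then reshape -s Category,Amount
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

out = io.StringIO()
error_message = None
try:
    write_csv_streaming(records, out)
except SchemaChangeError as err:
    error_message = str(err)

print(out.getvalue(), end="")
print("error:", error_message)
```

In this sketch the first record is written before the second one is even seen, so the header cannot be revised afterwards. That is why the writer can only stop with an error once the keys change.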
Please don't hate me if I write stupid things now. But wasn't automatic unsparsify introduced for when the output is CSV?
I'll try to explain with examples. With Miller 5 I get this sparse output:
id,Year,Neighbourhood_name,0-5 years
1,2019,Emilstorp,15
id,Year,Neighbourhood_name,6-15 years
2,2019,Emilstorp,25
id,Year,Neighbourhood_name,0-5 years
3,2021,Emilstorp,20
If I use Miller 6, I get the error, because CSV output must have the same keys for all rows. Why don't I get the same output as in Miller 5?
If I explicitly apply unsparsify in Miller 6, I get no error:
mlrgo --csv cut -x -f Gender then reshape -s Category,Amount then unsparsify input.csv
So I thought that, without unsparsify, Miller 6 would give me one of these two outputs:
- either the one equal to Miller 5;
- or the one with automatically applied unsparsify
For the error message, you are right, it is understandable.
Dear @johnkerl, I probably wasn't very clear, so I'll try to explain again.
If I have this input
{"a":3,"b":"hello"}
{"a":2}
I can write mlr --ijsonl --ocsv cat input.jsonl, without the need to add the unsparsify command. It is applied by default, since the output is CSV.
If instead I have this input
id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20
and I run mlrgo --csv cut -x -f Gender then reshape -s Category,Amount input.csv, I must add the verb unsparsify, although the output here is also CSV. Otherwise I get an error.
Couldn't unsparsify always be applied as an implied final verb whenever the output is a rectangular format?
Thank you
@aborruso this is one of those cases where we would need to read all output before producing any output, and I'm not comfortable doing that as a default behavior. That would break Miller's streaming-when-it-can feature, which is one of its great strengths, only to accommodate some corner-case data. Since the data being produced here are irregular, manually specified unsparsify is the correct approach.
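The streaming trade-off described above can be sketched in Python. This is an illustration only, not Miller's code: the function name `unsparsify` mirrors the verb, but the body, the `source` helper, and the `consumed` list are assumptions made for the example. Filling missing fields with an empty string matches the verb's default. The point is that the union of keys is unknown until the last record has been read, so nothing can be emitted before then.

```python
def unsparsify(records, fill_with=""):
    """Yield records that all carry the union of keys seen in the input.

    Every record is buffered first: the full key set is only known
    after the whole input has been read, so this cannot stream.
    """
    buffered = list(records)  # must hold everything in memory
    all_keys = {}  # dict used as an insertion-ordered set
    for record in buffered:
        for key in record:
            all_keys.setdefault(key, None)
    for record in buffered:
        yield {key: record.get(key, fill_with) for key in all_keys}


# Records as produced by: cut -x -f Gender then reshape -s Category,Amount
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

consumed = []  # ids of the records pulled from the source so far


def source():
    """Hypothetical record source that logs each record as it is read."""
    for record in records:
        consumed.append(record["id"])
        yield record


stream = unsparsify(source())
first = next(stream)
consumed_before_first_output = list(consumed)
rest = list(stream)

print(consumed_before_first_output)
for row in [first] + rest:
    print(row)
```

Here `consumed_before_first_output` already lists all three ids when the first output record appears, which is the behavior that would be forced on every CSV-output pipeline if unsparsify were applied implicitly.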
Thank you very much
@johnkerl as far as I'm concerned, you can close this. I got confused because I could not find in the documentation that this automatic behavior only occurs when Miller can keep streaming. I probably just missed it; I'm sure it is explained very well.
Thank you, and sorry for this somewhat off-topic and mistaken issue