Question related to unsparsify

Open johnkerl opened this issue 1 year ago • 2 comments

Originally posted by @aborruso in https://github.com/johnkerl/miller/issues/1418#issuecomment-1940962653:

@johnkerl I have an unsparsify-related question.

I have this input csv:

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

If I run

mlr --csv cut -x -f Gender then reshape -s Category,Amount input.csv

I get this error:

mlr: CSV schema change: first keys "id,Year,Neighbourhood_name,0-5 years"; current keys "id,Year,Neighbourhood_name,6-15 years"
mlr: exiting due to data error.

The reshape itself is wrong, because I should cut both Gender and id. But if I change the format to --c2m, I get no error, probably because the unsparsify command is not forced.

So it is probably okay to have that error, but for a non-expert user the message does not help to find the solution. What do you think? I'm not suggesting an improvement, because I don't have one at the moment.

Thank you

johnkerl avatar Feb 13 '24 14:02 johnkerl

@aborruso most definitely this cannot be accommodated within CSV output format:

$ mlr --c2j cut -x -f Gender then reshape -s Category,Amount input.csv
[
{
  "id": 1,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 15
},
{
  "id": 2,
  "Year": 2019,
  "Neighbourhood_name": "Emilstorp",
  "6-15 years": 25
},
{
  "id": 3,
  "Year": 2021,
  "Neighbourhood_name": "Emilstorp",
  "0-5 years": 20
}
]

This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔
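To illustrate the constraint described above, here is a minimal Python sketch of a streaming CSV writer that emits one header and fails as soon as a later record has different keys. It is a toy model, not Miller's actual writer: the function name `write_csv_strict` is hypothetical, quoting is ignored, and the toy rejects every key difference, which may be stricter than what Miller does.

```python
import io


def write_csv_strict(records, out):
    """Stream records to CSV; raise as soon as the key list changes."""
    header = None
    for rec in records:
        keys = list(rec)
        if header is None:
            # The first record fixes the header for the whole output.
            header = keys
            out.write(",".join(header) + "\n")
        elif keys != header:
            first = ",".join(header)
            current = ",".join(keys)
            raise ValueError(
                f'schema change: first keys "{first}"; current keys "{current}"'
            )
        out.write(",".join(str(v) for v in rec.values()) + "\n")


# The three records from the JSON output above.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

out = io.StringIO()
error_message = None
try:
    write_csv_strict(records, out)
except ValueError as err:
    error_message = str(err)
```

In this sketch the header and first row have already been written by the time the second record arrives, so a streaming writer cannot go back and widen the header; it can only stop.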

johnkerl avatar Feb 18 '24 17:02 johnkerl

This is not at all related to unsparsify. It's because CSV output must have the same keys for all rows. I had really hoped the error message we're seeing now would be clear ... and to me it is ... but it is not clear to everyone ... 🤔

Please don't hate me if I write stupid things now. But wasn't automatic unsparsify introduced for CSV output?

I'll try to explain with examples. With Miller 5 I get this sparse output, with a new header block each time the keys change:

id,Year,Neighbourhood_name,0-5 years
1,2019,Emilstorp,15

id,Year,Neighbourhood_name,6-15 years
2,2019,Emilstorp,25

id,Year,Neighbourhood_name,0-5 years
3,2021,Emilstorp,20

With Miller 6 I get the error, because CSV output must have the same keys for all rows. Why don't I get the same output as in Miller 5?

If I explicitly apply unsparsify in Miller 6, I get no error:

mlrgo --csv cut -x -f Gender then reshape -s Category,Amount then unsparsify input.csv

So I thought that without unsparsify I could have in Miller 6 one of these two outputs:

  • either the one equal to Miller 5;
  • or the one with automatically applied unsparsify

For the error message, you are right, it is understandable.
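The first of the two alternatives, the Miller 5 style output, can be sketched as a writer that starts a new header block whenever the keys change. This is a toy Python model with a hypothetical function name (`write_csv_header_blocks`), not Miller's code, and it ignores quoting. Miller 6's CSV-lite format (`--ocsvlite`) is documented as permitting this kind of schema change, which is worth confirming against the file-formats documentation for your version.

```python
import io


def write_csv_header_blocks(records, out):
    """Stream records; on a key change, emit a blank line and a new header."""
    header = None
    for rec in records:
        keys = list(rec)
        if keys != header:
            if header is not None:
                out.write("\n")  # blank line separates header blocks
            header = keys
            out.write(",".join(header) + "\n")
        out.write(",".join(str(v) for v in rec.values()) + "\n")


# The records produced by cut + reshape in the example above.
records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

out = io.StringIO()
write_csv_header_blocks(records, out)
text = out.getvalue()
```

This approach stays streaming, because each record is written as soon as it arrives, but the result is no longer a single rectangular table.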

aborruso avatar Feb 18 '24 17:02 aborruso

Dear @johnkerl I probably wasn't very clear and I'll try to explain again.

If I have this input

{"a":3,"b":"hello"}
{"a":2}

I can write mlr --ijsonl --ocsv cat input.jsonl, without the need to add the unsparsify command. It is applied by default, since the output is CSV.

Instead if I have this input

id,Year,Neighbourhood_name,Category,Gender,Amount
1,2019,Emilstorp,0-5 years,Male,15
2,2019,Emilstorp,6-15 years,Female,25
3,2021,Emilstorp,0-5 years,Male,20

and I run mlrgo --csv cut -x -f Gender then reshape -s Category,Amount input.csv, I must add the verb unsparsify, although the output here is also CSV. Otherwise I have an error.

Couldn't unsparsify always be applied implicitly as the final verb whenever the output is a rectangular format?

Thank you

aborruso avatar Feb 25 '24 12:02 aborruso

@aborruso this is one of those cases where we would need to read all output before producing any output, and I'm not comfortable doing that as a default behavior. That would break Miller's streaming-when-it-can feature, which is one of its great strengths, only to accommodate some corner-case data. Since the data being produced here are irregular, manually specified unsparsify is the correct approach.
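The streaming cost can be seen in a minimal Python sketch of an unsparsify-like step. This is a toy model under stated assumptions, not Miller's implementation: the names `unsparsify`, `source`, and `consumed` are illustrative, and missing cells are filled with an empty string.

```python
def unsparsify(records, fill=""):
    """Non-streaming: every record must be read before the first is emitted."""
    buffered = []
    all_keys = {}  # dict used as an insertion-ordered set of keys
    for rec in records:
        buffered.append(rec)
        for key in rec:
            all_keys.setdefault(key, None)
    # Only now is the full key union known, so output can begin.
    for rec in buffered:
        yield {key: rec.get(key, fill) for key in all_keys}


records = [
    {"id": 1, "Year": 2019, "Neighbourhood_name": "Emilstorp", "0-5 years": 15},
    {"id": 2, "Year": 2019, "Neighbourhood_name": "Emilstorp", "6-15 years": 25},
    {"id": 3, "Year": 2021, "Neighbourhood_name": "Emilstorp", "0-5 years": 20},
]

consumed = []  # ids pulled from the input so far


def source():
    for rec in records:
        consumed.append(rec["id"])
        yield rec


stream = unsparsify(source())
first = next(stream)  # asking for one output record drains the whole input
```

After `next(stream)` returns, `consumed` already holds all three ids: the first output row cannot be written until the last input row has been seen, and every record is held in memory in the meantime. The help text for Miller's unsparsify verb also describes an `-f` option that names the fields up front, which, if it fits the data, lets the verb stream because the key union is known in advance.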

johnkerl avatar Feb 26 '24 05:02 johnkerl

Thank you very much

aborruso avatar Feb 26 '24 06:02 aborruso

@johnkerl for me you can close this. I got confused because I could not find in the documentation that this automatic behavior only applies when Miller is "streaming-when-it-can". I probably just missed it; I'm sure it is explained very well.

Thank you and sorry for this somewhat off-topic and erroneous issue

aborruso avatar Feb 26 '24 07:02 aborruso