CSV API does not always return a CSV file
Example dataset: https://www.openml.org/api/v1/json/data/397
When I try to download the CSV: https://www.openml.org/data/v1/get_csv/52509
One data row looks like:
{11 7,16 3,205 1,258 2,264 2,272 2,1585 1,1691 1,1939 1,1955 4,1958 1,1966 2,2024 1,2034 1,2047 1,2085 1,2127 1,2149 3,2332 2,2342 1,2370 1,2462 3,2558 1,2594 2,2603 1,2616 1,2623 1,2647 1,2653 2,2695 2,2857 1,3016 1,3046 2,3071 1,3104 1,3125 1,3130 2,3132 2,3154 1,3208 3,3224 3,3262 2,3265 2,3270 1,3272 1,3424 1,3470 1,3483 1,3495 1,3501 1,3561 1,3564 2,3584 1,3621 1,3754 3,3770 1,3787 1,3790 1,3816 2,3904 1,3905 1,3961 1,4122 1,4133 1,4136 1,4164 1,4236 1,4238 1,4239 1,4253 1,4257 1,4267 1,4288 1,4323 1,4331 1,4339 1,4473 4,4488 1,4543 1,4590 2,4606 1,4607 1,4744 1,4758 2,4816 1,4871 1,4996 1,5009 1,5041 1,5044 1,5049 2,5093 1,5114 1,5344 1,5350 1,5366 1,5438 3,5506 1,5507 1,5509 1,5521 4,5522 2,5523 1,5524 1,5525 1,5526 2,5535 1,5536 1,5537 1,5571 1,5578 1,5580 3,5609 2,5804 78}
I think the problem is that the original dataset is sparse. It seems the CSV converter does not work correctly for sparse datasets?
Yes, exactly. If a dataset is sparse, the get_csv API gives you a sparse representation. I'm not sure it would be wise to expand it to a dense representation in CSV. Sparse datasets are sparse for a reason, and the dense representation could be huge.
What is a "sparse representation" in CSV? Is that standardized somewhere?
No, I don't think that exists. The (very simple) representation we use goes back to the ARFF format: https://www.cs.waikato.ac.nz/ml/weka/arff.html If there is a better way to represent sparse data in CSV, that would be good to know :).
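For anyone hitting this in the meantime, here is a minimal sketch of how a row like the one above could be parsed by hand, assuming each row is `{index value,index value,...}` with whitespace between index and value. The function name `parse_sparse_row` is hypothetical and not part of any OpenML client. Values are kept as strings, and quoted values containing commas are not handled.

```python
def parse_sparse_row(line):
    """Parse one '{index value,index value,...}' row into {index: value}.

    Hypothetical helper: values stay as strings, and quoted values that
    contain commas are NOT supported by this simple split.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse row: expected '{...}'")
    body = body[1:-1].strip()
    row = {}
    if not body:
        # '{}' means every attribute has its default (zero) value.
        return row
    for pair in body.split(","):
        # Each pair is '<attribute index> <value>'.
        index, value = pair.strip().split(None, 1)
        row[int(index)] = value
    return row


if __name__ == "__main__":
    print(parse_sparse_row("{11 7,16 3,205 1,5804 78}"))
```

Any index that is missing from the resulting dictionary is implicitly zero, which is the whole point of the sparse form.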
I mean, the problem is that this data row is not valid CSV. So when one automatically tries to process such files in an AutoML system, parsing just fails.
Maybe you should just return a failure on sparse files? Or convert them to a proper CSV in some sparse representation. I do not think it matters which, just that it is documented and that one can parse it with a regular CSV parser.
Maybe what is missing here is support for Pandas DataFrames with sparse columns?
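To make the second option concrete: one possible sparse representation that any regular CSV parser can read is a list of (row, column, value) triples. This is only a sketch of the idea, not something the API does today. The function name `sparse_rows_to_triples_csv` and the header names `row`, `column`, `value` are assumptions made for illustration, and quoted values containing commas in the input are not handled.

```python
import csv
import io


def sparse_rows_to_triples_csv(lines):
    """Convert '{index value,...}' rows into 'row,column,value' CSV text.

    Hypothetical helper: one output record per non-default cell.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row", "column", "value"])
    for row_number, line in enumerate(lines):
        body = line.strip().strip("{}").strip()
        if not body:
            # An empty row has no non-default cells to write.
            continue
        for pair in body.split(","):
            column, value = pair.strip().split(None, 1)
            writer.writerow([row_number, int(column), value])
    return out.getvalue()


if __name__ == "__main__":
    text = sparse_rows_to_triples_csv(["{11 7,16 3}", "{2 1}"])
    print(text)
    # The result parses with the standard csv module.
    for record in csv.DictReader(io.StringIO(text)):
        print(record["row"], record["column"], record["value"])
```

The file stays proportional to the number of non-default cells, so it avoids the size blow-up of a dense CSV, at the cost of the consumer having to know the number of columns from the dataset metadata.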
Duplicate of #628.