CSV API does not always return CSV file

Open mitar opened this issue 5 years ago • 5 comments

Example dataset: https://www.openml.org/api/v1/json/data/397

When I try to download CSV: https://www.openml.org/data/v1/get_csv/52509

One data row looks like:

{11 7,16 3,205 1,258 2,264 2,272 2,1585 1,1691 1,1939 1,1955 4,1958 1,1966 2,2024 1,2034 1,2047 1,2085 1,2127 1,2149 3,2332 2,2342 1,2370 1,2462 3,2558 1,2594 2,2603 1,2616 1,2623 1,2647 1,2653 2,2695 2,2857 1,3016 1,3046 2,3071 1,3104 1,3125 1,3130 2,3132 2,3154 1,3208 3,3224 3,3262 2,3265 2,3270 1,3272 1,3424 1,3470 1,3483 1,3495 1,3501 1,3561 1,3564 2,3584 1,3621 1,3754 3,3770 1,3787 1,3790 1,3816 2,3904 1,3905 1,3961 1,4122 1,4133 1,4136 1,4164 1,4236 1,4238 1,4239 1,4253 1,4257 1,4267 1,4288 1,4323 1,4331 1,4339 1,4473 4,4488 1,4543 1,4590 2,4606 1,4607 1,4744 1,4758 2,4816 1,4871 1,4996 1,5009 1,5041 1,5044 1,5049 2,5093 1,5114 1,5344 1,5350 1,5366 1,5438 3,5506 1,5507 1,5509 1,5521 4,5522 2,5523 1,5524 1,5525 1,5526 2,5535 1,5536 1,5537 1,5571 1,5578 1,5580 3,5609 2,5804 78}

I think the problem is that the original dataset is sparse. It seems the CSV converter does not handle sparse datasets correctly?

mitar avatar Dec 26 '20 02:12 mitar

Yes, exactly. If a dataset is sparse, the get_csv API returns a sparse representation. I'm not sure it would be wise to expand it to a dense representation in CSV. Sparse datasets are sparse for a reason: the dense representation could be huge.

joaquinvanschoren avatar Dec 26 '20 22:12 joaquinvanschoren
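
To put a rough number on "huge", the sketch below estimates how many times more cells a dense row would hold than the sparse row actually stores. It is only an illustration: `dense_expansion_factor` is a hypothetical helper, it assumes rows of the form `{index value,...}` with zero-based indices and no quoted values, and it uses the largest index in the row as a lower bound on the true column count.

```python
def dense_expansion_factor(line):
    """Lower bound on (dense cells) / (stored cells) for one sparse row."""
    body = line.strip()[1:-1]  # drop the surrounding braces
    # First token of each "index value" pair is the column index.
    indices = [int(pair.split()[0]) for pair in body.split(",") if pair.strip()]
    if not indices:
        raise ValueError("row stores no values, factor is undefined")
    # Zero-based indices: the row needs at least max(index) + 1 columns.
    min_width = max(indices) + 1
    return min_width / len(indices)


# Shortened version of the row quoted in this issue (assumed representative).
print(dense_expansion_factor("{11 7,16 3,205 1,5804 78}"))
```

The real factor depends on the dataset's full column count, which this sketch does not know.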

What is a "sparse representation" in CSV? Is that standardized somewhere?

mitar avatar Dec 27 '20 00:12 mitar

No, I don't think that exists. The (very simple) representation we use goes back to the sparse ARFF format: https://www.cs.waikato.ac.nz/ml/weka/arff.html If there is a better way to represent sparse data in CSV, that would be good to know :).

joaquinvanschoren avatar May 23 '21 21:05 joaquinvanschoren
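
For anyone who needs to read such rows today, here is a minimal parsing sketch. It assumes the sparse ARFF convention described at the link above: each row is a brace-enclosed, comma-separated list of `index value` pairs, indices are zero-based, and omitted attributes are zero. `parse_sparse_row` is a hypothetical helper, not part of any OpenML API, and it does not handle quoted values that contain commas or spaces.

```python
def parse_sparse_row(line):
    """Parse one sparse row into a {column_index: value_string} dict."""
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse row: expected '{...}'")
    body = body[1:-1].strip()
    cells = {}
    if not body:
        return cells  # a row with no stored values
    for pair in body.split(","):
        # Each pair is "<index> <value>", separated by whitespace.
        index, value = pair.strip().split(None, 1)
        cells[int(index)] = value
    return cells


# Shortened version of the row quoted in this issue.
row = parse_sparse_row("{11 7,16 3,205 1,5804 78}")
print(row)
```

Values are kept as strings because the attribute types are declared elsewhere in the dataset metadata, not in the row itself.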

I mean, the problem is that the data row does not look like a row of a valid CSV file. So when an AutoML system tries to process such files automatically, parsing just fails.

Maybe you should just return an error for sparse files? Or convert them to a proper CSV in some sparse representation. I do not think it matters which, as long as it is documented and can be parsed with a regular CSV parser.

Maybe this is a feature of Pandas DataFrame with sparse columns which is missing?

mitar avatar Jun 22 '21 23:06 mitar
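
One possible "proper CSV in some sparse representation" is a long (coordinate) layout with one `row,column,value` line per stored cell. The sketch below is only an illustration of that idea, not what OpenML does: `sparse_rows_to_long_csv` and the header names are hypothetical, and the input is assumed to be brace-enclosed `index value` pairs without quoted values.

```python
import csv
import io


def sparse_rows_to_long_csv(lines):
    """Convert sparse '{index value,...}' rows to long-format CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row", "column", "value"])  # hypothetical header names
    for row_number, line in enumerate(lines):
        body = line.strip()[1:-1].strip()  # drop the surrounding braces
        if not body:
            continue  # row with no stored values
        for pair in body.split(","):
            index, value = pair.strip().split(None, 1)
            writer.writerow([row_number, int(index), value])
    return out.getvalue()


text = sparse_rows_to_long_csv(["{11 7,16 3}", "{0 1,5804 78}"])
# The result round-trips through a regular CSV parser.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

The output grows with the number of stored cells rather than with rows times columns, so it avoids the size concern raised earlier while staying readable by any standard CSV parser.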

Duplicate of #628.

mitar avatar Oct 24 '21 06:10 mitar