Question: availability of parquet files

Open glemaitre opened this issue 4 years ago • 8 comments

In scikit-learn, we were about to bring in a simple new ARFF parser based on pandas.read_csv. In short, it skips the header, reads the dataset and casts the nominal columns (we don't really care about the datetime format). It is 4x to 10x faster and takes half the memory.
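To make the approach concrete, here is a minimal sketch of such a parser. It is not the scikit-learn implementation: the function name `read_dense_arff` is hypothetical, and it assumes a dense ARFF file with unquoted attribute names, `?` for missing values and `%` for comments.

```python
import io

import pandas as pd


def read_dense_arff(text):
    """Parse a dense ARFF document into a DataFrame (illustrative sketch only)."""
    names, nominal = [], {}
    lines = text.splitlines()
    for n_header, line in enumerate(lines, start=1):
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@attribute"):
            # "@attribute <name> <type>"; quoted names with spaces are not handled.
            _, name, kind = stripped.split(None, 2)
            names.append(name)
            if kind.startswith("{"):
                # Nominal attribute: remember the declared categories.
                nominal[name] = [
                    c.strip().strip("'\"") for c in kind.strip("{}").split(",")
                ]
        elif lowered.startswith("@data"):
            break
    else:
        raise ValueError("no @data section found")

    # Everything after the @data line is plain CSV, so pandas can read it.
    frame = pd.read_csv(
        io.StringIO("\n".join(lines[n_header:])),
        header=None,
        names=names,
        na_values="?",
        comment="%",
        skipinitialspace=True,
        dtype={name: str for name in nominal},
    )
    # Cast the nominal columns to categoricals with the declared categories.
    for name, categories in nominal.items():
        frame[name] = pd.Categorical(frame[name], categories=categories)
    return frame
```

Sparse ARFF files, date attributes and quoted values would all need extra handling that this sketch leaves out.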

However, we now wonder whether we should integrate this parser at all, since it could become obsolete. Basically, it depends on when the datasets will be available in parquet format through the OpenML site. I saw in a previous issue that this could be available soon.

Do you have an estimate (even rough) of the timeline for the feature to land?

glemaitre avatar Dec 20 '21 14:12 glemaitre

Hi Guillaume,

I estimate it will be around February 2022. We've converted most of the datasets to parquet, but some take longer (e.g. sparse datasets). Including @prabhant to follow up on this.

This also needs to be merged first: https://github.com/openml/OpenML/pull/1097

joaquinvanschoren avatar Dec 23 '21 20:12 joaquinvanschoren

@mfeurer @joaquinvanschoren Is there any news regarding the parquet format? We can see that the ARFF file is defaulting to old.openml.org, and I can also see that there are some .pq links in the XML.

I was wondering whether it would be safe on our side to rely on this info to load parquet datasets. Is there any case in which only the ARFF file will be available and not the parquet file?

glemaitre avatar May 13 '22 10:05 glemaitre

Hi Guillaume,

The only edge case not fully covered yet is sparse datasets. @prabhant can you please give an update? Also, are we renaming 'minio_url' to 'parquet_url'?

Otherwise, it is safe to start using it. Please note that old.openml.org will stay for a good while, also in production. This is to simplify development of an entirely new backend in Python.

joaquinvanschoren avatar May 13 '22 11:05 joaquinvanschoren

> The only edge case not fully covered yet is sparse datasets.

Cool. At least we can detect this case by looking at the tag and raise a proper error message telling our users to switch parquet on or off.

glemaitre avatar May 13 '22 12:05 glemaitre

Yes, or more generally you can also easily detect when the parquet URL is not available.

joaquinvanschoren avatar May 13 '22 12:05 joaquinvanschoren

Hi, right now it's safe to use the parquet URL for datasets available there. (You'll get an error or a 403 if it's not available.) We are done with converting sparse datasets as well, so after that only very few edge cases will be left (mostly broken datasets that can't even be loaded in pandas).

Note that the first priority for uploading is datasets with the 'active' label. After that we will start uploading inactive ones.

The current estimate for uploading the sparse datasets is the first week of June.
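A client could probe the parquet URL and fall back to ARFF when it is missing, roughly as sketched below. The helper names `parquet_available` and `choose_source` are hypothetical, and the sketch assumes the file server answers HEAD requests. The `opener` parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def parquet_available(url, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` succeeds (assumes HEAD is supported)."""
    if not url:
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.URLError:
        # HTTPError (for example a 403) is a subclass of URLError.
        return False


def choose_source(parquet_url, arff_url, opener=urllib.request.urlopen):
    """Prefer the parquet file when it is reachable, otherwise use the ARFF file."""
    if parquet_available(parquet_url, opener):
        return "parquet", parquet_url
    return "arff", arff_url
```

Probing first costs one extra request per dataset. Attempting the download directly and catching the error would avoid that.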

prabhant avatar May 24 '22 08:05 prabhant

Hi Guillaume, we are renaming minio_url to parquet_url in the API. We are returning both. Please let me know when you are no longer using minio_url. When there are no more dependencies, we'll remove minio_url. Thanks!
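While both fields are returned, a client can prefer the new name and fall back to the old one. The sketch below assumes the dataset description has already been parsed into a plain dict keyed by field name; the function `pick_parquet_url` and that dict layout are hypothetical, not part of the OpenML API.

```python
def pick_parquet_url(description):
    """Return the parquet URL, preferring 'parquet_url' over the older 'minio_url'."""
    for key in ("parquet_url", "minio_url"):
        url = description.get(key)
        if url:
            return url
    # Neither field is present: the caller should fall back to ARFF.
    return None
```

Once minio_url is removed from the API, the loop collapses to a single lookup.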

joaquinvanschoren avatar Jun 23 '22 09:06 joaquinvanschoren

@joaquinvanschoren We have not implemented the feature yet, so we can use parquet_url directly.

glemaitre avatar Jun 27 '22 07:06 glemaitre