
Various features and improvements for the skrub.datasets utilities

Open rcap107 opened this issue 6 months ago • 8 comments

I think the skrub.datasets utilities for loading and returning datasets could be improved a bit for a better user experience:

  • skrub supports both pandas and polars, but datasets are always returned as pandas. It would be good to add an option to return the dataframes as polars dataframes instead; this could be implemented as an option for #1377 (a "preferred dataframe engine" setting, for example).
  • Currently, datasets are returned as a Bunch object, which is essentially a dict that prints the full repr of the dataframes in a notebook cell. It would be better to have a more compact repr that shows the name and shape of each dataframe in the bunch, as well as metadata info if available (see the sketch after this list).
  • As already mentioned in #1252, it would be useful to have more info on each dataset (at least its size) in the doc entry of the corresponding fetching function.
  • It would be nice to have an option to save the datasets as parquet rather than csv (this could be another option for #1377).
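
As a minimal sketch of the compact repr idea (the DatasetBunch class below is purely illustrative, not skrub code):

# Illustrative only: a dict-like bunch whose repr lists names, shapes
# and metadata instead of printing the full dataframes.
class DatasetBunch(dict):
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(key) from e

    def __repr__(self):
        parts = []
        for name, value in self.items():
            shape = getattr(value, "shape", None)
            if shape is not None:  # dataframe-like entry
                parts.append(f"{name}: DataFrame{shape}")
            else:  # metadata entry
                parts.append(f"{name}: {type(value).__name__}")
        return f"DatasetBunch({', '.join(parts)})"

Printing a bunch with an X table and a description would then show something like DatasetBunch(X: DataFrame(1000, 5), description: str) instead of the full tables.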

rcap107 · May 28 '25 13:05

Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data from those files.

I think that this is a better pattern. It's closer to code that people might actually write.

GaelVaroquaux · May 28 '25 13:05

> Rather than supporting polars, I would prefer adding the filenames and changing many of our examples to load the data from those files.
>
> I think that this is a better pattern. It's closer to code that people might actually write.

So rather than

data = skrub.datasets.fetch_toxicity()

something like this?

import polars as pl

data_dir = skrub.datasets.get_data_dir()
data = pl.read_csv(data_dir / "toxicity/toxicity/toxicity.csv")

ideally less janky than that

rcap107 · May 28 '25 13:05

Not "get_data_dir": we need to trigger the download.

More like:

data = skrub.datasets.fetch_toxicity()
df = pl.read_csv(data.filename)

GaelVaroquaux · May 28 '25 14:05

Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

Vincent-Maladiere · May 28 '25 14:05

> Hey @rcap107, these are interesting ideas! That said, I’d challenge the priority of working on this, as it doesn’t seem to bring much benefit to users.

I agree it's not a particularly urgent issue. I noted it down as I was working on an example in case someone wants to work on it.

rcap107 · May 28 '25 14:05

This could be a good first issue to suggest during sprints indeed!

Vincent-Maladiere · May 28 '25 15:05

> This could be a good first issue to suggest during sprints indeed!

I would start simple by suggesting the "filename" thing (maybe create a subissue with a "good first issue" label).

GaelVaroquaux · May 28 '25 15:05

I'm working on this issue now. I was looking at the datasets currently stored in https://github.com/skrub-data/skrub-data-files, and I was wondering if we should convert all the files from csv to parquet (for all sorts of reasons).

Then, the script that fetches the datasets would return a Bunch object containing the paths to the (currently csv, later parquet) files, instead of the current version, which returns pandas dataframes.
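
A rough sketch of the shape this could take (the function name, download helper, and Bunch fields below are purely illustrative, not the actual skrub implementation):

# Hypothetical sketch: the download helper and the exact Bunch fields
# are illustrative, not skrub's real code.
import skrub.datasets
from pathlib import Path
from sklearn.utils import Bunch

def _download_if_missing(csv_path):
    # Placeholder for the real download-and-cache logic.
    raise NotImplementedError

def fetch_toxicity_paths() -> Bunch:
    data_dir = Path(skrub.datasets.get_data_dir())
    csv_path = data_dir / "toxicity" / "toxicity.csv"
    if not csv_path.exists():
        _download_if_missing(csv_path)
    # Return the path (plus metadata) rather than a loaded dataframe,
    # so users can read the file with pandas, polars, or anything else.
    return Bunch(name="toxicity", path=csv_path)

Users could then write pl.read_csv(fetch_toxicity_paths().path), matching the pattern suggested above.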

~~Just to be sure, I double-checked that parquet files can be read with the min-reqs configuration, so that wouldn't be a problem.~~ I was baited into a false sense of security: min-reqs can't read parquet, so this is actually a problem.
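
For reference, pandas.read_parquet needs pyarrow or fastparquet, neither of which is in the minimal requirements, so a fallback along these lines would be needed (illustrative sketch only, assuming both file formats are present in the cache):

# Illustrative sketch: fall back to csv when no parquet engine is
# installed; pandas raises ImportError from read_parquet in that case.
import pandas as pd

def read_table(stem):
    try:
        return pd.read_parquet(f"{stem}.parquet")
    except ImportError:
        return pd.read_csv(f"{stem}.csv")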

rcap107 · Sep 09 '25 13:09