skrub Better _repr_ for the Bunch object used to hold the datasets

Problem Description

Currently, when a Bunch object is displayed in a notebook cell, it uses the standard dict repr. This makes it hard to see the content, including the names of the keys (sometimes they are X, y and data; in other cases the names are specific to the dataset).

Feature Description

It would be nice to have a clearer repr.
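
As a rough illustration of what a clearer repr could look like, here is a minimal sketch. It assumes a dict-like stand-in class named `DatasetBunch`, which is hypothetical and is not the actual skrub or scikit-learn class; the summary format (one line per key with the value's type and size) is also only an assumption.

```python
class DatasetBunch(dict):
    """Hypothetical dict-like container standing in for the Bunch
    returned by the dataset fetchers (not the real class)."""

    def __getattr__(self, key):
        # Allow attribute-style access to keys, e.g. bunch.X
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __repr__(self):
        # One line per key: show the key name and a short summary of
        # the value instead of dumping the full content.
        lines = [f"{type(self).__name__} with {len(self)} keys:"]
        for key, value in self.items():
            summary = type(value).__name__
            shape = getattr(value, "shape", None)
            if shape is not None:
                # Arrays and dataframes expose a shape attribute
                summary += f", shape={shape}"
            elif hasattr(value, "__len__") and not isinstance(value, str):
                summary += f", len={len(value)}"
            lines.append(f"  {key}: {summary}")
        return "\n".join(lines)


bunch = DatasetBunch(X=[[1, 2], [3, 4]], y=[0, 1], description="toy dataset")
print(bunch)
```

With this sketch the key names are visible at a glance, whatever they are for a given dataset. A notebook-specific rendering could also be provided through `_repr_html_`, which Jupyter uses when it is defined.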

Jun 18 '25 14:06 rcap107

Hello @rcap107!
Shouldn't this be something defined in the scikit-learn class itself?

Jun 20 '25 12:06 MarieSacksick

> Hello @rcap107! Shouldn't this be something defined in the scikit-learn class itself?

Hey @MarieSacksick! In skrub, the Bunch class is only used for fetching the datasets, so I think it would be simpler to extend it here than to update the scikit-learn class.

Jun 20 '25 13:06 rcap107

On this topic, I feel that exposing filenames and changing some examples to load from filename is a higher priority than changing the repr

Jun 24 '25 06:06 GaelVaroquaux

> On this topic, I feel that exposing filenames and changing some examples to load from filename is a higher priority than changing the repr

agreed, this issue is lower priority and mostly here to keep track of it

Jun 24 '25 07:06 rcap107

> On this topic, I feel that exposing filenames and changing some examples to load from filename is a higher priority than changing the repr

You mean using the expressions or directly via {pd, pl}.read_{csv, parquet}?

Jun 25 '25 10:06 Vincent-Maladiere

> You mean using the expressions or directly via {pd, pl}.read_{csv, parquet}?

Directly via the readers.

But it opens the door to other examples that demonstrate I/O patterns with expressions.

Jun 25 '25 11:06 GaelVaroquaux