[WIP] Add `skrub.Report`
Nitpick: I think that it should be called "TableReport" :)
Nitpick: I think that it should be called "TableReport" :)
Good point -- we may want to add other kinds of reports, e.g. on models.
Hi Jérôme,
I'm soooo much looking forward to this!
When you are getting close to the review, can you please make sure that we have an example that shows the report. It will help reviewing as well as broadcasting the functionality.
thanks!!
For now I have added one in example 01; we can see it in the CircleCI artifact.
I think we're not too far from being able to start the reviews. I'm just making a couple of adjustments for old versions of pandas and matplotlib, and I still need to add tests; for now they are only smoke tests.
I have a question: the reports require matplotlib and jinja2; should those be required or optional dependencies? In my experience it is a bit annoying when dependencies for a rather central feature are optional, because you have to hit the ImportError and install the missing package manually, and typical data-science environments will have matplotlib and probably jinja. On the other hand, most of skrub doesn't need them, so it may be bad to ask users who don't need the reports to install them.
It would be nice if the package metadata could distinguish between "required" and "default" dependencies ...
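If we do end up making them optional, one option is to check for them lazily and raise an informative error instead of a bare ImportError. A minimal sketch, assuming a hypothetical helper (this is not skrub's actual implementation, and the message wording is made up):

```python
import importlib.util


def check_report_dependencies():
    """Raise an informative error if the report dependencies are missing.

    Hypothetical helper for illustration; the real skrub code may handle
    optional dependencies differently.
    """
    missing = [
        module
        for module in ("matplotlib", "jinja2")
        if importlib.util.find_spec(module) is None
    ]
    if missing:
        raise ImportError(
            "Generating reports requires the optional dependencies "
            f"{', '.join(missing)}. Install them with: "
            f"pip install {' '.join(missing)}"
        )
```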
the reports require matplotlib and jinja2; should those be required or optional dependencies?
I guess required
In my experience it is a bit annoying when dependencies for a rather central feature are optional, because you have to hit the ImportError and install the missing package manually, and typical data-science environments will have matplotlib and probably jinja.
Right, but they may not be needed on production servers. Let's worry about this later
With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom
Also, in terms of examples, I don't love that the TableReport somewhat clutters an example that is unrelated to it.
How about a new example called "Quick start to skrub" that's quite short, shows the TableReport and the tabular_learner, and explains that the tabular_learner creates pipelines with the TableVectorizer?
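To make the suggestion concrete, here is a rough sketch of what such a quick-start example might contain (the dataset, the exact TableReport constructor, and the way it displays are assumptions; the final example may of course differ):

```python
# Hypothetical outline of a "Quick start to skrub" example.
from skrub import TableReport, tabular_learner
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
df = dataset.X

# 1. Get an interactive overview of the dataframe.
report = TableReport(df)
report  # in a notebook or gallery example, this displays the HTML report

# 2. Build a baseline predictive pipeline; tabular_learner assembles a
#    pipeline with a TableVectorizer followed by a gradient-boosted model.
model = tabular_learner("regressor")
```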
With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom
Or maybe you'll need to have them appear below, to stay in the div or the report, as the enclosing environment may hide overflow. This can be done as such https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_top
Note: to try the command-line script, you probably need to install skrub again, even if it was installed in editable mode:
pip install -e .
skrub-report /my/file.parquet
Another comment (sorry, I'm reviewing in short bursts): should we add TableReport to the API page?
With regards to the tooltips, can you make them appear without waiting
I made this change, but looking at them again I'm not sure they are really useful; we could consider removing them altogether.
Can you add section titles to the examples: one section title per feature that you highlight
With regards to the tooltips, can you make them appear without waiting
I made this change, but looking at them again I'm not sure they are really useful; we could consider removing them altogether.
I think that they are super useful. I would favor keeping them: they help understand what is going on, but do not clutter.
I think that they are super useful. I would favor keeping them: they help understand what is going on, but do not clutter.
ok then, let's keep them!
I wonder whether it would be useful to add a border as a separation: the problem that I see is that the bottom of the object may not be well separated from what follows below. This happens in the documentation:
I see two ways of doing it:
- Full borders all around
.tab-set {
border: 1px solid var(--darkg);
border-top: 0px;
border-radius: 5px;
}
- Border only on the bottom, to be lighter:
/* Element | https://output.circle-artifacts.com/output/job/0776ea43-c78d-4fc9-ae32-14fa1899a4f7/artifacts/0/doc/auto_examples/00_quick_start.html#sphx-glr-auto-examples-00-quick-start-py */
.tab-set {
border-bottom: 1px solid var(--darkg);
}
Maybe the CSS library that does the tabs also has a way to do this natively, which would probably be preferable.
@jeromedockes : tell us when this is ready for merge
I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?
I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?
Yes, except for a few small details (handling a polars deprecation warning that was causing the CI to fail, and using a pytest fixture I needed for this PR in 2 other places where it simplified the tests). But 2.5k lines of the diff are in the pixi.lock, due to adding the matplotlib and jinja dependencies.
@jeromedockes : tell us when this is ready for merge
OK, I think probably today; I'm just adding a few docstrings in the private modules.
But 2.5k lines of diff are in the pixi.lock due to adding the matplotlib and jinja dependencies
and another 2k lines in the package-lock.json of the JavaScript module used to test the reports. As those package JSON files are quite big, we may want to consider using the MANIFEST.in to exclude them from the package distribution.
I'm fine with moving the README to the docs later: I've added an issue here: https://github.com/skrub-data/skrub/pull/984
OK, I'll have another look tomorrow, but I think I have addressed most comments from @GaelVaroquaux.
Do you want to work on your online visualizer and add it to the docs, or should we rather merge first (probably a good idea)?
I think we can merge this PR.
If at some point we are happy enough with the online visualizer to add it to the docs, we can do another PR. It might be a good idea to wait for the online version before advertising or releasing the reports, but not before merging the PR, IMO.
I also want to exclude the js_tests/ folder from the distribution package in a MANIFEST.in (it saves a few kB in the wheel), but I'll do that in a dedicated PR because there are other files we may want to exclude -- e.g. we can divide the size of the source tarball by more than 5 if we exclude the parquet files in benchmarks/.
Failing test (a quick look suggests that it is related to the PR).
Playing with the report, I just had an idea for what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined; I'd say 0.9), add to the drop-down menu that selects which columns to display an entry named "very similar columns", which would only select columns that have a similarity measure with another column above the threshold.
Failing test (a quick look suggests that it is related to the PR).
Yes, I'm on it :) I made a small change to how labels are rotated to prevent overlap, but apparently used something the oldest supported matplotlib doesn't like.
Playing with the report, I just had an idea for what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined; I'd say 0.9), add to the drop-down menu that selects which columns to display an entry named "very similar columns", which would only select columns that have a similarity measure with another column above the threshold.
I like it! Indeed, it's just a question of adding a list of those column names in a dictionary somewhere, so I'll do it in this PR.
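Roughly, something along these lines (a standalone sketch using pandas; the absolute Pearson correlation stands in for whatever association measure the report actually computes, and the names here are made up):

```python
import pandas as pd


def very_similar_columns(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Columns whose similarity with at least one other column reaches the threshold.

    Illustration only: "similarity" here is the absolute Pearson correlation
    between numeric columns, not the measure used in the report.
    """
    corr = df.select_dtypes("number").corr().abs()
    similar = set()
    for i, col_a in enumerate(corr.columns):
        for col_b in corr.columns[i + 1 :]:
            if corr.loc[col_a, col_b] >= threshold:
                similar.update([col_a, col_b])
    return sorted(similar)


# The resulting list would then just be stored with the rest of the report data,
# e.g. report_data["very similar columns"] = very_similar_columns(df)
# (where report_data is a hypothetical name for that dictionary).
```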
I like it! Indeed, it's just a question of adding a list of those column names in a dictionary somewhere, so I'll do it in this PR.
Awesome. I definitely see myself using this functionality
Awesome. I definitely see myself using this functionality
Any suggestion for a short description of those columns in the drop-down? "Columns with high similarity"?
"Columns with high similarity"?
:+1: