skrub icon indicating copy to clipboard operation
skrub copied to clipboard

[WIP] Add `skrub.Report`

Open jeromedockes opened this issue 1 year ago • 2 comments

moving skrubview into skrub

jeromedockes avatar Jul 02 '24 10:07 jeromedockes

Nitpick: I think that it should be called "TableReport" :)

GaelVaroquaux avatar Jul 02 '24 12:07 GaelVaroquaux

Nitpick: I think that it should be called "TableReport" :)

good point -- we may want to add other kinds of reports eg on models

jeromedockes avatar Jul 02 '24 13:07 jeromedockes

Hi Jérôme,

I'm soooo much looking forward to this!

When you are getting close to the review, can you please make sure that we have an example that shows the report. It will help reviewing as well as broadcasting the functionality.

GaelVaroquaux avatar Jul 04 '24 15:07 GaelVaroquaux

thanks!!

for now I have added one in example 01, we can see it in the circleCI artifact

I think we're not too far from being able to start the reviews, I'm just making a couple of adjustments for old versions of pandas and matplotlib and I still need to add tests for now they are only smoke tests

jeromedockes avatar Jul 04 '24 15:07 jeromedockes

I have a question: the reports require matplotlib and jinja2; should those be required or optional dependencies? In my experience it is a bit annoying when dependencies for a rather central feature are optional because you have to see the importerror and install the missing package manually, and typical data-science environements will have matplotlib and probably jinja. OTOH most of skrub doesn't need them so it may be bad to ask users who don't need the reports to install them.

It would be nice if the package metadata could distinguish between "required" and "default" dependencies ...

jeromedockes avatar Jul 04 '24 15:07 jeromedockes

the reports require matplotlib and jinja2; should those be required or optional dependencies?

I guess required

In my experience it is a bit annoying when dependencies for a rather central feature are optional because you have to see the importerror and install the missing package manually, and typical data-science environements will have matplotlib and probably jinja.

Right, but they may not be needed on production servers. Let's worry about this later

GaelVaroquaux avatar Jul 04 '24 15:07 GaelVaroquaux

With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom

GaelVaroquaux avatar Jul 09 '24 09:07 GaelVaroquaux

Also, in terms of examples, I don't love that the TableReport clutters a bit an example that is unrelated to it.

How about a new example that's call "Quick start to skrub", that's quite short and shows the TableReport, the tabular_learner, and explains that the tabular_learner creates pipelines with the TableVectorizer ?

GaelVaroquaux avatar Jul 09 '24 09:07 GaelVaroquaux

With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom

Or maybe you'll need to have them appear below, to stay in the div or the report, as the enclosing environment may hide overflow. This can be done as such https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_top

GaelVaroquaux avatar Jul 09 '24 09:07 GaelVaroquaux

Note: to try the command-line script you probably need to install skrub again, even if it was installed in editable mode

pip install -e .
skrub-report /my/file.parquet

jeromedockes avatar Jul 09 '24 09:07 jeromedockes

Another comment (sorry, reviewing by small time lapses): should we add TableReport to the API page?

GaelVaroquaux avatar Jul 09 '24 09:07 GaelVaroquaux

With regards to the tooltips, can you make them appear without waiting

I made this change but looking a them again I'm not sure they are really useful, we could consider removing them altogether

jeromedockes avatar Jul 09 '24 11:07 jeromedockes

Can you add section titles to the examples: one section title per feature that you highlight

GaelVaroquaux avatar Jul 09 '24 13:07 GaelVaroquaux

With regards to the tooltips, can you make them appear without waiting

I made this change but looking a them again I'm not sure they are really useful, we could consider removing them altogether

I think that they are super useful. I would favor keeping them: they help understanding what is going on, but do not clutter

GaelVaroquaux avatar Jul 09 '24 13:07 GaelVaroquaux

I think that they are super useful. I would favor keeping them: they help understanding what is going on, but do not clutter

ok then, let's keep them!

jeromedockes avatar Jul 09 '24 14:07 jeromedockes

I wonder whether it would not be useful to add a border separation: the problem that I see is that in the bottom of the object may not be well separated from what follows below. This happens in the documentation: image

I see two ways of doing it:

  1. Full borders all around image
.tab-set {
  border: 1px solid var(--darkg);
  border-top: 0px;
  border-radius: 5px;
}
  1. Border only on the bottom, to be lighter: image

/* Element | https://output.circle-artifacts.com/output/job/0776ea43-c78d-4fc9-ae32-14fa1899a4f7/artifacts/0/doc/auto_examples/00_quick_start.html#sphx-glr-auto-examples-00-quick-start-py */

.tab-set {
  border-bottom: 1px solid var(--darkg);
}

Maybe the css library that does the tabs also has a way to do this natively, which would probably be preferable.

GaelVaroquaux avatar Jul 09 '24 17:07 GaelVaroquaux

@jeromedockes : tell us when this is ready for merge

GaelVaroquaux avatar Jul 09 '24 20:07 GaelVaroquaux

I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?

GaelVaroquaux avatar Jul 10 '24 08:07 GaelVaroquaux

I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?

yes except a few small details (handling a polars deprecation warning that was causing the CI to fail, using a pytest fixture I needed for this PR in 2 other places where it simplified the test). But 2.5k lines of diff are in the pixi.lock due to adding the matplotlib and jinja dependencies

jeromedockes avatar Jul 10 '24 08:07 jeromedockes

@jeromedockes : tell us when this is ready for merge

ok I think probably today, I'm just adding a few docstrings in the private modules

jeromedockes avatar Jul 10 '24 08:07 jeromedockes

But 2.5k lines of diff are in the pixi.lock due to adding the matplotlib and jinja dependencies

and another 2k lines in the package-lock.json of the javascript module used to test the reports. As those package jsons are quite big we may want to consider using the MANIFEST.in to exclude them from the package distribution

jeromedockes avatar Jul 10 '24 08:07 jeromedockes

I'm fine with moving the README to the docs later: I've added an issue here: https://github.com/skrub-data/skrub/pull/984

GaelVaroquaux avatar Jul 10 '24 20:07 GaelVaroquaux

ok I'll have another look tomorrow but I think I have addressed most comments from @GaelVaroquaux

jeromedockes avatar Jul 11 '24 15:07 jeromedockes

Do you want to work on your on-line visualizer and add it in the docs, or should we rather merge before (probably a good idea)?

GaelVaroquaux avatar Jul 15 '24 11:07 GaelVaroquaux

I think we can merge this PR.

If at some point we are happy enough with the online visualizer to add it to the docs we can do another PR. It might be a good idea to wait for the online thing to advertise or release the reports, but not to merge the PR IMO.

I also want to exclude the js_tests/ folder from the distribution package in a MANIFEST.in (it saves a few kb in the wheel), but i'll do that in a dedicated PR because there are other files we may want to exclude -- eg we can divide the size of the source tarball by > 5 if we exclude the parquet files in benchmarks/

jeromedockes avatar Jul 15 '24 12:07 jeromedockes

Failing test (a quick look suggests that it is related to the PR).

Playing with the report, I just had an idea of what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined, I'd say .9), add in the drop-down menu that selects columns to display an entry named "very similar columns" that would only select columns that have a similarity measure with another column above the threshold.

GaelVaroquaux avatar Jul 15 '24 13:07 GaelVaroquaux

Failing test (a quick look suggests that it is related to the PR).

yes I'm on it :) I made a small change on how labels are rotated to prevent overlap but apparently used something the oldest supported matplotlib doesn't like

Playing with the report, I just had an idea of what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined, I'd say .9), add in the drop-down menu that selects columns to display an entry named "very similar columns" that would only select columns that have a similarity measure with another column above the threshold.

I like it! Indeed it's just a question of adding a list of those column names in a dictionary somewhere so I'll do it in this PR

jeromedockes avatar Jul 15 '24 14:07 jeromedockes

I like it! Indeed it's just a question of adding a list of those column names in a dictionary somewhere so I'll do it in this PR

Awesome. I definitely see myself using this functionality

GaelVaroquaux avatar Jul 15 '24 14:07 GaelVaroquaux

Awesome. I definitely see myself using this functionality

any suggestion on a short description of those columns for the drop-down? "Columns with high similarity"?

jeromedockes avatar Jul 15 '24 14:07 jeromedockes

"Columns with high similarity"?

:+1:

GaelVaroquaux avatar Jul 15 '24 14:07 GaelVaroquaux