[WIP] Add `skrub.Report`
Nitpick: I think that it should be called "TableReport" :)
Nitpick: I think that it should be called "TableReport" :)
Good point -- we may want to add other kinds of reports, e.g. on models.
Hi Jérôme,
I'm soooo much looking forward to this!
When you are getting close to the review, can you please make sure that we have an example that shows the report. It will help reviewing as well as broadcasting the functionality.
thanks!!
For now I have added one in example 01; we can see it in the CircleCI artifact.
I think we're not too far from being able to start the reviews. I'm just making a couple of adjustments for old versions of pandas and matplotlib, and I still need to add tests; for now they are only smoke tests.
I have a question: the reports require matplotlib and jinja2; should those be required or optional dependencies? In my experience it is a bit annoying when dependencies for a rather central feature are optional, because you have to hit the ImportError and install the missing package manually, and typical data-science environments will have matplotlib and probably jinja. On the other hand, most of skrub doesn't need them, so it may be bad to ask users who don't need the reports to install them.
It would be nice if the package metadata could distinguish between "required" and "default" dependencies ...
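If we do end up making them optional, one option is to check for them lazily and raise an informative error instead of a bare ImportError. A minimal sketch, assuming a hypothetical helper (this is not skrub's actual implementation, and the message wording is made up):

```python
import importlib.util


def check_report_dependencies():
    """Raise an informative error if the report dependencies are missing.

    Hypothetical helper for illustration; the real skrub code may handle
    optional dependencies differently.
    """
    missing = [
        module
        for module in ("matplotlib", "jinja2")
        if importlib.util.find_spec(module) is None
    ]
    if missing:
        raise ImportError(
            "Generating reports requires the optional dependencies "
            f"{', '.join(missing)}. Install them with: "
            f"pip install {' '.join(missing)}"
        )
```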
the reports require matplotlib and jinja2; should those be required or optional dependencies?
I guess required
In my experience it is a bit annoying when dependencies for a rather central feature are optional, because you have to hit the ImportError and install the missing package manually, and typical data-science environments will have matplotlib and probably jinja.
Right, but they may not be needed on production servers. Let's worry about this later
With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom
Also, in terms of examples, I don't love that the TableReport somewhat clutters an example that is unrelated to it.
How about a new example called "Quick start to skrub" that's quite short, shows the TableReport and the tabular_learner, and explains that the tabular_learner creates pipelines with the TableVectorizer?
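To make the suggestion concrete, here is a rough sketch of what such a quick-start example might contain (the dataset, the exact TableReport constructor, and the way it displays are assumptions; the final example may of course differ):

```python
# Hypothetical outline of a "Quick start to skrub" example.
from skrub import TableReport, tabular_learner
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
df = dataset.X

# 1. Get an interactive overview of the dataframe.
report = TableReport(df)
report  # in a notebook or gallery example, this displays the HTML report

# 2. Build a baseline predictive pipeline; tabular_learner assembles a
#    pipeline with a TableVectorizer followed by a gradient-boosted model.
model = tabular_learner("regressor")
```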
With regards to the tooltips, can you make them appear without waiting, as such: https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_bottom
Or maybe you'll need to have them appear below, to stay in the div or the report, as the enclosing environment may hide overflow. This can be done as such https://www.w3schools.com/css/tryit.asp?filename=trycss_tooltip_arrow_top
Note: to try the command-line script, you probably need to install skrub again, even if it was installed in editable mode:
pip install -e .
skrub-report /my/file.parquet
Another comment (sorry, I'm reviewing in short bursts): should we add TableReport to the API page?
With regards to the tooltips, can you make them appear without waiting
I made this change, but looking at them again I'm not sure they are really useful; we could consider removing them altogether.
Can you add section titles to the examples: one section title per feature that you highlight
With regards to the tooltips, can you make them appear without waiting
I made this change, but looking at them again I'm not sure they are really useful; we could consider removing them altogether.
I think that they are super useful. I would favor keeping them: they help understand what is going on, but do not clutter.
I think that they are super useful. I would favor keeping them: they help understand what is going on, but do not clutter.
ok then, let's keep them!
I wonder whether it would be useful to add a border as a separation: the problem that I see is that the bottom of the object may not be well separated from what follows below. This happens in the documentation:
I see two ways of doing it:
- Full borders all around
.tab-set {
border: 1px solid var(--darkg);
border-top: 0px;
border-radius: 5px;
}
- Border only on the bottom, to be lighter:
/* Element | https://output.circle-artifacts.com/output/job/0776ea43-c78d-4fc9-ae32-14fa1899a4f7/artifacts/0/doc/auto_examples/00_quick_start.html#sphx-glr-auto-examples-00-quick-start-py */
.tab-set {
border-bottom: 1px solid var(--darkg);
}
Maybe the CSS library that does the tabs also has a way to do this natively, which would probably be preferable.
@jeromedockes : tell us when this is ready for merge
I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?
I wonder if this PR has been mixed with other things: are all lines in the diff related to the PR?
Yes, except for a few small details (handling a polars deprecation warning that was causing the CI to fail, and using a pytest fixture I needed for this PR in 2 other places where it simplified the tests). But 2.5k lines of the diff are in the pixi.lock, due to adding the matplotlib and jinja dependencies.
@jeromedockes : tell us when this is ready for merge
OK, I think probably today; I'm just adding a few docstrings in the private modules.
But 2.5k lines of diff are in the pixi.lock due to adding the matplotlib and jinja dependencies
and another 2k lines in the package-lock.json of the JavaScript module used to test the reports. As those package JSON files are quite big, we may want to consider using the MANIFEST.in to exclude them from the package distribution.
I'm fine with moving the README to the docs later: I've added an issue here: https://github.com/skrub-data/skrub/pull/984
OK, I'll have another look tomorrow, but I think I have addressed most comments from @GaelVaroquaux.
Do you want to work on your online visualizer and add it to the docs, or should we rather merge first (probably a good idea)?
I think we can merge this PR.
If at some point we are happy enough with the online visualizer to add it to the docs, we can do another PR. It might be a good idea to wait for the online version before advertising or releasing the reports, but not before merging the PR, IMO.
I also want to exclude the js_tests/ folder from the distribution package in a MANIFEST.in (it saves a few kB in the wheel), but I'll do that in a dedicated PR because there are other files we may want to exclude -- e.g. we can divide the size of the source tarball by more than 5 if we exclude the parquet files in benchmarks/.
Failing test (a quick look suggests that it is related to the PR).
Playing with the report, I just had an idea for what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined; I'd say 0.9), add to the drop-down menu that selects which columns to display an entry named "very similar columns", which would only select columns that have a similarity measure with another column above the threshold.
Failing test (a quick look suggests that it is related to the PR).
Yes, I'm on it :) I made a small change to how labels are rotated to prevent overlap, but apparently used something the oldest supported matplotlib doesn't like.
Playing with the report, I just had an idea for what I think is a useful functionality that's probably very easy to implement: if there are very similar columns (threshold to be defined; I'd say 0.9), add to the drop-down menu that selects which columns to display an entry named "very similar columns", which would only select columns that have a similarity measure with another column above the threshold.
I like it! Indeed, it's just a question of adding a list of those column names in a dictionary somewhere, so I'll do it in this PR.
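Roughly, something along these lines (a standalone sketch using pandas; the absolute Pearson correlation stands in for whatever association measure the report actually computes, and the names here are made up):

```python
import pandas as pd


def very_similar_columns(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Columns whose similarity with at least one other column reaches the threshold.

    Illustration only: "similarity" here is the absolute Pearson correlation
    between numeric columns, not the measure used in the report.
    """
    corr = df.select_dtypes("number").corr().abs()
    similar = set()
    for i, col_a in enumerate(corr.columns):
        for col_b in corr.columns[i + 1 :]:
            if corr.loc[col_a, col_b] >= threshold:
                similar.update([col_a, col_b])
    return sorted(similar)


# The resulting list would then just be stored with the rest of the report data,
# e.g. report_data["very similar columns"] = very_similar_columns(df)
# (where report_data is a hypothetical name for that dictionary).
```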
I like it! Indeed, it's just a question of adding a list of those column names in a dictionary somewhere, so I'll do it in this PR.
Awesome. I definitely see myself using this functionality
Awesome. I definitely see myself using this functionality
Any suggestion for a short description of those columns in the drop-down? "Columns with high similarity"?
"Columns with high similarity"?
:+1: