How to make Facets Overview less laggy on Jupyter notebook?
Hey guys, I'm loving the functionality of facets overview.
However, I tried using it on my company's data in a Jupyter notebook and everything is very laggy. I've set `--NotebookApp.iopub_data_rate_limit=1.0e11`
but it's still very slow. The training set is 150K rows by 100 columns and the test set is 25K rows by 100 columns.
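For reference, the same limit can also be set in `~/.jupyter/jupyter_notebook_config.py` instead of on the command line (a config sketch; the value is just the one I'm using above):

```python
# jupyter_notebook_config.py -- raise the iopub data rate limit so large
# Facets Overview payloads are not throttled by the notebook server.
# `c` is the config object Jupyter injects into this file.
c.NotebookApp.iopub_data_rate_limit = 1.0e11
```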
Any tips on how to make things less laggy? Or are these larger datasets the current limit of facets on jupyter? I still have a lot of RAM available but facets doesn't seem to be using it.
Where does the lagginess come about? Is it in scrolling through the Overview tables?
Can you try cutting the number of columns to see how that affects the speed? Separately you could try cutting the number of examples in the training set and see what that does for lagginess. These will be good data points for figuring out what is the root cause.
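One quick way to run those ablations is to slice the DataFrame before generating the feature stats (a minimal sketch with placeholder sizes and column names, not the actual dataset; the point is to vary rows and columns independently):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the large training set
# (sizes shrunk so the sketch runs quickly).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(1000, 100)),
                     columns=[f"feat_{i}" for i in range(100)])

# Cut the column count in half and subsample rows, so we can see
# whether the lag scales with columns, rows, or both.
cols_subset = train.columns[:50]
train_small = train[cols_subset].sample(n=200, random_state=0)

print(train_small.shape)  # (200, 50)
```

Feeding `train_small` (instead of `train`) to the stats generator at a few different row/column counts should show which dimension drives the lag.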
Yes the lagginess comes from interacting with the Overview tables, so not just scrolling but selecting dropdowns and checkboxes. It also slows down the rest of the notebook when scrolling or selecting paragraphs. The notebook returns to normal performance once I stop displaying Overview.
I've tried reducing the training set to 20K rows by 100 columns but things are still laggy. I'll also try reducing the number of columns, but at this point Overview will become less useful for ranking problematic features.
Assuming reducing the columns speeds it back up, I'll do some profiling of the performance and see what we can do to speed it up when there's a large number of columns. Let me know how much that affects performance for you.
Thanks that would be great!
I tried reducing the number of columns. For these tests the training set has 20K rows and test set has 25K rows.
- Reducing to 50 columns still produces large lagginess
- Reducing to 25 columns, Overview is less laggy but still hard to use interactively
- Reducing to 20 columns there is still a noticeable lag but Overview can be used interactively
- Reducing to 15 columns, performance comes close to that of the Overview demo Jupyter notebook, which is pretty smooth. The demo has 15 columns as well.
Note that most of the columns are numeric.
Upon first profiling, the issue seems to be that mouse events (like scrolling, or even clicking the sort-by menu) are causing `_eventToProcessingFunction` to be called on each plottable.js chart in the tables (possibly including ones no longer being rendered because they are off-screen/not scrolled to), which leads to `_measureAndDispatch` and `computePosition`.
Need to figure out two things: a) why any interaction is causing the charts to recompute position b) why that recomputation is happening for charts that aren't rendered
I repro'd this by running `bazel run facets_overview/functional_tests/stress:devserver`, going to localhost:6006/facets-overview/functional-tests/simple/index.html, and then clicking around while using the Chrome profiling tools.
Hi,
Seems it's exactly the same story for me as for @ernestchancivitas (no, we are not working in the same office :) ). I wonder if there is a way to use Facets Overview without Jupyter and whether that would help?
I wonder if this issue is being addressed?
For those interested: I am rendering over 600 columns, and the facets-overview protobuf objects are constructed using MapReduce and fetched from S3. I had to implement pagination on the Python side, and it seems to resolve the issue.
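A column-pagination scheme like the one described above can be sketched roughly as follows (an illustration only, assuming a pandas DataFrame; `column_pages` is a made-up helper, not part of Facets):

```python
import numpy as np
import pandas as pd

def column_pages(df, page_size=25):
    """Yield successive column-wise slices of df, page_size columns each.

    Each page can be turned into its own stats proto and rendered
    separately, so the front end only ever draws a bounded number
    of charts at once."""
    for start in range(0, df.shape[1], page_size):
        yield df.iloc[:, start:start + page_size]

# Toy frame standing in for the 600-column dataset.
df = pd.DataFrame(np.zeros((10, 600)),
                  columns=[f"c{i}" for i in range(600)])
pages = list(column_pages(df, page_size=25))
print(len(pages), pages[0].shape)  # 24 (10, 25)
```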
I actually made a fix that should help with this months ago (see https://github.com/PAIR-code/facets/commit/6c5c0174a2650aaba64c83f87b3ec620431f65b0#diff-a06690add7f8ba3f9836226f414d19e6) but failed to update the compiled facets-dist/facets-jupyter.html on master with the fix.
Now I have an updated dist file on branch build_fix but need to test it before pulling it to master.
@sjang92 would you mind pulling down the build_fix branch and reinstalling the jupyter extension from that branch (make sure the new facets-jupyter.html overwrites the old one you have in your extension), and let me know if this helps with performance with large numbers of columns (and also if it is displaying correctly in general)?
@jameswex Thank you! I can see that the fix drastically improves performance. However, in my case some of the string features have very high cardinality (hashed features, for instance), which I believe slows down rendering due to the large number of entries in valueFreq / histogram buckets. Have you done any performance testing regarding this? I'm wondering if there's some kind of performance benchmark categorized by number of columns and their cardinalities. But of course, since this is a more feature-space-specific problem, as a quick solution I'm preventing these features from being rendered together with the other features for now.
What is the cardinality of those string features? We usually cap the number of RankHistogram buckets for any feature at 1000 when generating feature stats to avoid having too much to render when there isn't much valuable information to the histogram in that case anyways.
Perhaps the visualization itself could guard against this by only displaying up to some max number of buckets for those CDF charts.
For these kinds of features, I'd want to prevent them from being loaded into facets-overview in the first place. I decided to trim the sorted list of buckets / values and frequencies to 1000 for now. FYI, this happened because we had some use cases where we wanted to visualize some raw data as well. Thank you so much for your help!
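The trimming described above amounts to a top-k cut over the value counts before building the stats (an illustrative sketch; `top_k_buckets` is a hypothetical helper, not a Facets API, and the cap of 1000 is the one mentioned in this thread):

```python
from collections import Counter

def top_k_buckets(values, k=1000):
    """Keep only the k most frequent values.

    For very high-cardinality features (e.g. hashed strings), the
    long tail of rare values carries little information but is
    expensive to render, so we drop it before building histograms."""
    counts = Counter(values)
    return counts.most_common(k)

# Toy feature: a few common values plus a tail of near-unique hashes.
values = ["a"] * 5 + ["b"] * 3 + ["c"] + [f"hash_{i}" for i in range(10)]
buckets = top_k_buckets(values, k=3)
print(buckets)  # three (value, count) pairs, most frequent first
```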
@sjang92, @ernestchancivitas & @arassadin
I'm Mahima Pushkarna, the design lead on Facets. As we continue to improve Facets, I would like to know more about whether and how you used Facets. Would you be open to individually sharing your experiences over a short video chat?
Thanks! Mahima