Data Profiling UI
Develop a web-based data profiling UI where analysts can drag and drop a CSV/Parquet file or connect to a publicly accessible URL and get interactive column profiles. For each column, the UI should show a histogram, bar chart, or time series chart, depending on the data type. Support interactive cross-linking between all charts. Persist the state of the application in the URL so that a link to a specific insight can be shared easily. The UI could be extended with support for custom – potentially multi-dimensional – charts (via writing vgplot YAML/JSON). Make the UI embeddable so it could be used in a Jupyter notebook with https://anywidget.dev/.
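To make two of the requirements above concrete (picking a chart per column type, and persisting state in the URL), here is a rough sketch. Everything in it is an assumption for illustration, not part of Mosaic or vgplot: the type names, the `chart_kind` helper, the `state` query parameter, and the fallback to a bar chart for unknown types are all hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical mapping from a coarse column type to a default chart kind.
CHART_FOR_TYPE = {
    "numeric": "histogram",
    "categorical": "bar",
    "temporal": "timeseries",
}


def chart_kind(column_type: str) -> str:
    """Return the default chart for a column type (assumed fallback: bar chart)."""
    return CHART_FOR_TYPE.get(column_type, "bar")


def encode_state(state: dict) -> str:
    """Serialize app state to a URL query string under one 'state' parameter."""
    # Compact JSON with sorted keys keeps the resulting link short and stable.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return urlencode({"state": payload})


def decode_state(query: str) -> dict:
    """Restore app state from a query string; empty dict if none is present."""
    params = parse_qs(query)
    if "state" not in params:
        return {}
    return json.loads(params["state"][0])


# Example: a shared link that remembers the data source and one brush selection.
state = {
    "source": "https://example.com/data.parquet",  # placeholder URL
    "selections": {"duration": [0, 120]},
}
query = encode_state(state)
restored = decode_state(query)
```

A real implementation would probably want something more compact than raw JSON for large states (for example a compressed, base64-encoded payload), but the round trip is the important part: whatever goes into the link must restore the same selections.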
I'm working on this issue.
Inspo: https://motherduck.com/blog/introducing-column-explorer/
(They ended up using my class's homework: https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb)
Another inspiration: https://github.com/cmudig/AutoProfiler which is based on some designs from https://www.rilldata.com.
@JessamineQ is working on this in https://github.com/cmudig/mosaic-profiler.
Please take a look at https://perspective.finos.org/. It has very nice table and chart components for quickly building interactive data analysis tools.
https://github.com/manzt/quak by @manzt is a really cool table with histograms, and it's a profiling UI built on top of Mosaic.
These are amazing, @domoritz!
This is ironic, but Motherduck.com has featured the data I hastily made for a class at Princeton last year:
https://motherduck.com/blog/introducing-column-explorer/
It’s in all of their onboarding flows in the web-based user interface.
Unfortunately it is closed-source, but now we have some great tools to make an open-source version happen for our work at @onefact exploring hospital prices :)
If anyone is curious, here’s the first week’s homework for the data thinking class I taught last year:
https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb
We now need to profile data at 2-petabyte scale, so if anyone has done this and has ideas, we are very open to collaboration and contribution (and sponsorship / public benchmarks for graph query planning algorithms; we are discussing this with the neo4j, kuzu, AWS Neptune, and @duckdb teams!).
Also happy to add anyone to our Slack if you email me! I will be live streaming some data profiling UI work next week.
I'm closing this now that https://github.com/cmudig/mosaic-profiler and https://github.com/manzt/quak exist. There is definitely space for more tools and improvements, but we don't need to track them here anymore.