mosaic Data Profiling UI

Develop a web-based data profiling UI where analysts can drag and drop a CSV/Parquet file or connect to a publicly accessible URL and get interactive column profiles. For each column, the UI should show a histogram, bar chart, or time series chart depending on the data type. Support interactive cross-linking between all charts. Persist the state of the application in the URL so that one can easily share a link to a specific insight. The UI could be extended with support for custom (potentially multi-dimensional) charts written as vgplot YAML/JSON. Make the UI embeddable so it can be used in a Jupyter notebook via https://anywidget.dev/.
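
As a rough illustration of two of the requirements above (choosing a chart per data type and persisting state in the URL), here is a minimal TypeScript sketch. Everything in it is an assumption made for illustration: the names `chartFor`, `ProfilerState`, `encodeState`, and `decodeState`, the three column kinds, and the query-string layout are hypothetical and not part of Mosaic or vgplot.

```typescript
// Hypothetical sketch; none of these names come from Mosaic or vgplot.
type ColumnKind = "numeric" | "categorical" | "temporal";
type ChartKind = "histogram" | "bar" | "timeseries";

// Pick a default chart for a column based on its data type.
function chartFor(kind: ColumnKind): ChartKind {
  switch (kind) {
    case "numeric":
      return "histogram";
    case "categorical":
      return "bar";
    case "temporal":
      return "timeseries";
  }
}

// Shareable state (assumed shape): the data source plus one
// brushed [min, max] range per column name.
interface ProfilerState {
  source: string;
  brushes: Record<string, [number, number]>;
}

// Serialize the state into a query string that can be appended to a URL.
function encodeState(state: ProfilerState): string {
  const source = encodeURIComponent(state.source);
  const brushes = encodeURIComponent(JSON.stringify(state.brushes));
  return `source=${source}&brushes=${brushes}`;
}

// Restore the state from a query string produced by encodeState.
// Unknown keys are ignored; missing keys keep their empty defaults.
function decodeState(query: string): ProfilerState {
  const state: ProfilerState = { source: "", brushes: {} };
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    if (eq < 0) continue;
    const key = pair.slice(0, eq);
    const value = decodeURIComponent(pair.slice(eq + 1));
    if (key === "source") state.source = value;
    if (key === "brushes") state.brushes = JSON.parse(value);
  }
  return state;
}
```

A real implementation would also need to validate the decoded JSON and keep the URL in sync as selections change; this sketch only covers the round trip.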

May 23 '24 14:05 domoritz

I'm working on this issue.

May 28 '24 03:05 JessamineQ

Inspo: https://motherduck.com/blog/introducing-column-explorer/

(They ended up using my class' homework: https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb)

May 28 '24 12:05 jaanli

Another inspiration: https://github.com/cmudig/AutoProfiler which is based on some designs from https://www.rilldata.com.

May 28 '24 16:05 domoritz

@JessamineQ is working on this in https://github.com/cmudig/mosaic-profiler.

Jun 07 '24 15:06 domoritz

Please take a look at https://perspective.finos.org/. It has very nice table and chart components for quickly building interactive data analysis tools.

Jun 16 '24 10:06 aszenz

https://github.com/manzt/quak by @manzt is a really cool table with histograms, and it's a profiling UI built on top of Mosaic.

Jul 26 '24 11:07 domoritz

These are amazing, @domoritz!

Ironically, MotherDuck has featured the data I hastily put together for a class at Princeton last year:

https://motherduck.com/blog/introducing-column-explorer/

It’s in all of their onboarding flows in the web-based user interface.

Unfortunately it is closed source, but now we have some great tools to build an open-source version for our work at @onefact exploring hospital prices :)

If anyone is curious, here’s the first week’s homework for the data thinking class I taught last year:

https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb

We now need to profile data at 2-petabyte scale, so if anyone has done this and has ideas, we are very open to collaboration and contributions (and to sponsorship or public benchmarks for graph query planning algorithms; we are discussing this with the neo4j, kuzu, AWS Neptune, and @duckdb teams!).

I'm also happy to add anyone to our Slack if you email me! I will be live streaming some data profiling UI work next week.

Jul 26 '24 14:07 jaanli

I'm closing this now that https://github.com/cmudig/mosaic-profiler and https://github.com/manzt/quak exist. There is definitely space for more tools and improvements, but we don't need to track it here anymore.

Sep 01 '24 16:09 domoritz