Kedro-Viz to show preview of data
Description
Kedro-Viz supports Plotly, and Plotly has cool tables: https://plotly.com/python/table/
The idea is simply to show the first 5-10 rows of a dataset on Kedro-Viz.
Implementation
Since we already support Plotly, this would be easy to do: we just read the first 5 rows of the data and display them as a table.
There is an argument that loading so many datasets might make Kedro-Viz slow, but loading only happens when the metadata panel is clicked, which is one dataset at a time. Also, on the Kedro side we could let users specify which datasets they want to preview on Kedro-Viz via catalog.yml, e.g. preview = true.
Would love this!
One note on implementation - we need a workflow to avoid opening enormous files for no reason.
- The situation I'm worried about is specifically a pandas.CSVDataSet being 1 begillion rows long and us loading all of that for 5 rows of data.
- For spark.SparkDataSet we can append a .limit(5) on there to avoid this.
@datajoely I think we should add an optional head API to Kedro datasets if we were to do this. This allows Viz to preview beyond pandas or Spark and avoids performance bottlenecks. The thing that knows how to optimise head is the dataset implementation, not Viz.
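To make that concrete, here is a rough sketch of what such an optional head API could look like, written against a made-up CSV dataset class (this is not real Kedro code); a Spark-backed dataset could implement the same method with .limit(n) instead of nrows:

```python
# Purely illustrative sketch of an optional head() API on a dataset.
# The class is made up for this example; only the idea matters:
# the dataset itself knows the cheapest way to fetch n rows.
import pandas as pd
from kedro.io import AbstractDataSet


class HeadableCSVDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}

    def head(self, n: int = 5) -> pd.DataFrame:
        # For CSV, read only the first n rows rather than the whole file
        return pd.read_csv(self._filepath, nrows=n)
```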
Yeah agreed
I like this idea and have thought about similar schemes in the past. So since you've brought it up here, let me dump some thoughts I had before here also...
Two basic questions:
- is plotly the right thing to use for this? It's a good option since we have it already available, but maybe there are better libraries out there for handling tables (e.g. it doesn't look like plotly would handle many hundreds of columns well? Which is not at all uncommon in a kedro pipeline)
- how general should we make this? As per @limdauto's comment, maybe we have a general head method that can be used for any dataset. Could we incorporate the current behaviour for matplotlib and plotly datasets into this more generic mechanism? Going beyond a dataset preview, what if I don't want to show the first n rows but would rather just show the size of the dataframe (rows and columns) in the metadata side panel? (which seems equally useful to me and maybe more practical for large dataframes)
Just using plotly for pandas and/or spark dataframes would be totally great for an MVP and to get user feedback, but I just want to brainstorm how we might want to make this more generic in the longer term.
The question of adding custom properties to datasets comes up quite a bit, e.g. https://github.com/kedro-org/kedro-viz/issues/662 (put number of rows in dataset on kedro-viz), https://github.com/quantumblacklabs/private-kedro/issues/1148 (add metadata to catalog entries that can be consumed by plugins), https://github.com/kedro-org/kedro/issues/1076 (very long-standing issue on how to add metadata to catalog entries). This is not just limited to kedro-viz; there's a more general kedro question of how to attach metadata to a catalog entry. Let me just focus on the kedro-viz question here though.
https://github.com/kedro-org/kedro-viz/issues/662#issuecomment-984506222 spells out my rough idea for this: user-customisable dataset widgets. This is quite similar to the idea of kedro-viz extensions, only:
- these widgets are shown in the metadata panel rather than a whole new screen (which has both pros and cons but basically means there's much more limited space for them)
- widgets are lighter weight and more restricted in how they must be written (unlike an extension, it doesn't start its own server etc.)
As a user, I might want to keep track of lots of different things about a dataset: number of rows/columns, number of unique entries in a particular column, number of N/As, etc. Enabling something that visualises the number of rows in a dataset of type pandas.* is just one particular example of this - in reality I might like to track any sort of thing for any sort of dataset. Let me call this a "trackable". In the future I think there should be two possible methods for this:
- via experiment tracking - this is already work in progress. You can write code to calculate whatever trackable you like in a node and then save it to a tracking dataset. Crucially this will give you a sense of how the trackable changes between one kedro run and the next, since I should be able to go back in time and visualise the pipeline and datasets of historic runs.
- some kind of customisable "widget" which allows me to give, in the catalog, as many trackables as I like, e.g. (completely made up example syntax):
shuttles:
  type: pandas.CSVDataSet
  filepath: ...
  viz_widgets:
    - number_of_rows
    - number_of_na: [column1, column2, column3]
    - my_custom_widget
Where we supply a few common widgets with Kedro-Viz, like number_of_rows, but a user can define their own my_custom_widget too, so it's very flexible. The natural place for this information to be shown on Kedro-Viz would be the side panel on the right-hand side that appears when you click on a dataset. But it would be super cool if somehow we could make the pipeline visualisation customisable with user-pluggable widgets too.
According to this scheme, previewing the first 5 rows of a dataset would be some kind of dataframe_head: {rows: 5} widget that we provide within kedro-viz. This could even be automatically applied to all the datasets of the right type. There could be some kind of marketplace for user-defined widgets (small javascript apps I guess?).
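Purely to make the widget idea concrete, here is one hypothetical shape user-defined widgets could take on the Python side; the register_widget helper, the registry and the payload format are all invented for illustration and nothing like this exists in kedro-viz today:

```python
# Hypothetical illustration of user-defined kedro-viz dataset widgets.
# The registry, decorator and payload format are invented for this sketch.
from typing import Callable, Dict

import pandas as pd

WIDGET_REGISTRY: Dict[str, Callable] = {}


def register_widget(name: str):
    """Register a function that turns loaded data into a small JSON-friendly
    payload that a front-end widget could render in the metadata panel."""
    def decorator(func: Callable) -> Callable:
        WIDGET_REGISTRY[name] = func
        return func
    return decorator


@register_widget("number_of_rows")
def number_of_rows(data: pd.DataFrame) -> Dict[str, int]:
    return {"rows": len(data)}


@register_widget("number_of_na")
def number_of_na(data: pd.DataFrame, columns=None) -> Dict[str, int]:
    columns = columns if columns is not None else data.columns
    return {col: int(data[col].isna().sum()) for col in columns}
```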
Is the idea of a marketplace of custom widgets for kedro-viz datasets a huge overkill for this? At the moment, absolutely yes. We could achieve what @rashidakanchwala describes much more simply, and at the moment I think kedro-viz extensions would be better to work on than dataset widgets. But it's worth thinking about where this might end up in future, since it might spark other people's ideas and potentially affects design decisions up front, e.g.:
Also maybe on Kedro we can allow users to specify which datasets they want to preview on Kedro-viz using catalog.yml preview = true
This seems too ad-hoc and hacky to me, like the current implementation of layer which is a dataset property but only really used by kedro-viz. So if we end up with lots of such parameters I think we should consider exactly where they should live so that catalog entries don't become too bloated.
The exploration of showing dataset statistics by @GabrielComymQB:

Notes from Technical Design session:
The team discussed a possible solution for previewing data in Viz, both in the metadata panel and the experiment tracking panel.
Some questions were raised around the goal of showing a preview:
- Do we want to show just a preview of the data, or perhaps insights (e.g. # of columns, mean, median..)?
- Should users be able to customise what is shown in such a preview?
The consensus was that a blanket preview showing the first 5-10 rows wouldn't be useful for all data, and thus the preview should be customisable.
Possible solution:
The solution discussed in the meeting is adding a _preview() method to datasets that specifies how the data should be displayed on the Viz side. This _preview() method will be customisable, so if a user doesn't like the default implementation they can override it to suit their needs. The result will be displayed in the metadata and experiment tracking panels.
A downside of this solution is that we would essentially be adding visualisation-specific code to the framework side, blurring the boundaries between Kedro-Viz and the Kedro framework. But the _preview() method could be useful in a Jupyter flow as well.
Follow up questions/actions:
- [ ] What types of data would the _preview() method return? What are the optimal types to display data in Viz? (one possible shape is sketched below)
- [ ] Specifically, users have expressed the need to log CSV data; what do they want to see from this CSV data?
- [ ] Are there any other solutions, perhaps with more of the heavy lifting on the Viz side, that would solve this issue?
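On the first question, one possible answer (purely illustrative, not a decision) is for the preview method to return something that is already JSON-serialisable, e.g. a columns/data structure that the Viz frontend can render as a table:

```python
# Illustrative only: one possible return shape for a preview method,
# a JSON string containing "columns", "index" and "data" keys that the
# frontend can render as a table.
import pandas as pd


def preview(data: pd.DataFrame, nrows: int = 5) -> str:
    # DataFrame.to_json handles type conversion (dates, NaN, numpy types)
    return data.head(nrows).to_json(orient="split")


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(preview(df))  # JSON with columns, index and data keys
```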
A few more thoughts on the preview method approach. Let's say that we solve the question of what types of data preview can return (shouldn't be too hard) and are happy with this living on kedro framework as a new dataset method (I'm more sceptical here). Here's a possibly representative example of what someone might want to do:
- for some pandas.CSVDataSets in their pipeline, show the number of rows
- for some other pandas.CSVDataSets in their pipeline, show the first 5 rows
The simplest way to implement this would be for the user to write two new sorts of dataset, something like this:
from kedro.extras.datasets import pandas  # assuming the kedro.extras.datasets import path

class CSVDataSetWithNumberOfRows(pandas.CSVDataSet):
    def preview(self):
        return len(self._load())

class CSVDataSetWithHead(pandas.CSVDataSet):
    def preview(self):
        return self._load().head()
Then in the catalog file you need to change the relevant dataset type from pandas.CSVDataSet to path.to.CSVDataSetWithNumberOfRows and path.to.CSVDataSetWithHead.
This seems quite unsatisfactory:
- it feels heavy-handed to require a new dataset class just to alter how preview renders in kedro-viz. The load/save behaviour of the dataset is what really matters in kedro, and that's the same for all these classes
- it doesn't scale well: even if you want every pandas.CSVDataSet to preview the same way, you have to change the type for all your catalog entries (might eventually be solved by improvements to the kedro config system)
Fundamentally I think the problem here is that datasets are not easily composed. I cannot easily "mix in" a new behaviour without creating a whole new class. @limdauto mentioned once that Dmitrii had prototyped some new component-based dataset architecture that looks more like my widgets example above. This might be a major change to how kedro datasets work though, which I don't think is on the cards for the foreseeable future.
In reality, is this a problem? Possibly not; maybe we just hard code a sensible default preview into pandas.CSVDataSet and only a few advanced users who are happy writing custom classes would even think of trying to change this. If we value a user being able to customise the preview behaviour then a dataset preview method does feel awkward to me though.
Problem is, I'm not sure I have a better alternative... Maybe hooks + a viz.yml config file somehow? Certainly this would keep the functionality on the kedro-viz side much more. Let me ponder this and write it up as an alternative proposal.
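For illustration only, here is a very rough sketch of what such a Viz-side approach might look like: an invented viz.yml maps dataset names to a preview spec, and kedro-viz does the loading and slicing itself. None of this exists today and the file name and keys are made up:

```python
# Very rough sketch of a viz-side alternative. The viz.yml file and its
# keys are invented for illustration; only catalog.load() is real Kedro API.
#
# viz.yml (hypothetical):
#   previews:
#     shuttles:
#       kind: head
#       rows: 5
#     reviews:
#       kind: shape
import yaml
from kedro.io import DataCatalog


def build_previews(catalog: DataCatalog, viz_config_path: str) -> dict:
    with open(viz_config_path) as f:
        config = yaml.safe_load(f) or {}

    previews = {}
    for name, spec in config.get("previews", {}).items():
        data = catalog.load(name)  # only load datasets the user opted in
        if spec.get("kind") == "head":
            previews[name] = data.head(spec.get("rows", 5))  # assumes tabular data
        elif spec.get("kind") == "shape":
            previews[name] = {"rows": data.shape[0], "columns": data.shape[1]}
    return previews
```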
I think a [tool.kedro.viz] section in pyproject.toml would be helpful, you know. In fact, everything in the settings modal could be pre-defined there?
Hi team,
I was thinking maybe the _preview method could live in Viz, since it is a Viz implementation detail. Then within the Kedro project's catalog.yml we could define it like below, so that Viz knows how and what to handle for different datasets:
feature_engineering_output:
  type: pandas.CSVDataSet
  filepath: ${base_location}/04_feature/feature_importance_output.csv
  layer: feature
  preview:
    enable: true
    showRows: 5
@MerelTheisenQB , @datajoely , @tynandebold , @idanov
What about adding preview logic to the AbstractDataSet class? And then also implementing it for the pandas and spark datasets today?
pandas -> .head(5)
spark -> .limit(5).toPandas().head()
Notes from Technical Design session:
- We'll go with the use of transcoding and the @preview symbol to denote in the catalog that this dataset will be both a normal dataset and have a preview attached to it.
- In the Viz UI we'll only load the data on click when the metadata panel is rendered
A question: what icon would we have for a node with a data preview inside it?
- We need to come up with a different way to show that this dataset has more information
- If a dataset has multiple pieces of information, the icon could have some layers if there are multiple things to show
I was going to create an implementation ticket for the above but wanted to check the following first:
Are we going to use Plotly for this, or are we going to let design create the look & feel and then decide which charting library we will use for the tables based on that?
Does Plotly JS have a table type that we can use here? I'm not sure it does.
If not, we could just create an HTML table ourselves first and then pair with design to polish it off. Or are you thinking we need something more complex than just a static table?
It does, but there are better ones out there which support things I think our users would like: filtering, Excel export, text search, etc.
...Question for you all: is it worth adding the extra library overhead beyond Plotly?
That first link is for Python though. We would need the JS version, which they do have, though it doesn't really give you much.
I think we should get a basic implementation in there first and then add functionality later.
Closing this ticket as design and implementation work for the feature is mentioned on ticket #1136
Update - I had a discussion with @merelcht; the preview function will be written on the Kedro side. We are unsure if it's only a preview, or whether we also share metadata information (number of rows/columns etc.).
I am reopening this ticket as the front-end design is done but there are still ongoing discussions around implementation.
This work will touch Kedro datasets as well as the backend and frontend of Viz.
The first dataset we should add a preview method to is pandas.CSVDataSet.
For the frontend work, the design was done in #1136, so check there for reference.
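As a sketch of the Viz backend side (purely illustrative, not the agreed design), the preview method could be called defensively, so that datasets which don't define one, or whose preview raises, never break the metadata panel:

```python
# Illustrative only: how the Viz backend might call an optional preview
# method without assuming every dataset implements it.
def get_dataset_preview(dataset):
    preview_fn = getattr(dataset, "_preview", None)  # method name is assumed
    if not callable(preview_fn):
        return None
    try:
        return preview_fn()
    except Exception:
        # A broken preview should degrade gracefully, not break the panel
        return None
```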