kedro-viz
Enable Kedro-Viz functionality through a notebook, without Kedro Framework.
Description
Make it possible to use Kedro-Viz (pipeline visualisation and experiment tracking) from a notebook, without the Kedro framework.
For example, I should be able to build a pipeline in a notebook with nodes that output metrics, then run %run_viz and have Kedro-Viz open with a view of my pipeline and experiments.
Context
Currently, Kedro-Viz is tightly coupled with the Kedro framework, making it impossible for non-Kedro users to use Kedro-Viz. This was highlighted as a pain point in the experiment tracking user research:
"In this case if I really like experiment tracking I might not consider using it if it isn't a kedro project... I am not sure it is a good direction to go with it being completely integrated, especially if there is a new thing like Mlflow"
Secondly, from the non-technical user research https://github.com/kedro-org/kedro-viz/issues/1280 we discovered a group of 'low-code' users who only use notebooks (e.g. Data Analysts, Junior Data Scientists, Researchers). This is a sizeable group (estimated at 70%) within data teams. Providing notebook access to Kedro-Viz would make it easier for these users to adopt it.
What's happening?
If I wanted to use Kedro-Viz in a notebook, without the Kedro framework, this would not be possible. So if I had a setup like this:
my-project
├── my-notebook.ipynb
├── Customer-Churn-Records.csv
├── parameters.yml
├── catalog.yml
└── requirements.txt
Then I’d never be able to see a pipeline visualisation, even if I had:
requirements.txt
kedro==0.18.11
kedro-viz==6.3.3
kedro-datasets[pandas.CSVDataSet]~=1.1
my-notebook.ipynb
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog
from kedro.pipeline import node, pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from typing import Dict
import logging
import pandas as pd
### Insert something new to load catalog.yml and parameters.yml
def preprocess_data(data: pd.DataFrame) -> pd.DataFrame:
    data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])
    le = LabelEncoder()
    data['Gender'] = le.fit_transform(data['Gender'])
    data = pd.get_dummies(data, columns=['Geography', 'Card Type'])
    return data

def split_data(data: pd.DataFrame, test_size: float, random_state: int) -> Dict:
    X = data.drop(columns='Exited')
    y = data['Exited']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return dict(train=(X_train, y_train), test=(X_test, y_test))

def train_model(train: Dict, random_state: int) -> RandomForestClassifier:
    X_train, y_train = train['train']
    rf_clf = RandomForestClassifier(random_state=random_state)
    rf_clf.fit(X_train, y_train)
    return rf_clf

def evaluate_model(model: RandomForestClassifier, test: Dict) -> None:
    X_test, y_test = test['test']
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)
    log = logging.getLogger(__name__)
    log.info("Model Accuracy: %s", accuracy)
    log.info("Confusion Matrix: \n%s", confusion_mat)
    log.info("Classification Report: \n%s", class_report)

my_pipeline = pipeline([
    node(preprocess_data, "customers", "preprocessed_customers"),
    node(split_data, ["preprocessed_customers", "params:test_size", "params:random_state"], "split_data"),
    node(train_model, ["split_data", "params:random_state"], "rf_model"),
    node(evaluate_model, ["rf_model", "split_data"], None),
])
%run_viz my_pipeline
It should be possible to see the following in another cell in my Jupyter notebook, with the option to open it up in another tab:
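To make the ask concrete: everything Kedro-Viz needs to draw that flowchart is the dependency graph implied by each node's declared inputs and outputs, no execution required. A minimal, framework-free sketch of that extraction (the `(name, inputs, outputs)` triples mirror the notebook pipeline above; the helper is hypothetical, not Kedro-Viz's actual API):

```python
# Hypothetical sketch: derive the dataset/node graph that a structure-only
# Kedro-Viz could render, using only each node's declared inputs and outputs.
# The (name, inputs, outputs) triples mirror the notebook pipeline above.
node_specs = [
    ("preprocess_data", ["customers"], ["preprocessed_customers"]),
    ("split_data", ["preprocessed_customers", "params:test_size", "params:random_state"], ["split_data"]),
    ("train_model", ["split_data", "params:random_state"], ["rf_model"]),
    ("evaluate_model", ["rf_model", "split_data"], []),
]

def build_edges(specs):
    """Return (source, target) edges between datasets and nodes."""
    edges = []
    for name, inputs, outputs in specs:
        edges += [(ds, name) for ds in inputs]   # dataset -> node
        edges += [(name, ds) for ds in outputs]  # node -> dataset
    return edges

edges = build_edges(node_specs)
```

None of this needs sklearn, pandas, or even Kedro installed, which is the point: the visualisation layer only consumes the declared structure.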
Outcome
A user will be able to use Kedro-Viz from a notebook, without the need/setup of a Kedro framework.
Evidence markers
- The comments around the learning curve may suggest that users are not used to working across many files and directories
- We will be seeking evidence markers from this in #1448
I love this!
> I love this!

What do you love about this? 😄
I think I have two thoughts:
1. This is a neat way of making Kedro-Viz useful to people who don't want the complexity of the IDE, and it may be a stepping stone to getting people into that space.
2. The second point is something I know others have mentioned before: it annoys me that we need to load a valid Kedro project, with all of its imports and dependencies, just to visualise the pipeline flow. Kedro-Viz (in my mind) should load instantly; you shouldn't have to wait for Spark to spin up (especially because you can't run the pipeline anyway). I've long thought Kedro should be able to create a session lazily, so you can read the pipeline structure for Viz cheaply without incurring the other costs.
Idea: a kedro-openlineage plugin that emits static OpenLineage metadata events, either in ndjson format or to an HTTP endpoint, which are then consumed by Kedro Viz. This is possible with openlineage-python 1.0, released yesterday.
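To illustrate the shape of that idea: an emitter could serialise one run event per node as ndjson, using only the standard library. This is a sketch, not the hypothetical plugin's real output; the field names follow the OpenLineage run-event spec, but the producer URI and job/dataset names are made up for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# Sketch: emit static OpenLineage-style run events as ndjson, which a tool
# like Kedro-Viz could consume instead of loading the project itself.
# Field names follow the OpenLineage event spec; the producer URI and the
# job/dataset names are illustrative placeholders.
def make_run_event(event_type, job_name, run_id, inputs=(), outputs=()):
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-project", "name": job_name},
        "inputs": [{"namespace": "my-project", "name": n} for n in inputs],
        "outputs": [{"namespace": "my-project", "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-kedro-openlineage",
    }

run_id = str(uuid.uuid4())
events = [
    make_run_event("START", "preprocess_data", run_id, inputs=["customers"]),
    make_run_event("COMPLETE", "preprocess_data", run_id,
                   outputs=["preprocessed_customers"]),
]
ndjson = "\n".join(json.dumps(e) for e in events)
```

The same dicts could equally be POSTed to an HTTP endpoint via `openlineage-python`'s client rather than written to a file.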
100000% — also lots of LF AI projects there, we should deffo do this
This thread on Slack shows a user wanting to merge Viz from 3 different Kedro projects that can't exist side by side because they have conflicting dependencies. Kedro-Viz doesn't need to run these projects; it just needs to visualise the pipeline structure: https://linen-slack.kedro.org/t/14142730/hi-everyone-is-it-possible-to-combine-multiple-kedro-project#d84d8f45-eecc-4c1b-b639-4556c1edcd76
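Static structure also sidesteps that dependency-conflict problem entirely: if each project exported its graph as plain data, a combined view would just be a union. A hypothetical sketch (the `{"nodes": ..., "edges": ...}` export format is invented here for illustration, it is not an existing Kedro-Viz format):

```python
# Hypothetical sketch: merge the exported pipeline structures of several
# projects into one combined view, with no project code or dependencies
# loaded. The {"nodes": [...], "edges": [...]} format is illustrative only.
def merge_structures(*structures):
    nodes, edges = set(), set()
    for s in structures:
        nodes.update(s["nodes"])
        edges.update(map(tuple, s["edges"]))
    return {"nodes": sorted(nodes), "edges": sorted(edges)}

project_a = {"nodes": ["raw", "preprocess", "clean"],
             "edges": [["raw", "preprocess"], ["preprocess", "clean"]]}
project_b = {"nodes": ["clean", "featurise", "features"],
             "edges": [["clean", "featurise"], ["featurise", "features"]]}

merged = merge_structures(project_a, project_b)
```

Shared datasets (here `clean`) deduplicate naturally, stitching the two projects into one end-to-end graph.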
I realised I didn't leave a comment here. I created this last year: https://github.com/noklam/kedro-viz-lite. I don't remember whether I succeeded in the end; the logic is mostly in https://github.com/noklam/kedro-viz-lite/blob/main/kedro_viz_lite/core.py.
This led to my subsequent proposals for kedro viz build and the Kedro-Viz GitHub Pages deployment.
My use case for this is exploring pipeline structure, particularly when I need to confirm my pipeline works as expected with namespaces. The alternative is creating a full-blown Kedro project, which is a lot of boilerplate. All I care about is the DAG, and having a DataCatalog and a Pipeline should be enough. It's also because kedro viz is quite slow to start up, which makes it hard when I just want to debug quickly. (--reload sometimes breaks completely if I have an incomplete Kedro project.)
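For the "I just want to check the DAG" case, the standard library already gets you surprisingly far: given each node's dependencies, graphlib can produce a valid execution order (or raise CycleError if the pipeline is wired wrong). A sketch with hypothetical namespaced node names, not tied to any real project:

```python
from graphlib import TopologicalSorter

# Sketch: sanity-check a (possibly namespaced) pipeline's execution order
# without running anything. The node names and their dependencies below are
# hypothetical; the mapping is node -> set of nodes it depends on.
deps = {
    "data_processing.preprocess": set(),
    "data_science.split": {"data_processing.preprocess"},
    "data_science.train": {"data_science.split"},
    "data_science.evaluate": {"data_science.train", "data_science.split"},
}

order = list(TopologicalSorter(deps).static_order())
```

A cycle (e.g. a node consuming its own output via a namespace mix-up) would surface immediately as `graphlib.CycleError`, which is exactly the kind of quick structural debugging described above.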
To add a bit of context, I was writing https://noklam.github.io/blog/posts/understand_namespace/2023-09-26-understand-kedro-namespace-pipeline.html when I thought about this.