🚧 Adds MLflow materializer
🚧 WIP 🚧
## Changes

## How I tested this

## Notes

## Checklist
- [ ] PR has an informative and human-readable title (this will be pulled into the release notes)
- [ ] Changes are limited to a single goal (no scope creep)
- [ ] Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
- [ ] Any change in functionality is tested
- [ ] New functions are documented (with a description, list of inputs, and expected output)
- [ ] Placeholder code is flagged / future TODOs are captured in comments
- [ ] Project documentation has been updated if adding/changing functionality.
How to save a model into mlflow: https://mlflow.org/docs/latest/quickstart.html#store-a-model-in-mlflow
How to load a model from mlflow: https://mlflow.org/docs/latest/quickstart.html#load-a-model-from-a-specific-training-run-for-inference
Model flavors can be found here or below (but the listing seems to be missing `crate`?):
```python
>>> import mlflow
>>> mlflow.__version__
'2.7.1'
>>> [attr for attr in dir(mlflow) if hasattr(getattr(mlflow, attr), 'log_model')]
[
    'catboost', 'diviner', 'fastai', 'gluon', 'h2o', 'johnsnowlabs', 'langchain',
    'lightgbm', 'mleap', 'onnx', 'openai', 'paddle', 'pmdarima', 'prophet',
    'pyfunc', 'pytorch', 'sentence_transformers', 'sklearn', 'spacy', 'spark',
    'statsmodels', 'tensorflow', 'transformers', 'xgboost'
]
```
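One implication for the materializer: every flavor module exposes the same `log_model`/`load_model` pair, so we could dispatch on a flavor name at runtime rather than writing one saver per flavor. A rough, untested sketch -- the helper name and its `flavor` parameter are hypothetical, not a settled API:

```python
import mlflow


def log_model_for_flavor(model, flavor: str, artifact_path: str = "model", **kwargs):
    """Dispatch to the mlflow flavor module named by `flavor`
    (e.g. "sklearn", "pytorch" -- anything in the listing above)."""
    flavor_module = getattr(mlflow, flavor, None)
    if flavor_module is None or not hasattr(flavor_module, "log_model"):
        raise ValueError(f"unknown or unsupported mlflow flavor: {flavor!r}")
    return flavor_module.log_model(model, artifact_path, **kwargs)
```

e.g. `log_model_for_flavor(rf, "sklearn")` would route to `mlflow.sklearn.log_model`.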
top three flavors (probably): sklearn, tensorflow, pytorch. no hard data, just vibes.
Example of the save/load flow for the sklearn model flavor, from the mlflow quickstart:
```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

import mlflow
from mlflow.models import infer_signature

run_id = None
db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

with mlflow.start_run() as run:
    rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
    rf.fit(X_train, y_train)
    save_predictions = rf.predict(X_test)
    signature = infer_signature(X_test, save_predictions)
    mlflow.sklearn.log_model(rf, "model", signature=signature)
    run_id = run.info.run_id

# load_model needs the full artifact URI, including the "model" path used above
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
load_predictions = model.predict(X_test)
assert (save_predictions == load_predictions).all()
```
disclaimer: I have not tested this
These are experimental flavors, so we can probably prioritize them last?
@bryangalindo we should come up with the Hamilton UX to help guide this. i.e. what's the API we want to expose for Hamilton?
Ok let's chat during our sync. Thanks!
High-level tasks:
Analysis:
- [ ] (1 hour) Create "hello, world!" version of load_model/log_model to understand mlflow (debug, print stmts, etc). (See the sketch after this list.)
- [ ] (30 min) Observe directories/files created from log_model (see `hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e`)
- [x] (3 hours) Get high-level understanding of Hamilton driver, see `hamilton/driver.py`.
- [ ] (3 hours) Get high-level understanding of e.g., regressors
- [ ] (1 hour) Read through files in `examples/materialization`
- [x] (15 min) Decide on what reader/writer type makes sense (e.g., MLflowRegressorReader/MLflowRegressorWriter)
- [ ] (15 min) Decide on the applicable type (e.g., dataframe, classifiers, regressors)
- [ ] (30 min) Decide what metadata to save from the model (see `hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e/artifacts/model`)
- [x] (15 min) Discover kwargs for `log_model` and `load_model`.
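For the first item, a boiled-down "hello, world!" along the lines of the mlflow quickstart (untested sketch; the toy numbers are arbitrary):

```python
import mlflow
from sklearn.linear_model import LinearRegression

# Fit a trivial model, log it, then inspect what mlflow wrote under ./mlruns/.
model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "model")
    print("run_id:", run.info.run_id)

# Load it back by URI and sanity-check a prediction.
loaded = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict([[2.0]]))
```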
Reader/Writer Development:
- [ ] (2 hours) Write reader
- [ ] (2 hours) Write writer (a combined reader/writer sketch follows this list)
- [ ] (1 hour) Write get metadata function
- [ ] (1 hour) Write unit tests for get metadata function
- [ ] (1 hour) Write unit tests for reader/writer
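For the reader/writer tasks, a rough shape modeled on the `DataLoader`/`DataSaver` adapters in `hamilton/io/data_adapters.py`. Class names, fields, and the sklearn-only scope are placeholders, and none of this is tested (a real plugin would also register these with Hamilton's registry):

```python
import dataclasses
from typing import Any, Collection, Dict, Tuple, Type

import mlflow
from sklearn.base import BaseEstimator

from hamilton.io.data_adapters import DataLoader, DataSaver


@dataclasses.dataclass
class MLflowModelSaver(DataSaver):
    """Logs a fitted model to a new mlflow run."""

    artifact_path: str = "model"

    @classmethod
    def applicable_types(cls) -> Collection[Type]:
        return [BaseEstimator]  # sklearn first; other flavors later

    def save_data(self, data: Any) -> Dict[str, Any]:
        with mlflow.start_run() as run:
            mlflow.sklearn.log_model(data, self.artifact_path)
        # the returned dict becomes the materializer's metadata
        return {"run_id": run.info.run_id, "artifact_path": self.artifact_path}

    @classmethod
    def name(cls) -> str:
        return "mlflow"


@dataclasses.dataclass
class MLflowModelLoader(DataLoader):
    """Loads a model back by run id."""

    run_id: str
    artifact_path: str = "model"

    @classmethod
    def applicable_types(cls) -> Collection[Type]:
        return [BaseEstimator]

    def load_data(self, type_: Type) -> Tuple[Any, Dict[str, Any]]:
        uri = f"runs:/{self.run_id}/{self.artifact_path}"
        return mlflow.sklearn.load_model(uri), {"model_uri": uri}

    @classmethod
    def name(cls) -> str:
        return "mlflow"
```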
Materializer Development:
- [ ] (30 min) Write data loader module (see `examples/materialization/data_loaders.py`)
- [ ] (1 hour) Write model_training module (see `examples/materialization/model_training.py`; a placeholder sketch follows this list)
- [ ] (1 hour) Write run.py module (see `examples/materialization/run.py`)
- [ ] (1 hour) Write jupyter notebook example (see `examples/materialization/notebook.ipynb`)
- [ ] (5 min) Write requirements.txt
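For the model_training module, a placeholder in the spirit of `examples/materialization/model_training.py`. The node names deliberately match the API outline in the next comment (`training_data`, `my_cool_model`, `predictions`); everything else is made up:

```python
# model_training.py -- placeholder Hamilton module (untested)
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def training_data() -> np.ndarray:
    return np.random.rand(100, 3)


def training_target(training_data: np.ndarray) -> np.ndarray:
    return training_data.sum(axis=1)


def my_cool_model(
    training_data: np.ndarray, training_target: np.ndarray
) -> RandomForestRegressor:
    return RandomForestRegressor().fit(training_data, training_target)


def predictions(
    my_cool_model: RandomForestRegressor, training_data: np.ndarray
) -> np.ndarray:
    return my_cool_model.predict(training_data)
```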
Hey @bryangalindo -- a thought on a feature that might be helpful. Here's an outline of what the API should look like -- the data saver/materialization implementation should support this:
```python
from hamilton import driver
from hamilton.function_modifiers import source
from hamilton.io.materialization import to

dr = driver.Driver(...)
dr.materialize(
    to.mlflow(
        id="mlflow_save",
        dependencies=["my_cool_model"],
        model_input=source("training_data"),
        model_output=source("predictions"),
    )
)
```
Then the materializer would call `infer_signature` with the `model_input` and `model_output` -- these would be taken from the nodes called `training_data` and `predictions`. The DAG would look like:

```
training_data -> mlflow_save
predictions -> mlflow_save
```

and possibly more connections. Does this make sense? This is all supported btw -- materializers can take in `source`/`value` type parameters -- if they take in something that isn't one of those, it'll just resolve to a `value`.
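To make that concrete, the saver side might end up looking roughly like this once the `source` parameters have been resolved (hypothetical helper, untested):

```python
import mlflow
from mlflow.models import infer_signature


def save_with_signature(model, model_input, model_output, artifact_path: str = "model"):
    # model_input/model_output arrive already resolved -- Hamilton would hand the
    # materializer the computed values of the `training_data` and `predictions` nodes.
    signature = infer_signature(model_input, model_output)
    with mlflow.start_run() as run:
        mlflow.sklearn.log_model(model, artifact_path, signature=signature)
    return {"run_id": run.info.run_id}
```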
Closed in favor of #945