unionml
[docs] Create an end-to-end Spark example
Create an example under the tutorials showcasing an end-to-end Spark use case
@peridotml you can `#self-assign` this to yourself.
Btw, we use a custom docs-building process for the tutorials (which I still need to document).
Basically, if you create a MyST markdown file here, you can use the mnist example as a template to work off of.
Steps to write tutorial
- follow the contribution guide to set up your dev environment
- write the Myst markdown file (read more about it here)
- the `convert-myst-to-ipynb` pre-commit hook should convert it to a docs page (you can run the conversion script manually with `python -m scripts.myst_to_ipynb`)
- run `make docs` to build docs locally
#self-assign
@cosmicBboy I thought it made sense to first copy the integration tests to see where the nuances of pyspark ml models clashed with unionml's defaults. We don't have to merge the code in, I just thought it was a good place to surface some areas of discussion before writing a specific example.
Here is the PR against my fork. I can change the base to this repo if you think that would be helpful.
I made a couple discoveries:
Good
- With a small number of changes, even the FastAPI serving test worked
UnionML / Pyspark compatibility Issues
- PySpark ML has different types for initialized and fitted models, which breaks the guardrails.
- Pyspark doesn't split the data into features and target, it uses class attributes to know the name of the target column.
- Features need to be preprocessed into a single column. I think this could all be handled by a custom initializer that builds a pipeline, but it would require information from the dataset. Let me know if you have thoughts on what to do here!
Pyspark Issues
- It looks like unionml is combining the user-specified hyperparameters with the defaults. This broke the logistic regression, which might point to an issue on their end. I will investigate this more.
- Loading a pyspark model requires referencing the model class, e.g. `LogisticRegressionModel.load`, so I am only using PipelineModels. There might be a cool way for Flyte's type system to handle cases like this, so let me know if there is and I can update the plugin.
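One possible way around the class-specific `load` call, sketched here purely as an idea and not as anything unionml or Flyte does today, is to record the fitted model's dotted class path when saving and resolve it again at load time. The helper names `qualified_name`, `resolve_class`, and `load_model` are hypothetical. The sketch uses only the standard library and assumes the class is top-level in an importable module and exposes a `load(path)` classmethod, as PySpark ML models do.

```python
import importlib


def qualified_name(cls):
    """Return a dotted path such as 'package.module.ClassName'.

    Assumption: cls is defined at module top level (no nested classes).
    """
    return f"{cls.__module__}.{cls.__qualname__}"


def resolve_class(class_path):
    """Import the module part of a dotted path and return the named class."""
    module_name, _, class_name = class_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def load_model(class_path, model_path):
    """Load a saved model through the `load` classmethod of its recorded class."""
    model_cls = resolve_class(class_path)
    return model_cls.load(model_path)


# Hypothetical usage with PySpark installed and a model saved at model_path:
#   class_path = qualified_name(type(fitted_model))  # stored next to the model
#   restored = load_model(class_path, model_path)
```

Sticking to `PipelineModel` avoids the need for this bookkeeping, at the cost of wrapping single estimators in a one-stage pipeline.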