unionml [docs] Create an end-to-end Spark example

Create an example under the tutorials showcasing an end-to-end Spark use case

Aug 17 '22 16:08 cosmicBboy

@peridotml you can comment #self-assign to assign this issue to yourself.

Btw, we use a custom docs building process for the tutorials (which I still need to document).

But basically, if you create a markdown file (MyST markdown) here, you can use the mnist example as a template to work off of.

Steps to write tutorial

1. Follow the contribution guide to set up your dev environment.
2. Write the MyST markdown file (read more about it here).
3. The convert-myst-to-ipynb pre-commit hook should convert it to a docs page (you can run the conversion script manually with python -m scripts.myst_to_ipynb).
4. Run make docs to build the docs locally.

Aug 17 '22 16:08 cosmicBboy

#self-assign

Aug 24 '22 21:08 peridotml

@cosmicBboy I thought it made sense to first copy the integration tests to see where the nuances of PySpark ML models clash with unionml's defaults. We don't have to merge the code in; I just thought it was a good place to surface some areas for discussion before writing a specific example.

Here is the PR against my fork. I can change the base to this repo if you think that would be helpful.

I made a couple discoveries:

Good

With a small number of changes, even the FastAPI serving test worked

UnionML / PySpark compatibility Issues

PySpark ML has different types for initialized and fitted models (for example, fitting a LogisticRegression returns a LogisticRegressionModel), which breaks the guardrails.
PySpark doesn't split the data into features and target; it uses attributes set on the estimator (such as labelCol) to know the name of the target column.
There is a necessary preprocessing step that assembles the features into a single column. I think this could all be handled with a custom initializer that builds a pipeline. However, it would require information from the dataset. Let me know if you have thoughts on what to do here!
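
To make the last point concrete, here is a rough sketch of what such an initializer could look like. It is not code from the PR: the names feature_columns and build_pipeline are made up for illustration, and it assumes a classification task where every non-label column is a numeric feature. The PySpark classes (Pipeline, VectorAssembler, LogisticRegression) are the standard ones, and building them needs an active SparkSession.

```python
from typing import List, Sequence


def feature_columns(columns: Sequence[str], label_col: str) -> List[str]:
    """Return every column except the label, preserving order (hypothetical helper)."""
    if label_col not in columns:
        raise ValueError(f"label column {label_col!r} not found in {list(columns)}")
    return [c for c in columns if c != label_col]


def build_pipeline(columns: Sequence[str], label_col: str = "label"):
    """Build an unfitted Pipeline (hypothetical initializer).

    Requires pyspark and an active SparkSession when called.
    """
    # Imported lazily so feature_columns stays usable without pyspark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Assemble all feature columns into the single vector column PySpark expects.
    assembler = VectorAssembler(
        inputCols=feature_columns(columns, label_col),
        outputCol="features",
    )
    # The target is identified by name via labelCol, not passed separately.
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)

    # Pipeline.fit(df) returns a PipelineModel, a different class from Pipeline.
    return Pipeline(stages=[assembler, classifier])
```

With a Spark DataFrame df, this would be used as build_pipeline(df.columns, "target").fit(df), which is exactly where the dependency on the dataset shows up: the column names have to be known before the model object can be created.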

PySpark Issues

It looks like unionml is combining the user-specified hyperparameters with the defaults. This broke the logistic regression, which might point to an issue on PySpark's end. I will investigate this more.
Loading a PySpark model requires referencing the model class, e.g. LogisticRegressionModel.load, so I am only using PipelineModels. There might be a cool way for Flyte's type system to handle cases like this, so let me know if there is and I can update the plugin.
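
As a small illustration of the loading problem, the sketch below shows why the loader has to know the class up front. load_model is a hypothetical helper, not part of unionml or the Flyte plugin; it defaults to PipelineModel, which is the workaround described above, and only uses the load classmethod that PySpark's fitted models provide.

```python
from typing import Any, Optional


def load_model(path: str, model_cls: Optional[Any] = None):
    """Load a fitted model from path (hypothetical helper).

    PySpark has no generic "load whatever is at this path" entry point here:
    the caller must name the class whose load() is used. Defaulting to
    PipelineModel works as long as every model is saved as a PipelineModel.
    """
    if model_cls is None:
        # Imported lazily so the function can be defined without pyspark installed.
        from pyspark.ml import PipelineModel

        model_cls = PipelineModel
    return model_cls.load(path)
```

Saving everything as a PipelineModel keeps the loader to a single class, at the cost of wrapping even single-stage models in a pipeline.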

Aug 27 '22 20:08 peridotml
