mleap
Add support for spark-extensions transformers in Pyspark
Creating this parent issue to track progress on adding support for spark-extensions transformers in Pyspark.
- [x] Support for StringMap
- [x] Support for MathBinary
- [x] Support for MathUnary
- [ ] Support for WordLengthFilter
- [ ] Support for MultinomialLabeler
- [ ] Support for Imputer
- [ ] Support for OneVsRest
- [ ] Support for SupportVectorMachine
- [x] Add unit test support for Pyspark transformers
Related issues:
- https://github.com/combust/mleap/issues/551
- https://github.com/combust/mleap/issues/552
- https://github.com/combust/mleap/issues/477
- https://github.com/combust/mleap/issues/495
There are currently two examples of custom transformers in Pyspark under https://github.com/combust/mleap/pull/568.
How long will it take?
> Add unit test support for Pyspark transformers
I have added most of the building blocks for that here: https://github.com/combust/mleap/pull/571
MathUnary implementation at: https://github.com/combust/mleap/pull/666
MathBinary implementation: https://github.com/combust/mleap/pull/672
Can we also add VectorSizeHint (a feature transformer)? My model has a VectorSizeHint stage, and it failed during mleap serialization.
> Can we also add VectorSizeHint (a feature transformer)?
Seems reasonable. @RuxuePeng do you have any interest in submitting a PR? I'm happy to help you review it and merge 😄
Hi, is there a better way to serialize a custom transformer with Python now? According to https://github.com/combust/mleap-docs/issues/15, there's a lot of work to do for someone unfamiliar with Scala. Is there any update? Thanks.
@johnnyasd12 the pyspark portions of custom transformers are usually quite simple, since pyspark is just a py4j wrapper around the "real" Spark Scala code. But as you correctly identified, custom transformers require that you write Scala code.
Needing to write Scala is a fundamental part of mleap's design and is probably never going to change. Reducing the amount of Scala code needed is something that some others and I have discussed, but it is not being actively developed. That would mean, e.g., needing to write fewer files and having more base classes that implement the other files. But mleap support for Python UDFs is not something I would expect to ever happen, since the mleap runtime is JVM-based.
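To illustrate how thin the Python side is, here is a minimal sketch of the delegation pattern described above. It is not mleap's or pyspark's actual code: `FakeJvmStringMap` and `StringMapWrapper` are hypothetical names, the stand-in class plays the role of the Scala transformer so the example runs without Spark, and the `defaultValue` fallback of 0.0 is an assumption made for the example.

```python
# Illustrative only. FakeJvmStringMap stands in for the Scala transformer that,
# in real pyspark, lives in the JVM and is reached through py4j.
class FakeJvmStringMap:
    def __init__(self):
        self.params = {}

    def set(self, name, value):
        self.params[name] = value
        return self

    def transform(self, rows):
        # All of the actual mapping logic lives on this ("Scala") side.
        labels = self.params["labels"]
        in_col = self.params["inputCol"]
        out_col = self.params["outputCol"]
        default = self.params.get("defaultValue", 0.0)  # assumed fallback
        return [{**row, out_col: labels.get(row[in_col], default)} for row in rows]


# Hypothetical Python-side wrapper: it holds no logic of its own and only
# forwards parameters and method calls to the backing object.
class StringMapWrapper:
    def __init__(self, labels, inputCol, outputCol, defaultValue=0.0):
        self._java_obj = FakeJvmStringMap()
        for name, value in [
            ("labels", labels),
            ("inputCol", inputCol),
            ("outputCol", outputCol),
            ("defaultValue", defaultValue),
        ]:
            self._java_obj.set(name, value)

    def transform(self, rows):
        return self._java_obj.transform(rows)


if __name__ == "__main__":
    wrapper = StringMapWrapper({"a": 1.0, "b": 2.0}, inputCol="key", outputCol="value")
    print(wrapper.transform([{"key": "a"}, {"key": "zzz"}]))
```

In a real wrapper the forwarding goes through py4j to a Scala class rather than to a Python stand-in, which is why the Scala implementation still has to exist.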
@jsleight thanks for giving insight into this topic! For our corporate data science project we also need custom transformers in pyspark for mleap. Are there any updates on this since 2021? It would definitely be a very welcome feature!
Don't think there have been any updates w.r.t. simplifying the process of making custom transformers.