
Standardize Serialization format with Spark

Open hollinwilkins opened this issue 8 years ago • 0 comments

Standardizing ML Pipeline Serialization

Currently there is a large array of serialization formats for machine learning models:

  1. PMML is an XML-based format primarily targeting the JVM for executing ML models
  2. Scikit-learn relies on Python pickling to export models
  3. Spark has a serialization format based on Parquet and JSON
  4. Various other libraries such as Caffe, Torch, MLDB, etc. each use their own custom file format to store models

We propose a serialization format that is highly extensible, portable across languages and platforms, and open source, with reference implementations in both Scala and Rust. We call this serialization format Bundle.ML.
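To make the idea concrete, a directory-based bundle might look like the sketch below: one directory per pipeline stage, each holding a small JSON document describing the operation and its attributes. This layout is purely illustrative; the file and field names are not the final Bundle.ML schema.

```
my_pipeline.bundle/
├── bundle.json              # format version and metadata
├── 0_string_indexer.node/
│   └── model.json           # {"op": "string_indexer", "attributes": {...}}
└── 1_linear_regression.node/
    └── model.json           # {"op": "linear_regression", "attributes": {...}}
```

Because each stage is self-describing, any implementation that recognizes a stage's `op` name can load it, regardless of which library wrote it.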

Key Features

  1. It should be easy for developers to add custom transformers in Scala, Java, Python, C, Rust, or any other language
  2. The serialization format should be flexible and meet state-of-the-art performance requirements. This means being able to serialize arbitrarily-large random forest, linear, or neural network models.
  3. Serialization should be optimized for ML Transformers and Pipelines as seen in Scikit-learn and Spark, but it should also support non-pipeline based frameworks such as H2O
  4. Serialization should be accessible for all environments and platforms, including low-level languages like C, C++ and Rust
  5. Provide a common, extensible serialization format for any technology to integrate with via custom transformers or core transformers
  6. Serialization/deserialization should be possible with as many technologies as possible to make the models truly portable between different platforms, i.e., we should be able to train a pipeline with Scikit-learn and then execute it in Spark.
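The extensibility goals above can be sketched with a minimal op registry: each transformer declares an `op` name and can dump/restore its attributes as JSON, and adding a custom transformer only means registering one more op. This is a hypothetical illustration in Python, not the actual Bundle.ML API; all class and field names here are invented for the example.

```python
import json
import os
import tempfile

# Illustrative transformers; names and attribute schemas are hypothetical.
class StringIndexer:
    op = "string_indexer"

    def __init__(self, labels):
        self.labels = labels

    def attributes(self):
        return {"labels": self.labels}

    @classmethod
    def from_attributes(cls, attrs):
        return cls(attrs["labels"])


class LinearRegression:
    op = "linear_regression"

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def attributes(self):
        return {"coefficients": self.coefficients, "intercept": self.intercept}

    @classmethod
    def from_attributes(cls, attrs):
        return cls(attrs["coefficients"], attrs["intercept"])


# Registry keyed by op name: supporting a custom transformer in any
# implementation only requires registering its op here.
REGISTRY = {t.op: t for t in (StringIndexer, LinearRegression)}


def save_pipeline(stages, path):
    """Write each stage as a self-describing JSON node directory."""
    os.makedirs(path, exist_ok=True)
    for i, stage in enumerate(stages):
        node_dir = os.path.join(path, f"{i}_{stage.op}.node")
        os.makedirs(node_dir, exist_ok=True)
        with open(os.path.join(node_dir, "model.json"), "w") as f:
            json.dump({"op": stage.op, "attributes": stage.attributes()}, f)


def load_pipeline(path):
    """Rebuild stages by dispatching on each node's op name."""
    stages = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name, "model.json")) as f:
            doc = json.load(f)
        stages.append(REGISTRY[doc["op"]].from_attributes(doc["attributes"]))
    return stages


# Round-trip a two-stage pipeline through the on-disk format.
with tempfile.TemporaryDirectory() as d:
    save_pipeline([StringIndexer(["a", "b"]), LinearRegression([0.5], 1.0)], d)
    loaded = load_pipeline(d)
```

Because the on-disk representation is plain JSON keyed by op name, a Scala writer and a Rust (or Python) reader only need to agree on the op vocabulary, which is the portability property the feature list asks for.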

hollinwilkins avatar Feb 06 '17 17:02 hollinwilkins