mleap
Standardize serialization format with Spark
Standardizing ML Pipeline Serialization
Currently, there is a wide array of serialization formats for machine learning models:
- PMML is an XML-based format primarily targeting the JVM for executing ML models
- Scikit-learn relies on Python pickling to export models
- Spark has a serialization format based on Parquet and JSON
- Various other libraries, such as Caffe, Torch, and MLDB, use their own custom file formats to store models
We propose a serialization format that is highly extensible, portable across languages and platforms, and open source, with reference implementations in both Scala and Rust. We call this serialization format Bundle.ML.
Key Features
- It should be easy for developers to add custom transformers in Scala, Java, Python, C, Rust, or any other language
- The serialization format should be flexible and meet state-of-the-art performance requirements. This means being able to serialize arbitrarily large random forest, linear, or neural network models.
- Serialization should be optimized for ML Transformers and Pipelines as seen in Scikit-learn and Spark, but it should also support non-pipeline-based frameworks such as H2O
- Serialization should be accessible for all environments and platforms, including low-level languages like C, C++, and Rust
- Provide a common, extensible serialization format for any technology to integrate with via custom transformers or core transformers
- Serialization/deserialization should be possible from as many technologies as possible, making models truly portable between platforms; i.e., we should be able to train a pipeline with Scikit-learn and then execute it in Spark.
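To make the portability goal concrete, here is a minimal sketch of the core idea: a model encoded as plain JSON that any language with a JSON parser can load and execute. The field names (`op`, `attributes`) and function names are illustrative only and do not reflect the actual Bundle.ML schema.

```python
import json

def serialize_linear_model(coefficients, intercept):
    """Encode a linear model as a JSON string any platform can parse.

    The schema here is a hypothetical stand-in, not the real Bundle.ML layout.
    """
    return json.dumps({
        "op": "linear_regression",
        "attributes": {
            "coefficients": coefficients,
            "intercept": intercept,
        },
    })

def deserialize_and_predict(model_json, features):
    """Decode the JSON model and apply it to a feature vector."""
    model = json.loads(model_json)
    attrs = model["attributes"]
    return sum(c * x for c, x in zip(attrs["coefficients"], features)) + attrs["intercept"]

# Serialize in one process (e.g. a Scikit-learn training job) ...
bundle = serialize_linear_model([0.5, 2.0], 1.0)

# ... and score in another (e.g. a Spark or Rust runtime).
prediction = deserialize_and_predict(bundle, [4.0, 3.0])  # 0.5*4 + 2.0*3 + 1.0 = 9.0
```

Because the intermediate representation is text-based and self-describing, the training and scoring sides need to agree only on the schema, not on a shared language runtime, which is the property the bullet points above ask of Bundle.ML.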