
Standardize Serialization format with Spark

Open hollinwilkins opened this issue 8 years ago • 0 comments

Standardizing ML Pipeline Serialization

Currently there is a large array of serialization formats for machine learning models:

  1. PMML is an XML-based format primarily targeting the JVM for executing ML models
  2. Scikit-learn relies on Python pickling to export models
  3. Spark has a serialization format based on Parquet and JSON
  4. Various other libraries such as Caffe, Torch, MLDB, etc. each use their own custom file format to store models

We propose a serialization format that is highly extensible, portable across languages and platforms, and open source, with reference implementations in both Scala and Rust. We call this serialization format Bundle.ML.
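To make the idea concrete, a directory-based bundle might look like the sketch below: one directory per pipeline stage, each holding a small JSON document describing the operation and its attributes. This layout is purely illustrative; the file and field names are not the final Bundle.ML schema.

```
my_pipeline.bundle/
├── bundle.json              # format version and metadata
├── 0_string_indexer.node/
│   └── model.json           # {"op": "string_indexer", "attributes": {...}}
└── 1_linear_regression.node/
    └── model.json           # {"op": "linear_regression", "attributes": {...}}
```

Because each stage is self-describing, any implementation that recognizes a stage's `op` name can load it, regardless of which library wrote it.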

Key Features

  1. It should be easy for developers to add custom transformers in Scala, Java, Python, C, Rust, or any other language
  2. The serialization format should be flexible and meet state-of-the-art performance requirements. This means being able to serialize arbitrarily-large random forest, linear, or neural network models.
  3. Serialization should be optimized for ML Transformers and Pipelines as seen in Scikit-learn and Spark, but it should also support non-pipeline based frameworks such as H2O
  4. Serialization should be accessible for all environments and platforms, including low-level languages like C, C++ and Rust
  5. Provide a common, extensible serialization format for any technology to integrate with via custom transformers or core transformers
  6. Serialization/deserialization should be possible with as many technologies as possible to make the models truly portable between different platforms, i.e., we should be able to train a pipeline with Scikit-learn and then execute it in Spark.
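The extensibility goals above can be sketched with a minimal op registry: each transformer declares an `op` name and can dump/restore its attributes as JSON, and adding a custom transformer only means registering one more op. This is a hypothetical illustration in Python, not the actual Bundle.ML API; all class and field names here are invented for the example.

```python
import json
import os
import tempfile

# Illustrative transformers; names and attribute schemas are hypothetical.
class StringIndexer:
    op = "string_indexer"

    def __init__(self, labels):
        self.labels = labels

    def attributes(self):
        return {"labels": self.labels}

    @classmethod
    def from_attributes(cls, attrs):
        return cls(attrs["labels"])


class LinearRegression:
    op = "linear_regression"

    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def attributes(self):
        return {"coefficients": self.coefficients, "intercept": self.intercept}

    @classmethod
    def from_attributes(cls, attrs):
        return cls(attrs["coefficients"], attrs["intercept"])


# Registry keyed by op name: supporting a custom transformer in any
# implementation only requires registering its op here.
REGISTRY = {t.op: t for t in (StringIndexer, LinearRegression)}


def save_pipeline(stages, path):
    """Write each stage as a self-describing JSON node directory."""
    os.makedirs(path, exist_ok=True)
    for i, stage in enumerate(stages):
        node_dir = os.path.join(path, f"{i}_{stage.op}.node")
        os.makedirs(node_dir, exist_ok=True)
        with open(os.path.join(node_dir, "model.json"), "w") as f:
            json.dump({"op": stage.op, "attributes": stage.attributes()}, f)


def load_pipeline(path):
    """Rebuild stages by dispatching on each node's op name."""
    stages = []
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name, "model.json")) as f:
            doc = json.load(f)
        stages.append(REGISTRY[doc["op"]].from_attributes(doc["attributes"]))
    return stages


# Round-trip a two-stage pipeline through the on-disk format.
with tempfile.TemporaryDirectory() as d:
    save_pipeline([StringIndexer(["a", "b"]), LinearRegression([0.5], 1.0)], d)
    loaded = load_pipeline(d)
```

Because the on-disk representation is plain JSON keyed by op name, a Scala writer and a Rust (or Python) reader only need to agree on the op vocabulary, which is the portability property the feature list asks for.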

hollinwilkins avatar Feb 06 '17 17:02 hollinwilkins