OpenEmbedding
English version | 中文版
About
OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration.
Nowadays, many machine learning and deep learning applications are built on parameter servers, which efficiently store and update model weights. When a model has a large number of sparse features (e.g., Wide&Deep and DeepFM for CTR prediction), the number of weights easily runs into billions to trillions; for instance, 10^9 embedding rows of 64 float32 values already occupy 256 GB. In such cases, traditional synchronization solutions (such as the Allreduce-based solution adopted by Horovod) cannot achieve high performance because of the massive communication overhead introduced by the tremendous number of sparse features. To achieve efficiency for such sparse models, we developed OpenEmbedding, which enhances the parameter server especially for sparse model training and inference.
Highlights
Efficiency
- We propose an efficient customized sparse format to handle sparse features, together with fine-grained optimizations such as cache-conscious algorithms, asynchronous cache reads and writes, and lightweight locks to maximize parallelism. OpenEmbedding achieves a speedup of 3-8x over the Allreduce-based solution on a single machine equipped with 8 GPUs for sparse model training.
Ease-of-use
- We have integrated OpenEmbedding into TensorFlow. Only three lines of code changes are required to utilize OpenEmbedding in TensorFlow for both training and inference (sketched after this list).
Adaptability
- In addition to TensorFlow, it is straightforward to integrate OpenEmbedding into other popular frameworks. We demonstrate the integration with DeepCTR and Horovod in the examples.
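As a concrete sketch of those three lines (a minimal illustration built on the embed.distributed_optimizer and embed.distributed_model APIs from the User Guide below; the toy Sequential model is only a placeholder, and whether a plain Sequential converts as cleanly as the functional models in the examples is an assumption here):

import tensorflow as tf
import openembedding.tensorflow as embed

# A toy Keras model with one sparse Embedding layer, standing in for a real one.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 8, input_length=1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
# The changed lines: wrap the optimizer and the model for the parameter server.
optimizer = embed.distributed_optimizer(tf.keras.optimizers.Adam())
model = embed.distributed_model(model)
model.compile(optimizer, 'binary_crossentropy', metrics=['AUC'])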
Benchmark

For models that contain sparse features, the Allreduce-based framework Horovod alone achieves limited speedup. Using OpenEmbedding together with Horovod gives much better acceleration: on a single machine with 8 GPUs, the speedup ratio is 3 to 8 times, and many models achieve 3 to 7 times the performance of Horovod alone.
- Benchmark
Install & Quick Start
You can install and run OpenEmbedding with the following steps. The examples walk through the whole process of training on the Criteo dataset with OpenEmbedding and serving predictions with TensorFlow Serving.
Docker
NVIDIA Docker is required to use GPUs inside the image. The OpenEmbedding image can be obtained from Docker Hub.
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
docker run --rm --gpus all -v /tmp/criteo:/openembedding/tmp/criteo \
    4pdosc/openembedding:latest examples/run/criteo_deepctr_standalone.sh 
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v /tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
docker run --rm --network host 4pdosc/openembedding:latest examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
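For reference, the request that criteo_deepctr_restful.sh sends corresponds roughly to the sketch below, hand-written against TensorFlow Serving's standard REST API; the feature names and values here are placeholders, not the real Criteo schema:

import json
import urllib.request

# TensorFlow Serving exposes REST on port 8501: POST /v1/models/<name>:predict.
body = json.dumps({'instances': [{'I1': 0.5, 'C1': 42}]}).encode()
request = urllib.request.Request(
    'http://localhost:8501/v1/models/criteo:predict',
    data=body, headers={'Content-Type': 'application/json'})
print(urllib.request.urlopen(request).read().decode())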
Ubuntu
# Install the dependencies required by OpenEmbedding.
apt update && apt install -y gcc-7 g++-7 python3 libpython3-dev python3-pip
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
apt install -y git cmake mpich curl 
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh 
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
CentOS
# Install the dependencies required by OpenEmbedding.
yum install -y centos-release-scl
yum install -y python3 python3-devel devtoolset-7
scl enable devtoolset-7 bash
pip3 install --upgrade pip
pip3 install tensorflow==2.5.1
pip3 install openembedding
# Install the dependencies required by examples.
yum install -y git cmake mpich curl 
HOROVOD_WITHOUT_MPI=1 pip3 install horovod
pip3 install deepctr pandas scikit-learn mpi4py
# Download the examples.
git clone https://github.com/4paradigm/OpenEmbedding.git
cd OpenEmbedding
# The script "criteo_deepctr_stanalone.sh" will train and export the model to the path "tmp/criteo/1".
# It is okay to switch to:
#    "criteo_deepctr_horovod.sh" (multi-GPU training with Horovod),
#    "criteo_deepctr_mirrored.sh" (multi-GPU training with MirroredStrategy),
#    "criteo_deepctr_mpi.sh" (multi-GPU training with MultiWorkerMirroredStrategy and MPI).
examples/run/criteo_deepctr_standalone.sh 
# Start TensorFlow Serving to load the trained model.
docker run --name serving-example -td -p 8500:8500 -p 8501:8501 \
        -v `pwd`/tmp/criteo:/models/criteo -e MODEL_NAME=criteo tensorflow/serving:latest
# Wait for the model server to start.
sleep 5
# Send requests and get prediction results.
examples/run/criteo_deepctr_restful.sh
# Clean up the Docker container.
docker stop serving-example
docker rm serving-example
Note
The installation usually requires g++ 7 or higher, or a compiler compatible with tf.version.COMPILER_VERSION. The compiler can be specified by the environment variables CC and CXX. Currently, OpenEmbedding can only be installed on Linux.
CC=gcc CXX=g++ pip3 install openembedding
If TensorFlow is upgraded, you need to reinstall OpenEmbedding.
pip3 uninstall openembedding && pip3 install --no-cache-dir openembedding
User Guide
A sample program for common usage is as follows.
Create Model and Optimizer.
import tensorflow as tf
from deepctr.models import WDL
optimizer = tf.keras.optimizers.Adam()
model = WDL(feature_columns, feature_columns, task='binary')
Transform into a distributed Model and a distributed Optimizer. The Embedding layers will be stored on the parameter server.
import horovod.tensorflow.keras as hvd
import openembedding.tensorflow as embed
hvd.init()
optimizer = embed.distributed_optimizer(optimizer)
optimizer = hvd.DistributedOptimizer(optimizer)
model = embed.distributed_model(model)
Here, embed.distributed_optimizer converts the TensorFlow optimizer into an optimizer that supports the parameter server, so that the parameters on the parameter server can be updated. The function embed.distributed_model replaces the Embedding layers in the model and overrides the methods to support saving and loading with parameter servers. The method Embedding.call pulls the parameters from the parameter server, and a backpropagation function is registered to push the gradients back to the parameter server.
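The effect of this pull/push pattern can be seen with stock TensorFlow alone: the gradient of an embedding lookup is naturally sparse, and it is exactly these per-row gradients, rather than a dense table-sized tensor, that need to travel to the server (a conceptual illustration only, not OpenEmbedding's actual code):

import tensorflow as tf

table = tf.Variable(tf.random.normal([1000, 8]))  # stands in for the remote table
ids = tf.constant([3, 17, 3])
with tf.GradientTape() as tape:
    rows = tf.gather(table, ids)      # forward: "pull" only the touched rows
    loss = tf.reduce_sum(rows ** 2)
grad = tape.gradient(loss, table)     # an IndexedSlices, not a dense tensor
# Only rows {3, 17} carry gradients; these (id, gradient) pairs are what
# a parameter server needs to receive to apply the update.
print(type(grad).__name__, grad.indices.numpy())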
Data parallelism with Horovod.
model.compile(optimizer, "binary_crossentropy", metrics=['AUC'],
              experimental_run_tf_function=False)
callbacks = [ hvd.callbacks.BroadcastGlobalVariablesCallback(0),
              hvd.callbacks.MetricAverageCallback() ]
model.fit(dataset, epochs=10, verbose=2, callbacks=callbacks)
Export as a stand-alone SavedModel so that it can be loaded by TensorFlow Serving.
if hvd.rank() == 0:
    # Must specify include_optimizer=False explicitly
    model.save_as_original_model('model_path', include_optimizer=False)
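Since the export rewrites the distributed layers back into standard TensorFlow ones, the result should load in stock TensorFlow with no OpenEmbedding installed. A quick check of this (assuming the default Keras serving signature):

import tensorflow as tf

# Load the exported stand-alone SavedModel without importing openembedding.
loaded = tf.saved_model.load('model_path')
print(list(loaded.signatures))  # typically ['serving_default']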
More examples are listed below.
- Replace Embedding layer
- Transform network model
- Custom subclass model
- With MirroredStrategy
- With MultiWorkerMirroredStrategy and MPI
Build
Docker Build
docker build -t 4pdosc/openembedding-base:0.1.0 -f docker/Dockerfile.base .
docker build -t 4pdosc/openembedding:0.0.0-build -f docker/Dockerfile.build .
Native Build
The compiler needs to be compatible with tf.version.COMPILER_VERSION (>= 7), and all prpc dependencies must be installed to tools or /usr/local. Then run build.sh to complete the compilation; it automatically installs prpc (pico-core) and parameter-server (pico-ps) into the tools directory.
git submodule update --init --checkout --recursive
pip3 install tensorflow
./build.sh clean && ./build.sh build
pip3 install ./build/openembedding-*.tar.gz
Features
TensorFlow 2
- dtype: float32, float64.
- tensorflow.keras.initializers: RandomNormal, RandomUniform, Constant, Zeros, Ones.
  - The parameter seed is currently ignored.
- tensorflow.keras.optimizers: Adadelta, Adagrad, Adam, Adamax, Ftrl, RMSprop, SGD.
  - decay and LearningRateSchedule are not supported.
  - Adam(amsgrad=True) is not supported.
  - RMSprop(centered=True) is not supported.
  - The parameter server uses a sparse update method, which may cause different training results for optimizers with momentum (see the sketch after this list).
- tensorflow.keras.layers.Embedding
  - Supports an array for known input_dim and a hash table for unknown input_dim (2**63 range).
  - Can still be stored on workers and use the dense update method.
  - Should not use embeddings_regularizer or embeddings_constraint.
- tensorflow.keras.Model
  - Can be converted to a distributed Model; incompatible settings (such as embeddings_constraint) are automatically ignored or converted.
  - Distributed save, save_weights, load_weights and ModelCheckpoint.
  - The distributed Model can be saved as a stand-alone SavedModel, which can be loaded by TensorFlow Serving.
  - Training multiple distributed Models in one task is not supported.
- Can collaborate with Horovod. Training with MirroredStrategy or MultiWorkerMirroredStrategy is experimental.
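On the momentum caveat above: with a dense update, an optimizer such as Adam advances its moment estimates for every row on every step, while a sparse update only touches the rows present in the batch, so the two can diverge. A small demonstration with stock TensorFlow (no OpenEmbedding involved; it only illustrates the general sparse-versus-dense behavior):

import tensorflow as tf

def train(sparse):
    table = tf.Variable(tf.ones([4, 2]))
    opt = tf.keras.optimizers.Adam(0.1)
    for step in range(3):
        ids = tf.constant([0]) if step == 0 else tf.constant([1])
        with tf.GradientTape() as tape:
            loss = tf.reduce_sum(tf.gather(table, ids))
        grad = tape.gradient(loss, table)  # IndexedSlices (sparse)
        if not sparse:
            grad = tf.convert_to_tensor(grad)  # densify: zeros for untouched rows
        opt.apply_gradients([(grad, table)])
    return table[0].numpy()

# Row 0 is touched only at step 0. Under sparse updates it then stays put;
# under dense updates Adam's nonzero momentum keeps moving it.
print(train(sparse=True), train(sparse=False))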
TODO
- Improve performance
- Support PyTorch training
- Support tf.feature_column.embedding_column
- Approximate embeddings_regularizer, LearningRateSchedule, etc.
- Improve the support for Initializer and Optimizer
- Training multiple distributed Models in one task
- Support ONNX
Designs
- Training
- Serving
Authors
- Yiming Liu ([email protected])
- Yilin Wang ([email protected])
- Cheng Chen ([email protected])
- Guangchuan Shi ([email protected])
- Zhao Zheng ([email protected])
Persistent Memory (PMem)
Currently, the interface for persistent memory is experimental. PMem-based OpenEmbedding provides a lightweight checkpointing scheme with performance comparable to the DRAM version. For long-running deep learning recommendation model training, PMem-based OpenEmbedding provides not only an efficient but also a reliable training process.
- PMem-based OpenEmbedding
Publications
- OpenEmbedding: A Distributed Parameter Server for Deep Learning Recommendation Models using Persistent Memory. Cheng Chen, Yilin Wang, Jun Yang, Yiming Liu, Mian Lu, Zhao Zheng, Bingsheng He, Weng-Fai Wong, Liang You, Penghao Sun, Yuping Zhao, Fenghua Hu, and Andy Rudoff. In the 39th IEEE International Conference on Data Engineering (ICDE), 2023.