yggdrasil-decision-forests
Simple Model problem
Hi,
I have the following example. How do I rewrite the code so I can call the model's Predict for a given input? There is no example in the docs...
#include "yggdrasil_decision_forests/dataset/data_spec.h"
#include "yggdrasil_decision_forests/dataset/data_spec.pb.h"
#include "yggdrasil_decision_forests/dataset/data_spec_inference.h"
#include "yggdrasil_decision_forests/dataset/vertical_dataset_io.h"
#include "yggdrasil_decision_forests/learner/learner_library.h"
#include "yggdrasil_decision_forests/metric/metric.h"
#include "yggdrasil_decision_forests/metric/report.h"
#include "yggdrasil_decision_forests/model/model_library.h"
#include "yggdrasil_decision_forests/utils/filesystem.h"
#include "yggdrasil_decision_forests/utils/logging.h"
#include "yggdrasil_decision_forests/serving/decision_forest/decision_forest.h"
#include <chrono>
namespace ygg = yggdrasil_decision_forests;
int main(int argc, char** argv) {
// Enable the logging. Optional in most cases.
InitLogging(argv[0], &argc, &argv, true);
// Import the model.
LOG(INFO) << "Import the model";
const std::string model_path = "/tmp/my_saved_model/1/assets";
std::unique_ptr<ygg::model::AbstractModel> model;
QCHECK_OK(ygg::model::LoadModel(model_path, &model));
// Show information about the model.
// Like :show_model, but without the list of compatible engines.
std::string model_description;
model->AppendDescriptionAndStatistics(/*full_definition=*/false,
&model_description);
LOG(INFO) << "Model:\n" << model_description;
auto start = std::chrono::high_resolution_clock::now();
// Compile the model for fast inference.
const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
model->BuildFastEngine().value();
const auto& features = serving_engine->features();
// Handle to two features.
const auto age_feature = features.GetNumericalFeatureId("age").value();
const auto sex_feature =
features.GetCategoricalFeatureId("sex").value();
// Allocate a batch of 1 example.
std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
serving_engine->AllocateExamples(1);
// Set all the values as missing. This is only necessary if you don't set all
// the feature values manually e.g. SetNumerical.
//examples->FillMissing(features);
// Set the value of "age" and "sex" for the first example.
examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
features);
// Run the prediction on the first example.
std::vector<float> batch_of_predictions;
serving_engine->Predict(*examples, 1, &batch_of_predictions);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
// To get the value of duration use the count()
// member function on the duration object
LOG(INFO) << duration.count();
LOG(INFO) << "Predictions:";
for (const float prediction : batch_of_predictions) {
LOG(INFO) << "\t" << prediction;
}
return 0;
}
Output:
[INFO beginner4.cc:31] Import the model
[INFO beginner4.cc:41] Model:
Type: "RANDOM_FOREST"
Task: CLASSIFICATION
Label: "__LABEL"
Input Features (2):
age
sex
No weights
Variable Importance: MEAN_MIN_DEPTH:
1. "__LABEL" 8.480250 ################
2. "sex" 1.313142 ##
3. "age" 0.000000
Variable Importance: NUM_AS_ROOT:
1. "age" 300.000000
Variable Importance: NUM_NODES:
1. "age" 34255.000000 ################
2. "sex" 1584.000000
Variable Importance: SUM_SCORE:
1. "age" 516361.208020 ################
2. "sex" 148174.885377
Winner take all: true
Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258
Number of trees: 300
Total number of nodes: 71978
Number of nodes by tree:
Count: 300 Average: 239.927 StdDev: 4.94044
Min: 223 Max: 251 Ignored: 0
----------------------------------------------
[ 223, 224) 1 0.33% 0.33%
[ 224, 225) 0 0.00% 0.33%
[ 225, 227) 1 0.33% 0.67%
[ 227, 228) 3 1.00% 1.67% #
[ 228, 230) 3 1.00% 2.67% #
[ 230, 231) 0 0.00% 2.67%
[ 231, 233) 9 3.00% 5.67% ##
[ 233, 234) 17 5.67% 11.33% ###
[ 234, 236) 33 11.00% 22.33% ######
[ 236, 237) 0 0.00% 22.33%
[ 237, 238) 28 9.33% 31.67% #####
[ 238, 240) 51 17.00% 48.67% ##########
[ 240, 241) 0 0.00% 48.67%
[ 241, 243) 46 15.33% 64.00% #########
[ 243, 244) 47 15.67% 79.67% #########
[ 244, 246) 34 11.33% 91.00% #######
[ 246, 247) 0 0.00% 91.00%
[ 247, 249) 13 4.33% 95.33% ###
[ 249, 250) 10 3.33% 98.67% ##
[ 250, 251] 4 1.33% 100.00% #
Depth by leafs:
Count: 36139 Average: 8.48037 StdDev: 2.34049
Min: 3 Max: 15 Ignored: 0
----------------------------------------------
[ 3, 4) 70 0.19% 0.19%
[ 4, 5) 662 1.83% 2.03% #
[ 5, 6) 3633 10.05% 12.08% ######
[ 6, 7) 3980 11.01% 23.09% #######
[ 7, 8) 4520 12.51% 35.60% ########
[ 8, 9) 5143 14.23% 49.83% #########
[ 9, 10) 5817 16.10% 65.93% ##########
[ 10, 11) 5152 14.26% 80.18% #########
[ 11, 12) 3519 9.74% 89.92% ######
[ 12, 13) 2003 5.54% 95.46% ###
[ 13, 14) 975 2.70% 98.16% ##
[ 14, 15) 483 1.34% 99.50% #
[ 15, 15] 182 0.50% 100.00%
Number of training obs by leaf:
Count: 36139 Average: 190.182 StdDev: 179.903
Min: 5 Max: 2370 Ignored: 0
----------------------------------------------
[ 5, 123) 14849 41.09% 41.09% ##########
[ 123, 241) 10680 29.55% 70.64% #######
[ 241, 359) 3917 10.84% 81.48% ###
[ 359, 478) 6234 17.25% 98.73% ####
[ 478, 596) 137 0.38% 99.11%
[ 596, 714) 7 0.02% 99.13%
[ 714, 833) 17 0.05% 99.18%
[ 833, 951) 19 0.05% 99.23%
[ 951, 1069) 6 0.02% 99.24%
[ 1069, 1188) 135 0.37% 99.62%
[ 1188, 1306) 33 0.09% 99.71%
[ 1306, 1424) 0 0.00% 99.71%
[ 1424, 1542) 0 0.00% 99.71%
[ 1542, 1661) 8 0.02% 99.73%
[ 1661, 1779) 53 0.15% 99.88%
[ 1779, 1897) 6 0.02% 99.89%
[ 1897, 2016) 0 0.00% 99.89%
[ 2016, 2134) 2 0.01% 99.90%
[ 2134, 2252) 28 0.08% 99.98%
[ 2252, 2370] 8 0.02% 100.00%
Attribute in nodes:
34255 : age [NUMERICAL]
1584 : sex [CATEGORICAL]
Attribute in nodes with depth <= 0:
300 : age [NUMERICAL]
Attribute in nodes with depth <= 1:
600 : age [NUMERICAL]
300 : sex [CATEGORICAL]
Attribute in nodes with depth <= 2:
1721 : age [NUMERICAL]
379 : sex [CATEGORICAL]
Attribute in nodes with depth <= 3:
3328 : age [NUMERICAL]
1102 : sex [CATEGORICAL]
Attribute in nodes with depth <= 5:
11208 : age [NUMERICAL]
1583 : sex [CATEGORICAL]
Condition type in nodes:
34255 : HigherCondition
1584 : ContainsBitmapCondition
Condition type in nodes with depth <= 0:
300 : HigherCondition
Condition type in nodes with depth <= 1:
600 : HigherCondition
300 : ContainsBitmapCondition
Condition type in nodes with depth <= 2:
1721 : HigherCondition
379 : ContainsBitmapCondition
Condition type in nodes with depth <= 3:
3328 : HigherCondition
1102 : ContainsBitmapCondition
Condition type in nodes with depth <= 5:
11208 : HigherCondition
1583 : ContainsBitmapCondition
Node format: BLOB_SEQUENCE
Training OOB:
trees: 1, Out-of-bag evaluation: accuracy:0.750237 logloss:9.00239
trees: 13, Out-of-bag evaluation: accuracy:0.754722 logloss:7.09704
trees: 23, Out-of-bag evaluation: accuracy:0.753863 logloss:6.3117
trees: 33, Out-of-bag evaluation: accuracy:0.75395 logloss:6.19856
trees: 43, Out-of-bag evaluation: accuracy:0.754299 logloss:6.0429
trees: 53, Out-of-bag evaluation: accuracy:0.753165 logloss:5.9747
trees: 63, Out-of-bag evaluation: accuracy:0.754867 logloss:5.96594
trees: 73, Out-of-bag evaluation: accuracy:0.75443 logloss:5.92934
trees: 83, Out-of-bag evaluation: accuracy:0.754954 logloss:5.92356
trees: 93, Out-of-bag evaluation: accuracy:0.756307 logloss:5.89293
trees: 103, Out-of-bag evaluation: accuracy:0.756569 logloss:5.89233
trees: 113, Out-of-bag evaluation: accuracy:0.755696 logloss:5.81216
trees: 123, Out-of-bag evaluation: accuracy:0.755653 logloss:5.81
trees: 133, Out-of-bag evaluation: accuracy:0.755347 logloss:5.80588
trees: 143, Out-of-bag evaluation: accuracy:0.755914 logloss:5.77847
trees: 153, Out-of-bag evaluation: accuracy:0.755783 logloss:5.77828
trees: 163, Out-of-bag evaluation: accuracy:0.755522 logloss:5.74301
trees: 173, Out-of-bag evaluation: accuracy:0.756264 logloss:5.73794
trees: 183, Out-of-bag evaluation: accuracy:0.756176 logloss:5.7384
trees: 193, Out-of-bag evaluation: accuracy:0.756831 logloss:5.73852
trees: 203, Out-of-bag evaluation: accuracy:0.756613 logloss:5.73565
trees: 213, Out-of-bag evaluation: accuracy:0.757268 logloss:5.73545
trees: 223, Out-of-bag evaluation: accuracy:0.757486 logloss:5.7286
trees: 233, Out-of-bag evaluation: accuracy:0.757093 logloss:5.72556
trees: 243, Out-of-bag evaluation: accuracy:0.757093 logloss:5.7087
trees: 253, Out-of-bag evaluation: accuracy:0.757224 logloss:5.70704
trees: 263, Out-of-bag evaluation: accuracy:0.757006 logloss:5.69758
trees: 273, Out-of-bag evaluation: accuracy:0.756831 logloss:5.69816
trees: 283, Out-of-bag evaluation: accuracy:0.756526 logloss:5.69813
trees: 293, Out-of-bag evaluation: accuracy:0.756526 logloss:5.68217
trees: 300, Out-of-bag evaluation: accuracy:0.756613 logloss:5.68258
[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1158] Engine "RandomForestOptPred" built
[INFO beginner4.cc:79] 18
[INFO beginner4.cc:81] Predictions:
[INFO beginner4.cc:83] 0.649999
Let's say I want to continuously read the inference input data from stdin - I only need to load the model and initialize the serving_engine once?
Every time stdin has new data, do I just need to run the following code to get a prediction? Does anything change when I switch to HTTP inference - anything to consider regarding thread safety - can I share the model and serving_engine across multiple threads?
Is there anything I should change when I don't need batching? Do I still need to use std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1)?
// Handle to two features.
const auto age_feature = features.GetNumericalFeatureId("age").value();
const auto sex_feature =
features.GetCategoricalFeatureId("sex").value();
// Allocate a batch of 1 example.
std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
serving_engine->AllocateExamples(1);
// Set all the values as missing. This is only necessary if you don't set all
// the feature values manually e.g. SetNumerical.
//examples->FillMissing(features);
// Set the value of "age" and "sex" for the first example.
examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
features);
// Run the prediction on the first example.
std::vector<float> batch_of_predictions;
serving_engine->Predict(*examples, 1, &batch_of_predictions);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
// To get the value of duration use the count()
// member function on the duration object
LOG(INFO) << duration.count();
LOG(INFO) << "Predictions:";
for (const float prediction : batch_of_predictions) {
LOG(INFO) << "\t" << prediction;
}
I also tried to use the C API here, but I always get Result Tensor: 1.0 - any idea why?
// gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main
#include <stdio.h>
#include <stdlib.h>
#include <tensorflow/c/c_api.h>
void NoOpDeallocator(void* data, size_t a, void* b) {}
int main() {
TF_Graph *Graph = TF_NewGraph();
TF_Status *Status = TF_NewStatus();
TF_SessionOptions *SessionOpts = TF_NewSessionOptions();
TF_Buffer *RunOpts = NULL;
TF_Library *library;
library = TF_LoadLibrary("/usr/local/lib/python3.10/dist-packages/tensorflow_decision_forests/tensorflow/ops/inference/inference.so",
Status);
const char *saved_model_dir = "/tmp/my_saved_model/1/";
const char *tags = "serve";
int ntags = 1;
TF_Session *Session = TF_LoadSessionFromSavedModel(
SessionOpts, RunOpts, saved_model_dir, &tags, ntags, Graph, NULL, Status);
printf("status: %s\n", TF_Message(Status));
if(TF_GetCode(Status) == TF_OK) {
printf("loaded\n");
}else{
printf("not loaded\n");
}
/* Get Input Tensor */
int NumInputs = 2;
TF_Output* Input = malloc(sizeof(TF_Output) * NumInputs);
TF_Output t0 = {TF_GraphOperationByName(Graph, "serving_default_age"), 0};
if(t0.oper == NULL)
printf("ERROR: Failed TF_GraphOperationByName serving_default_input_1\n");
else
printf("TF_GraphOperationByName serving_default_input_1 is OK\n");
TF_Output t1 = {TF_GraphOperationByName(Graph, "serving_default_sex"), 0};
if(t1.oper == NULL)
printf("ERROR: Failed TF_GraphOperationByName serving_default_input_2\n");
else
printf("TF_GraphOperationByName serving_default_input_2 is OK\n");
Input[0] = t0;
Input[1] = t1;
// Get Output tensor
int NumOutputs = 1;
TF_Output* Output = malloc(sizeof(TF_Output) * NumOutputs);
TF_Output tout = {TF_GraphOperationByName(Graph, "StatefulPartitionedCall_1"), 0};
if(tout.oper == NULL)
printf("ERROR: Failed TF_GraphOperationByName StatefulPartitionedCall\n");
else
printf("TF_GraphOperationByName StatefulPartitionedCall is OK\n");
Output[0] = tout;
/* Allocate data for inputs and outputs */
TF_Tensor** InputValues = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumInputs);
TF_Tensor** OutputValues = (TF_Tensor**)malloc(sizeof(TF_Tensor*)*NumOutputs);
int ndims = 1;
int64_t dims[] = {1};
int64_t data[] = {50};
int ndata = sizeof(int64_t);
TF_Tensor* int_tensor0 = TF_NewTensor(TF_INT64, dims, ndims, data, ndata, &NoOpDeallocator, 0);
if (int_tensor0 != NULL)
printf("TF_NewTensor is OK\n");
else
printf("ERROR: Failed TF_NewTensor\n");
const char test_string[] = "Male";
TF_TString tstr[1];
TF_TString_Init(&tstr[0]);
TF_TString_Copy(&tstr[0], test_string, sizeof(test_string)-1);
TF_Tensor* int_tensor1 = TF_NewTensor(TF_STRING, NULL, 0, &tstr[0], sizeof(tstr), &NoOpDeallocator, 0);
if (int_tensor1 != NULL)
printf("TF_NewTensor is OK\n");
else
printf("ERROR: Failed TF_NewTensor\n");
InputValues[0] = int_tensor0;
InputValues[1] = int_tensor1;
// Run the Session
TF_SessionRun(Session,
NULL, // Run options.
Input, InputValues, NumInputs, // Input tensors name, input tensor values, number of inputs.
Output, OutputValues, NumOutputs, // Output tensors name, output tensor values, number of outputs.
NULL, 0, // Target operations, number of targets.
NULL,
Status); // Output status.
if(TF_GetCode(Status) == TF_OK)
printf("Session is OK\n");
else
printf("%s",TF_Message(Status));
// Free memory
TF_DeleteGraph(Graph);
TF_DeleteSession(Session, Status);
TF_DeleteSessionOptions(SessionOpts);
TF_DeleteStatus(Status);
/* Get Output Result */
void* buff = TF_TensorData(OutputValues[0]);
float* offsets = (float*)buff;
printf("Result Tensor :\n");
printf("%f\n",offsets[0]);
return 0;
}
Output:
# gcc -I/usr/local/include -L/usr/local/lib main.c -ltensorflow -o main; ./main
2022-06-16 21:05:24.070748: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072148: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-06-16 21:05:24.072208: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: /tmp/my_saved_model/1/
2022-06-16 21:05:24.072280: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-16 21:05:24.085806: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-06-16 21:05:24.086985: I tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2022-06-16 21:05:24.116570: I tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /tmp/my_saved_model/1/
[INFO kernel.cc:1176] Loading model from path /tmp/my_saved_model/1/assets/ with prefix cf8326335a66430a
[INFO decision_forest.cc:639] Model loaded with 300 root(s), 71978 node(s), and 2 input feature(s).
[INFO abstract_model.cc:1246] Engine "RandomForestOptPred" built
[INFO kernel.cc:1022] Use fast generic engine
2022-06-16 21:05:24.373058: I tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 302321 microseconds.
status:
loaded
TF_GraphOperationByName serving_default_input_1 is OK
TF_GraphOperationByName serving_default_input_2 is OK
TF_GraphOperationByName StatefulPartitionedCall is OK
TF_NewTensor is OK
TF_NewTensor is OK
Session is OK
Result Tensor :
1.000000
Hi,
Let me answer the individual questions.
Let's say I want to continuously read the inference input data from stdin - I only need to load the model and initialize the serving_engine once?
Yes.
anything to consider regarding thread safety
The inference engines are thread safe.
However, the example buffers (the result of AllocateExamples) are not. You need to create one of them for each thread.
For example, here is the multi-threaded code used to evaluate a model.
In your case (inference from a stream of data), I would create a StreamProcessor or a logic based on Channels.
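For illustration, a minimal sketch of that pattern (one shared engine, one example set per thread), using only the serving calls from your snippet; std::thread and the "age"/"sex" feature names are just for the example:
#include <string>
#include <thread>
#include <vector>
// One shared engine, one example set per thread. Error handling omitted.
void PredictInWorker(ygg::serving::FastEngine& engine, float age,
                     const std::string& sex) {
  const auto& features = engine.features();
  const auto age_feature = features.GetNumericalFeatureId("age").value();
  const auto sex_feature = features.GetCategoricalFeatureId("sex").value();
  // Thread-local example buffer: never shared between threads.
  auto examples = engine.AllocateExamples(1);
  examples->SetNumerical(/*example_idx=*/0, age_feature, age, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, sex, features);
  std::vector<float> predictions;
  engine.Predict(*examples, 1, &predictions);
  LOG(INFO) << "prediction: " << predictions[0];
}
// Usage: the engine built once with BuildFastEngine() is shared by all workers.
// std::vector<std::thread> workers;
// for (int i = 0; i < 4; i++) {
//   workers.emplace_back(PredictInWorker, std::ref(*serving_engine), 50.f, "Male");
// }
// for (auto& w : workers) w.join();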
Can I share the model and serving_engine across multiple threads?
Yes.
Note that you don't need to keep the model once you have created the engine, i.e. you can release the model.
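For example (a short sketch of the above, using the same LoadModel/BuildFastEngine calls as in your code):
std::unique_ptr<ygg::model::AbstractModel> model;
QCHECK_OK(ygg::model::LoadModel(model_path, &model));
std::unique_ptr<ygg::serving::FastEngine> serving_engine =
    model->BuildFastEngine().value();
model.reset();  // The model is no longer needed; only the engine is kept.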
Is there anything I should change when I don't need batching?
Make sure to call AllocateExamples with enough examples for your batch.
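For example, a sketch of scoring a batch of N examples in a single Predict() call, reusing the age_feature/sex_feature handles and features from your snippet (the batch size and feature values are arbitrary):
const int num_examples = 32;
auto examples = serving_engine->AllocateExamples(num_examples);
examples->FillMissing(features);  // In case some features are not set below.
for (int i = 0; i < num_examples; i++) {
  examples->SetNumerical(/*example_idx=*/i, age_feature, 20.f + i, features);
  examples->SetCategorical(/*example_idx=*/i, sex_feature, "Male", features);
}
std::vector<float> predictions;
serving_engine->Predict(*examples, num_examples, &predictions);
// For a binary classification model like yours, predictions holds one score per example.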
Is there anything I should change when I don't need batching? Do I still need to use std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1)?
This is correct. Note that you can reuse your example set across different predictions (instead of reallocating it every time).
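A sketch of that reuse pattern for a stream of inputs such as your stdin loop (the input parsing is just a placeholder; needs #include <iostream> and <string>):
// Allocate once, reuse for every prediction.
auto examples = serving_engine->AllocateExamples(1);
std::vector<float> predictions;
float age;
std::string sex;
while (std::cin >> age >> sex) {  // Placeholder input loop.
  // Overwrite the feature values of example 0 with the new input.
  examples->SetNumerical(/*example_idx=*/0, age_feature, age, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, sex, features);
  serving_engine->Predict(*examples, 1, &predictions);
  LOG(INFO) << "prediction: " << predictions[0];
}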
I also tried to use the C API here, but I always get Result Tensor: 1.0 - any idea why?
The TF C API is a bit more complex. Let's try to figure out the solution with the YDF API and come back to the TF API if this does not work.
@achoum thanks for all the info - do you have any examples using StreamProcessor / Channels with the YDF C++ API?
I compared different ways to do model inference and checked the latencies:
For the simple model with 2 features - are those measurements expected? Yggdrasil Decision Forests is so much faster... does Yggdrasil's speed come from its fast engine, while TF Serving and the TF C API use a slow/generic engine?
TF-DF Python: ~ 50 ms
TF Serving: ~ 20 ms
TF C API: ~ 19 ms
Yggdrasil Decision Forests (YDF) C++: ~ 100 μs
The Yggdrasil code does not measure this part, because compiling the model for fast inference only needs to be done once per model:
const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
model->BuildFastEngine().value();
const auto& features = serving_engine->features();
Here is how I measure latency in Yggdrasil:
// Compile the model for fast inference.
const std::unique_ptr<ygg::serving::FastEngine> serving_engine =
model->BuildFastEngine().value();
const auto& features = serving_engine->features();
auto start = std::chrono::high_resolution_clock::now();
// Handle to two features.
const auto age_feature = features.GetNumericalFeatureId("age").value();
const auto sex_feature =
features.GetCategoricalFeatureId("sex").value();
// Allocate a batch of 1 example.
std::unique_ptr<ygg::serving::AbstractExampleSet> examples =
serving_engine->AllocateExamples(1);
// Set all the values as missing. This is only necessary if you don't set all
// the feature values manually e.g. SetNumerical.
//examples->FillMissing(features);
// Set the value of "age" and "sex" for the first example.
examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male",
features);
// Run the prediction on the first example.
std::vector<float> batch_of_predictions;
serving_engine->Predict(*examples, 1, &batch_of_predictions);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
// To get the value of duration use the count()
// member function on the duration object
LOG(INFO) << "duration: " << duration.count();
LOG(INFO) << "Predictions:";
for (const float prediction : batch_of_predictions) {
LOG(INFO) << "\t" << prediction;
}
I also checked the TF Serving code; it also uses tf.load_op_library(resource_loader.get_path_to_datafile("inference.so")).
Regarding the stream processor - can I use it as follows? Any HTTP server lib you recommend? Is there a good way to load the fast_engine directly in the stream processor and pass only the modelPath to it? If the model has been used before and a fast_engine already exists, there would be no need to recreate it.
std::unique_ptr<ygg::serving::FastEngine> initModelServingEngine(const std::string &modelPath) {
// Import the model.
LOG(INFO) << "Import the model";
const std::string model_path = modelPath;
std::unique_ptr<ygg::model::AbstractModel> model;
QCHECK_OK(ygg::model::LoadModel(model_path, &model));
// Show information about the model.
// Like :show_model, but without the list of compatible engines.
std::string model_description;
model->AppendDescriptionAndStatistics(/*full_definition=*/false,
&model_description);
LOG(INFO) << "Model:\n" << model_description;
// Compile the model for fast inference.
std::unique_ptr<ygg::serving::FastEngine> serving_engine = model->BuildFastEngine().value();
return serving_engine;
}
float predict(const std::unique_ptr<ygg::serving::FastEngine> &serving_engine, float age, const std::string &sex) {
const auto& features = serving_engine->features();
// Handle to two features.
const auto age_feature = features.GetNumericalFeatureId("age").value();
const auto sex_feature = features.GetCategoricalFeatureId("sex").value();
// Allocate a batch of 1 example.
std::unique_ptr<ygg::serving::AbstractExampleSet> examples = serving_engine->AllocateExamples(1);
// Set all the values as missing. This is only necessary if you don't set all
// the feature values manually e.g. SetNumerical.
//examples->FillMissing(features);
// Set the value of "age" and "sex" for the first example.
examples->SetNumerical(/*example_idx=*/0, age_feature, age, features);
examples->SetCategorical(/*example_idx=*/0, sex_feature, sex, features);
// Run the prediction on the first example.
std::vector<float> batch_of_predictions;
serving_engine->Predict(*examples, 1, &batch_of_predictions);
float prediction_res = 0.0;
LOG(INFO) << "Predictions:";
for (const float prediction : batch_of_predictions) {
LOG(INFO) << "\t" << prediction;
prediction_res = prediction;
}
return prediction_res;
}
struct features {
float age;
std::string sex;
std::unique_ptr<ygg::serving::FastEngine> fast_engine;
};
int main() {
// create 1 stream processor with multiple threads... in this case 5
features input;
input.age = 55.0f;
input.sex = "Male";
input.fast_engine = initModelServingEngine("/tmp/my_saved_model/1/assets"); // how to deal with multiple models - can I load them in StreamProcessor when they are needed?
using Features = std::unique_ptr<features>;
using Prediction = std::unique_ptr<float>;
StreamProcessor<Features, Prediction> processor(
    "MyPipe", 5, [](Features x) -> Prediction {
      return absl::make_unique<float>(predict(x->fast_engine, x->age, x->sex));
    });
processor.StartWorkers();
// when i receive a http request, i submit that data to the processor
// the processor runs the inference and returns the data back
// the returned data is available with processor.GetResult().value()
processor.Submit(absl::make_unique<features>(std::move(input)));
auto result = processor.GetResult().value();
// create http response with result
return 0;
}
thanks for all the info - do you have any examples using StreamProcessor / Channels with the YDF C++ API?
Happy to help :).
Here is an example.
Yggdrasil Decision Forests is so much faster... does Yggdrasil's speed come from its fast engine, while TF Serving and the TF C API use a slow/generic engine?
The inference of TF-DF models uses the Yggdrasil Decision Forests fast engine underneath. So, all 4 options are equivalent in this regard. However, as you observed, TensorFlow always adds a significant overhead, both for training and inference. AFAIK, the difference in speed comes from both the memory management and the amount of executed logic.
If you want fast inference, you need to use Yggdrasil Decision Forests directly.
100µs is already slow for a Yggdrasil Decision Forests model. An inference time of ~1µs is common in my work (of course, it depends on the number of features and the complexity of the model). If you care about inference speed, make sure to train a Gradient Boosted Trees model (instead of a Random Forest).
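For reference, a rough sketch of that switch with the YDF C++ training API; "my_label" and train_dataset are placeholders, and in TF-DF the equivalent is using tfdf.keras.GradientBoostedTreesModel instead of RandomForestModel:
// Configure a Gradient Boosted Trees learner instead of a Random Forest.
ygg::model::proto::TrainingConfig train_config;
train_config.set_learner("GRADIENT_BOOSTED_TREES");  // Instead of "RANDOM_FOREST".
train_config.set_task(ygg::model::proto::Task::CLASSIFICATION);
train_config.set_label("my_label");  // Placeholder label column.
std::unique_ptr<ygg::model::AbstractLearner> learner;
QCHECK_OK(ygg::model::GetLearner(train_config, &learner));
// train_dataset is an already-loaded ygg::dataset::VerticalDataset.
// Depending on the YDF version, training is Train() or TrainWithStatus().value().
std::unique_ptr<ygg::model::AbstractModel> model = learner->Train(train_dataset);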
The Yggdrasil code does not measure this part, because compiling the model for fast inference only needs to be done once per model:
Yes. The model compilation / creation of the inference engine should not be taken into account in the benchmark, as it is only done once. When you load a TF-DF model in TensorFlow, the engine is also created at loading time.
here is how I measure latency in Yggdrasil:
Note that there is a model benchmark tool available here. You can use it, or draw inspiration from it.
In your benchmark:
- Don't include "GetNumericalFeatureId" in the benchmark timing. Only do it once when loading the model.
- Don't include "AllocateExamples" in the benchmark timing. Only do it once per thread when loading the model.
- Only call "FillMissing" if you don't already set all the features manually.
- Don't create "std::vector<float> batch_of_predictions;" in the benchmark timing. Allocate it once per thread when loading the model.
- Make sure to repeat your experiment many times with a for-loop. Also make sure to have a warm-up (i.e. non-timed) set of inferences (see the sketch below).
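To make these points concrete, here is one possible shape for the timed loop (a sketch only, reusing serving_engine/features from your snippet; the warm-up and run counts are arbitrary):
// Handles, example set and prediction vector are created once, outside the timing.
const auto age_feature = features.GetNumericalFeatureId("age").value();
const auto sex_feature = features.GetCategoricalFeatureId("sex").value();
auto examples = serving_engine->AllocateExamples(1);
std::vector<float> predictions;
auto run_once = [&]() {
  examples->SetNumerical(/*example_idx=*/0, age_feature, 50.f, features);
  examples->SetCategorical(/*example_idx=*/0, sex_feature, "Male", features);
  serving_engine->Predict(*examples, 1, &predictions);
};
// Warm-up (not timed).
for (int i = 0; i < 1000; i++) run_once();
// Timed runs.
const int kNumRuns = 100000;
const auto start = std::chrono::steady_clock::now();
for (int i = 0; i < kNumRuns; i++) run_once();
const auto stop = std::chrono::steady_clock::now();
const auto total_us =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
LOG(INFO) << "Average latency: "
          << static_cast<double>(total_us) / kNumRuns << " us/prediction";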
regarding the stream processor - can I use it as follows?
It depends on how your HTTP server works and on the number of examples per request. Can you tell me more about your use case?
For example, if each request is executed in a separate thread (maybe from a pool of threads) and each request only has a few examples, you don't need a stream processor. However, if you have few concurrent requests and each request contains many examples, a stream processor makes sense. Note that you should not create a new processor for each request.
For some time, we have had plans to publish a basic http+grpc model inference server. Tell me if you are interested so we can prioritize it accordingly.
To be sure: what is the reason for using an HTTP server instead of running the model in process? Note that for offline inference, there are better solutions.
Cheers, M.
Hi @achoum
thanks for all the info.
It depends on how your HTTP server works and on the number of examples per request. Can you tell me more about your use case? To be sure: what is the reason for using an HTTP server instead of running the model in process? Note that for offline inference, there are better solutions.
I have strict latency requirements for running online model inference:
- I have 1-10 models to run per request
- 1-50 examples (1-50 predictions to run which use 1-10 models) per request
- max allowed latency to run all predictions: 20ms
- current tech stack is Go
There are 3 options I see:
- Option 1: use the Go TF library -> measured inference latency is very slow, so not an option
- Option 2: call yggdrasil-decision-forests C++ from Go
- Option 3: add a C++ model inference server (yggdrasil-decision-forests + http/grpc) for the Go service to communicate with over HTTP or gRPC
I tried Option 2 and ran into some issues - I would like to get your feedback - any idea what goes wrong here? https://github.com/Arnold1/yggdrasil-decision-forests
For some time, we have had plans to publish a basic http+grpc model inference server. Tell me if you are interested so we can prioritize it accordingly.
I would appreciate it if you already have something you can share. I think an HTTP server that calls the yggdrasil-decision-forests Predict C++ code would be enough to start with.
Hi Arnold1,
So, in the worst case scenario, you want 50*10 = 500 predictions in 20ms, i.e. roughly 40µs per prediction. This is likely possible for a GBT model without multi-threading (it depends on the input features though). For a Random Forest, you will likely need multi-threading or to limit the depth/number of trees.
Regarding your link, the error "INVALID_ARGUMENT: Unknown item RANDOM_FOREST in class pool N26yggdrasil_decision_forests5model13AbstractModelE. Registered elements are" (followed by "exit status 1") indicates that the Random Forest model is not linked. Make sure to add a dependency to this rule or to this rule.
Regarding Go, a fourth option is to wait for us to publicly release the Go port of Yggdrasil. This is a partial re-implementation in Go of the Yggdrasil inference code. Our benchmark indicates that it is ~2x slower.
Regarding the http+grpc model inference server, I don't have anything production ready available. This is the GRPC distribute manager; it is significantly more complex than what you need, but it shows how to create a gRPC server. Regarding the serialization format for examples, this is one possible generic solution. Note that example serialization/network/deserialization can take a significant amount of time.
I'll update this post when either the Go port is published or the http+grpc model inference server is available.
The Go port is now available :).
@achoum amazing work, thank you! I think we can close this issue then :)
Btw, let us know if the Go version matches your latency requirements. The Go version doesn't implement the fastest algorithm -- it would be much more complicated to implement (we welcome contributions btw; if you are interested we can send some pointers) -- but it should still be fast.
Plus, in Go it's easily parallelizable: you can run each model in a different goroutine, no issues.