yggdrasil-decision-forests
Add kernel_method support for C++ and CLI for the slow engine
Draft PR: Add kernel_method Option to RandomForest in YDF with Fallback to Slow Engine
Motivation
This PR adds support for a new boolean configuration option kernel_method in the RandomForest learner of Yggdrasil Decision Forests (YDF). The motivation is to experiment with kernel-based decision tree ensemble behavior. Since this new method is incompatible with the fast inference engine, this PR also implements a fallback to the slow generic engine for prediction and evaluation when kernel_method=true.
About kernel_method
The kernel_method introduces a modified aggregation mechanism within the RandomForest algorithm, inspired by kernel-based ensemble approaches. Instead of computing normalized class distributions at each tree leaf during inference, this method:
- Accumulates raw class counts: Each tree outputs unnormalized class count vectors from selected leaves
- Global normalization: A single normalization step is applied after aggregating counts from all trees
- Proportional weighting: Trees contribute based on their raw count magnitudes, similar to kernel-weighted averaging
This approach differs from standard Random Forest averaging by treating the ensemble more like a kernel method, where the final prediction considers the relative "confidence" (raw counts) of each tree's contribution.
Specifically:
- During prediction, each tree outputs a raw (unnormalized) class count vector from the selected leaf.
- These raw class count vectors are summed across all trees in the forest.
- A single normalization step is then applied to the aggregated vector to produce a final class probability distribution.
This approach ensures that all trees contribute in proportion to their raw counts, so leaves backed by more training examples carry more weight; this can stabilize predictions and avoids over-normalization at the leaf level. It is particularly useful in applications that require a globally coherent probability distribution over classes.
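To make the mechanics concrete, here is a minimal, self-contained sketch of this aggregation in isolation. The function and type names are hypothetical illustrations, not the PR's actual code:

```cpp
#include <vector>

// Raw (unnormalized) class counts from the leaf an example reaches.
using ClassCounts = std::vector<double>;

// Sketch of the kernel_method aggregation: sum raw leaf counts over all
// trees, then normalize once at the forest level. Assumes each counts
// vector has num_classes entries.
std::vector<double> KernelMethodAggregate(
    const std::vector<ClassCounts>& per_tree_leaf_counts, int num_classes) {
  std::vector<double> accumulated(num_classes, 0.0);
  for (const ClassCounts& counts : per_tree_leaf_counts) {
    for (int c = 0; c < num_classes; ++c) accumulated[c] += counts[c];
  }
  double total = 0.0;
  for (double v : accumulated) total += v;
  if (total > 0.0) {
    for (double& v : accumulated) v /= total;  // Single global normalization.
  }
  return accumulated;  // Final class probability distribution.
}
```

By contrast, the standard Random Forest path normalizes each tree's leaf counts into a distribution first and then averages the per-tree distributions.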
Theoretical Background
This implementation is inspired by the connection between Random Forests and kernel methods established by Scornet (2015) in "Random forests and kernel methods". The paper demonstrates that Random Forests can be rewritten as kernel estimates (KeRF - Kernel based on Random Forests), where predictions are computed as:
m̃(x) = Σᵢ YᵢK(x,Xᵢ) / Σⱼ K(x,Xⱼ)
where K(x,z) represents the probability that points x and z fall in the same leaf across the forest.
Our kernel_method implements a simplified version of this concept by:
- Accumulating raw class counts from each tree (similar to the kernel weighting concept)
- Performing normalization once at the forest level (rather than at the individual tree level)
- Ensuring proportional contribution from all trees based on their raw outputs
- Approaching the KeRF limit as the forest grows: as the number of trees M increases, the induced kernel K_{M,n}(x,z) converges to the infinite-forest kernel K_n(x,z) (see the formula below)
Reference: Scornet, E. (2015). Random forests and kernel methods. arXiv preprint arXiv:1502.03836.
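For reference, the finite-forest connection function in Scornet (2015) is the fraction of trees in which two points share a cell. A paraphrase in the paper's notation, where A_n(x, Θ_j) denotes the cell of the tree built with randomness Θ_j that contains x:

```latex
K_{M,n}(x, z) \;=\; \frac{1}{M} \sum_{j=1}^{M}
  \mathbf{1}\!\left[\, z \in A_n(x, \Theta_j) \,\right],
\qquad
K_{M,n}(x, z) \xrightarrow[M \to \infty]{\text{a.s.}} K_n(x, z)
```

The convergence follows from the law of large numbers over the independent tree randomizations Θ_j.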
Relationship to Prior Work
While our implementation is inspired by Scornet's KeRF framework, it represents a practical approximation:
- KeRF approach: Uses connection functions K(x,z) = P[x and z in same cell] for weighting
- Our approach: Uses raw class counts as proxy weights for tree contributions
- Advantage: Computationally simpler while capturing the spirit of kernel-based aggregation
- Trade-off: Less theoretically rigorous than full KeRF implementation but more practical for large-scale applications
Main Changes
Added functionality
- Introduced a `kernel_method` field in the `RandomForestHyperParameters` proto and exposed it in the CLI training config.
- Modified core training, prediction, and evaluation logic to recognize and handle `kernel_method`.
- Disabled the fast engine when `kernel_method=true` by bypassing the `engine_or.ok()` checks in `Predict()`, `Evaluate()`, and the `show_model` CLI (see the sketch below).
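Roughly, the fallback takes the following shape. This is an illustrative sketch, not the exact PR diff; `PredictWithFallback` and the `Model` template parameter are hypothetical stand-ins for the real YDF types:

```cpp
#include <utility>

// Sketch of the fallback added around BuildFastEngine(), which returns a
// StatusOr-like object. The fast path is kept unchanged; the new flag
// simply routes kernel_method models to the generic engine.
template <typename Model>
void PredictWithFallback(const Model& model, bool kernel_method_enabled) {
  auto engine_or = model.BuildFastEngine();
  if (!kernel_method_enabled && engine_or.ok()) {
    // Fast path unchanged: serve with the compiled engine.
    auto engine = std::move(engine_or).value();
    // ... run batch predictions with `engine` ...
  } else {
    // Fallback: generic (slow) engine. kernel_method requires this path
    // because normalization happens once at the forest level.
    // ... run per-example predictions through the generic model API ...
  }
}
```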
Added test
- `KernelMethodSlowEngineOnly`: new unit test in `random_forest_test.cc` that sets `kernel_method=true` and verifies the fallback to the slow engine.
  - Checks accuracy: `EXPECT_NEAR(metric::Accuracy(evaluation_), 0.850, 0.01)`
  - Checks logloss: `EXPECT_NEAR(metric::LogLoss(evaluation_), 0.350, 0.04)`
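For context, a sketch of the shape such a test might take, following the existing fixtures in `random_forest_test.cc`. Everything here except the two quoted assertions is an assumption: `set_kernel_method` is the PR's new proto field, and the fixture and helper names are hypothetical:

```cpp
// Hypothetical sketch; only the EXPECT_NEAR assertions mirror the PR.
TEST_F(RandomForestOnAdult, KernelMethodSlowEngineOnly) {
  // Enable the new hyper-parameter before training (assumed wiring).
  auto* rf_config = train_config_.MutableExtension(
      random_forest::proto::random_forest_config);
  rf_config->set_kernel_method(true);

  TrainAndEvaluateModel();

  // The slow-engine evaluation should land close to the baseline metrics.
  EXPECT_NEAR(metric::Accuracy(evaluation_), 0.850, 0.01);
  EXPECT_NEAR(metric::LogLoss(evaluation_), 0.350, 0.04);
}
```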
CLI Behavior Validation
All CLI binaries now work correctly with the --set_random_forest_kernel_method=true flag.
1. train
Example:
./train --dataset=... --model_dir=model_km --set_random_forest_kernel_method=true
Model successfully trained. Output includes kernel_method: true.
2. evaluate
./evaluate --dataset=... --model_dir=model_km
Automatically skips fast engine; fallback to slow evaluation confirmed.
3. predict
./predict --dataset=... --model_dir=model_km --output=...
Prediction uses slow engine, matches expectations, no crashes.
Additional CLI validation:
./show_model --model=model_km
Kernel method is displayed in printed configuration.
Verified Test Results
| Scenario | Fast Engine Used | Accuracy | LogLoss |
|---|---|---|---|
| Baseline (default) | Yes | ≈ 0.86 | ≈ 0.33 |
| kernel_method=true | No (fallback to slow engine) | ≈ 0.85 | ≈ 0.35 |
Test: KernelMethodSlowEngineOnly — passed successfully.
Current Limitation
- The kernel method is only supported by the slow engine for now: when `kernel_method=true`, only the slow generic engine is used, without fast engine support. `BuildFastEngine()` is not modified; instead, the fallback logic is added around its usage.
Next Steps
- Add `kernel_method` to the fast engine and wrap up the overall integration once the slow-engine path is validated.
- Consider promoting `kernel_method` support into a generic `engine_preference` interface.
- Add `kernel_method` support to the Python bindings so that Python YDF can use it as well.
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
Hi, thank you for the PR, this looks very interesting and we'll start reviewing it soon.
Are there any publications associated with the PR or is this new research?
Could you please sign the CLA, so that we know we'll be able to eventually merge this?
@YueWan1 Friendly Ping, are you still pursuing this PR? If so, could you please answer the questions in my previous message?
Yes, I'm still pursuing this PR! Sorry, I was polishing it to a better version and forgot to reply. It implements an existing research method, which is introduced in my updated intro docs.