
Add kernel_method support for C++ and CLI for slow engine

Open YueWan1 opened this issue 5 months ago • 4 comments


Draft PR: Add kernel_method Option to RandomForest in YDF with Fallback to Slow Engine

Motivation

This PR adds support for a new boolean configuration option kernel_method in the RandomForest learner of Yggdrasil Decision Forests (YDF). The motivation is to experiment with kernel-based decision tree ensemble behavior. Since this new method is incompatible with the fast inference engine, this PR also implements a fallback to the slow generic engine for prediction and evaluation when kernel_method=true.

About kernel_method

The kernel_method introduces a modified aggregation mechanism within the RandomForest algorithm, inspired by kernel-based ensemble approaches. Instead of computing normalized class distributions at each tree leaf during inference, this method:

  • Accumulates raw class counts: Each tree outputs unnormalized class count vectors from selected leaves
  • Global normalization: A single normalization step is applied after aggregating counts from all trees
  • Proportional weighting: Trees contribute based on their raw count magnitudes, similar to kernel-weighted averaging

This approach differs from standard Random Forest averaging by treating the ensemble more like a kernel method, where the final prediction considers the relative "confidence" (raw counts) of each tree's contribution.

Specifically:

  • During prediction, each tree outputs a raw (unnormalized) class count vector from the selected leaf.
  • These raw class count vectors are summed across all trees in the forest.
  • A single normalization step is then applied to the aggregated vector to produce a final class probability distribution.

This approach ensures that all trees contribute proportionally based on their raw counts, which can stabilize predictions and avoid over-normalization at the leaf level. It is particularly useful in applications requiring more globally coherent probabilistic output.
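To make this concrete, here is a minimal, self-contained C++ sketch of the aggregation step as described above. It is illustrative only: the function name and signature are ours and do not correspond to the actual YDF internals touched by this PR.

  #include <vector>

  // counts_per_tree[j] holds the raw (unnormalized) class counts of the
  // leaf selected by tree j for the example being scored.
  std::vector<double> AggregateKernelMethod(
      const std::vector<std::vector<double>>& counts_per_tree,
      int num_classes) {
    std::vector<double> accumulated(num_classes, 0.0);
    for (const auto& counts : counts_per_tree) {
      // Sum raw counts; no per-tree normalization.
      for (int c = 0; c < num_classes; ++c) accumulated[c] += counts[c];
    }
    double total = 0.0;
    for (double v : accumulated) total += v;
    // Single normalization step at the forest level.
    if (total > 0.0) {
      for (double& v : accumulated) v /= total;
    }
    return accumulated;  // Final class probability distribution.
  }

By contrast, the standard Random Forest path would first normalize each per-leaf count vector into a distribution and then average those distributions across trees.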

Theoretical Background

This implementation is inspired by the connection between Random Forests and kernel methods established by Scornet (2015) in "Random forests and kernel methods". The paper demonstrates that Random Forests can be rewritten as kernel estimates (KeRF - Kernel based on Random Forests), where predictions are computed as:

m̃(x) = Σᵢ YᵢK(x,Xᵢ) / Σⱼ K(x,Xⱼ)

where K(x,z) represents the probability that points x and z fall in the same leaf across the forest.
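For a finite forest of M trees, the paper estimates this connection function as the fraction of trees in which the two points share a leaf:

K_{M,n}(x,z) = (1/M) Σⱼ₌₁ᴹ 𝟙[z ∈ Aₙ(x, Θⱼ)]

where Aₙ(x, Θⱼ) is the leaf (cell) containing x in the j-th tree, and K_{M,n}(x,z) → K_n(x,z) = P_Θ[z ∈ Aₙ(x, Θ)] as M → ∞.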

Our kernel_method implements a simplified version of this concept by:

  • Accumulating raw class counts from each tree (similar to the kernel weighting concept)
  • Performing normalization once at the forest level (rather than at individual tree level)
  • Ensuring proportional contribution from all trees based on their raw outputs

Reference: Scornet, E. (2015). Random forests and kernel methods. arXiv preprint arXiv:1502.03836.

Relationship to Prior Work

While our implementation is inspired by Scornet's KeRF framework, it represents a practical approximation:

  • KeRF approach: Uses connection functions K(x,z) = P[x and z in same cell] for weighting
  • Our approach: Uses raw class counts as proxy weights for tree contributions
  • Advantage: Computationally simpler while capturing the spirit of kernel-based aggregation
  • Trade-off: Less theoretically rigorous than a full KeRF implementation, but more practical for large-scale applications
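The relationship between the two weightings can be made precise. In our notation (not the paper's), let cⱼ(x) be the raw class-count vector of the leaf reached by tree j, and |v| the sum of the entries of v. Then:

Standard RF:    p̂(x) = (1/M) Σⱼ cⱼ(x) / |cⱼ(x)|

kernel_method:  p̂(x) = Σⱼ cⱼ(x) / |Σⱼ cⱼ(x)| = Σⱼ wⱼ(x) · cⱼ(x)/|cⱼ(x)|,  with wⱼ(x) = |cⱼ(x)| / Σₖ |cₖ(x)|

That is, each tree's vote is implicitly weighted by the number of training examples in its selected leaf, which is the proportional-contribution property claimed above.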

Main Changes

Added functionality

  • Introduced kernel_method field to RandomForestHyperParameters proto and exposed it in CLI training config.

  • Modified core training, prediction, and evaluation logic to recognize and handle kernel_method.

  • Disabled the fast engine when kernel_method=true by bypassing the engine_or.ok() checks in Predict(), Evaluate(), and the show_model CLI (see the sketch below).
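A hedged sketch of this fallback pattern follows; BuildFastEngine(), engine_or.ok(), and the option name come from the PR, while kernel_method_enabled, RunWithFastEngine(), and RunWithSlowEngine() are placeholder names for illustration.

  // Inside Predict()/Evaluate(): prefer the fast engine unless the model
  // was trained with kernel_method=true.
  auto engine_or = model.BuildFastEngine();
  if (engine_or.ok() && !kernel_method_enabled) {
    // Normal path: serve with the compiled fast engine.
    RunWithFastEngine(*engine_or.value(), dataset, &predictions);
  } else {
    // kernel_method=true (or no compatible fast engine): fall back to the
    // slow generic engine, which implements the global normalization.
    RunWithSlowEngine(model, dataset, &predictions);
  }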

Added test

  • KernelMethodSlowEngineOnly: New unit test in random_forest_test.cc that sets kernel_method=true and verifies fallback to slow engine.

    • Checks accuracy: EXPECT_NEAR(metric::Accuracy(evaluation_), 0.850, 0.01)

    • Checks logloss: EXPECT_NEAR(metric::LogLoss(evaluation_), 0.350, 0.04)
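For reference, a test of this shape could look roughly as follows (gtest style; the fixture and helper names are placeholders for whatever random_forest_test.cc already provides, and only the two assertions are taken from the PR):

  TEST_F(RandomForestClassificationTest, KernelMethodSlowEngineOnly) {
    // Hypothetical helpers: enable the option, then train and evaluate,
    // which exercises the slow-engine fallback.
    SetKernelMethod(true);
    TrainAndEvaluateModel();
    EXPECT_NEAR(metric::Accuracy(evaluation_), 0.850, 0.01);
    EXPECT_NEAR(metric::LogLoss(evaluation_), 0.350, 0.04);
  }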


CLI Behavior Validation

All CLI binaries now work correctly with the --set_random_forest_kernel_method=true flag.

1. train

Example:

./train --dataset=... --model_dir=model_km --set_random_forest_kernel_method=true

Model successfully trained. Output includes kernel_method: true.

2. evaluate

./evaluate --dataset=... --model_dir=model_km

Automatically skips fast engine; fallback to slow evaluation confirmed.

3. predict

./predict --dataset=... --model_dir=model_km --output=...

Prediction uses slow engine, matches expectations, no crashes.

Additional CLI validation:

./show_model --model=model_km

Kernel method is displayed in printed configuration.


Verified Test Results

Scenario                                       Fast Engine Used   Accuracy   LogLoss
Baseline (default)                             Yes                ≈ 0.86     ≈ 0.33
kernel_method=true (fallback to slow engine)   No                 ≈ 0.85     ≈ 0.35

Test: KernelMethodSlowEngineOnly — passed successfully.


Current Limitation

  • The kernel method is currently only supported by the slow engine: when kernel_method=true, inference uses the slow generic engine and the fast engine is skipped. BuildFastEngine() itself is not modified; instead, fallback logic is added around its usage.


Next Steps

  • Add kernel_method support to the fast engine and finalize the overall integration once the slow-engine implementation is validated.

  • Consider promoting kernel_method support into a generic engine_preference interface.

  • Add kernel_method support to the Python API so that Python YDF can also expose it.


YueWan1 avatar Jun 09 '25 18:06 YueWan1

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar Jun 09 '25 18:06 google-cla[bot]

Hi, thank you for the PR, this looks very interesting and we'll start reviewing it soon.

Are there any publications associated with the PR or is this new research?

Could you please sign the CLA, so that we know we'll be able to eventually merge this?

rstz avatar Jun 12 '25 14:06 rstz

@YueWan1 Friendly Ping, are you still pursuing this PR? If so, could you please answer the questions in my previous message?

rstz avatar Jun 25 '25 07:06 rstz

Yes, I'm still pursuing this PR! Sorry, I was polishing it to a better version and forgot to reply. It implements an existing research method, which is now described in my updated introduction above.

YueWan1 avatar Jun 25 '25 19:06 YueWan1


Draft PR: Add kernel_method Option to RandomForest in YDF with Fallback to Slow Engine

Motivation

This PR adds support for a new boolean configuration option kernel_method in the RandomForest learner of Yggdrasil Decision Forests (YDF). The motivation is to experiment with kernel-based decision tree ensemble behavior. Since this new method is incompatible with the fast inference engine, this PR also implements a fallback to the slow generic engine for prediction and evaluation when kernel_method=true.

About kernel_method

The kernel_method introduces a modified aggregation mechanism within the RandomForest algorithm, inspired by kernel-based ensemble approaches. Instead of computing normalized class distributions at each tree leaf during inference, this method:

  • Accumulates raw class counts: Each tree outputs unnormalized class count vectors from selected leaves
  • Global normalization: A single normalization step is applied after aggregating counts from all trees
  • Proportional weighting: Trees contribute based on their raw count magnitudes, equivalent to kernel-weighted averaging

This approach differs from standard Random Forest averaging by treating the ensemble more like a kernel method, where the final prediction considers the relative "confidence" (raw counts) of each tree's contribution.

Specifically:

  • During prediction, each tree outputs a raw (unnormalized) class count vector from the selected leaf.
  • These raw class count vectors are summed across all trees in the forest.
  • A single normalization step is then applied to the aggregated vector to produce a final class probability distribution.

This approach ensures that all trees contribute proportionally based on their raw counts, which can stabilize predictions and avoid over-normalization at the leaf level. It is particularly useful in applications requiring more globally coherent probabilistic output.

Theoretical Background

This implementation is inspired by the connection between Random Forests and kernel methods established by Scornet (2015) in "Random forests and kernel methods". The paper demonstrates that Random Forests can be rewritten as kernel estimates (KeRF - Kernel based on Random Forests), where predictions are computed as:

m̃(x) = Σᵢ YᵢK(x,Xᵢ) / Σⱼ K(x,Xⱼ)

where K(x,z) represents the probability that points x and z fall in the same leaf across the forest.

Our kernel_method implements this concept by:

  • Accumulating raw class counts from each tree (similar to the kernel weighting concept)
  • Performing normalization once at the forest level (rather than at individual tree level)
  • Ensuring proportional contribution from all trees based on their raw outputs
  • As the number of trees M approaches infinity, the empirical kernel satisfies K_{M,n}(x,z) → K_n(x,z)

Reference: Scornet, E. (2015). Random forests and kernel methods. arXiv preprint arXiv:1502.03836.

Relationship to Prior Work

Our implementation is inspired by Scornet's KeRF framework and is a practical realization of it:

  • KeRF approach: Uses connection functions K(x,z) = P[x and z in same cell] for weighting
  • Our approach: Uses raw class counts as proxy weights for tree contributions
  • Advantage: Computationally accurate while using exactly the same function, with a more efficient implementation

Main Changes

Added functionality

  • Introduced kernel_method field to RandomForestHyperParameters proto and exposed it in CLI training config.

  • Modified core training, prediction, and evaluation logic to recognize and handle kernel_method.

  • Disabled the fast engine when kernel_method=true by bypassing the engine_or.ok() checks in Predict(), Evaluate(), and the show_model CLI.

Added test

  • KernelMethodSlowEngineOnly: New unit test in random_forest_test.cc that sets kernel_method=true and verifies fallback to slow engine.

    • Checks accuracy: EXPECT_NEAR(metric::Accuracy(evaluation_), 0.850, 0.01)

    • Checks logloss: EXPECT_NEAR(metric::LogLoss(evaluation_), 0.350, 0.04)


CLI Behavior Validation

All CLI binaries now work correctly with the --set_random_forest_kernel_method=true flag.

1. train

Example:

./train --dataset=... --model_dir=model_km --set_random_forest_kernel_method=true

Model successfully trained. Output includes kernel_method: true.

2. evaluate

./evaluate --dataset=... --model_dir=model_km

Automatically skips fast engine; fallback to slow evaluation confirmed.

3. predict

./predict --dataset=... --model_dir=model_km --output=...

Prediction uses slow engine, matches expectations, no crashes.

Additional CLI validation:

./show_model --model=model_km

Kernel method is displayed in printed configuration.


Verified Test Results

Scenario                                       Fast Engine Used   Accuracy   LogLoss
Baseline (default)                             Yes                ≈ 0.86     ≈ 0.33
kernel_method=true (fallback to slow engine)   No                 ≈ 0.85     ≈ 0.35

Test: KernelMethodSlowEngineOnly — passed successfully.


Current Limitation

  • The kernel method is currently only supported by the slow engine: when kernel_method=true, inference uses the slow generic engine and the fast engine is skipped. BuildFastEngine() itself is not modified; instead, fallback logic is added around its usage.


Next Steps

  • Add kernel_method support to the fast engine and finalize the overall integration once the slow-engine implementation is validated.

  • Consider promoting kernel_method support into a generic engine_preference interface.

  • Add kernel_method support to the Python API so that Python YDF can also expose it.


60697fd77229e0ded3bc350211ef10ba1bf20f82

R4M4NCHICK1 avatar Aug 05 '25 23:08 R4M4NCHICK1