decision-forests

Decision path through node thresholds

Open tlapusan opened this issue 2 years ago • 11 comments

Hi,

I would like to ask you about the decision node path in case the node has a categorical or numerical threshold.

What I observed is that when I have a categorical node and the threshold condition is met, the path goes to the left. (images: cat_1, cat_2)

If the threshold is numerical and the condition is met, the path goes to the right. (images: num_1, num_2)

Is this the intended behavior?

tlapusan avatar May 27 '22 07:05 tlapusan
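
For readers following the question above, here is a minimal sketch of a decision-path walk in which a met condition always leads to the same child, whether the condition is categorical or numerical. Every class and field name here (`Leaf`, `Node`, `NumericalHigherThan`, `CategoricalIsIn`, `pos_child`, `neg_child`, `decision_path`) is hypothetical and written for illustration only; this is not TF-DF's implementation, and whether the "met" branch is drawn on the left or on the right is left to the plotting code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: str


@dataclass
class NumericalHigherThan:
    """Condition of the form: row[feature] >= threshold."""
    feature: str
    threshold: float

    def evaluate(self, row: dict) -> bool:
        return row[self.feature] >= self.threshold


@dataclass
class CategoricalIsIn:
    """Condition of the form: row[feature] in mask."""
    feature: str
    mask: frozenset

    def evaluate(self, row: dict) -> bool:
        return row[self.feature] in self.mask


@dataclass
class Node:
    condition: Union[NumericalHigherThan, CategoricalIsIn]
    pos_child: Union["Node", Leaf]  # taken when the condition is met
    neg_child: Union["Node", Leaf]  # taken otherwise


def decision_path(node, row):
    """Return the (feature, condition_met) steps and the leaf value reached."""
    path = []
    while isinstance(node, Node):
        met = node.condition.evaluate(row)
        path.append((node.condition.feature, met))
        node = node.pos_child if met else node.neg_child
    return path, node.value


# Hypothetical two-level tree: a categorical test, then a numerical test.
tree = Node(
    CategoricalIsIn("sex", frozenset({"female"})),
    pos_child=Node(NumericalHigherThan("age", 30.0), Leaf("A"), Leaf("B")),
    neg_child=Leaf("C"),
)

path_a, leaf_a = decision_path(tree, {"sex": "female", "age": 42.0})
path_c, leaf_c = decision_path(tree, {"sex": "male", "age": 42.0})
```

With a single rule like this, a viewer only needs one convention for which side the positive child is drawn on, so categorical and numerical nodes cannot end up pointing in opposite directions.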

Hi, this doesn't look correct to me. Do you have a repro for this?

rstz avatar May 30 '22 09:05 rstz

Hi @rstz,

You can check this colab notebook: https://colab.research.google.com/drive/1XvsafToHzDQVR5BOOKEVCxsny_P1pRBU?usp=sharing

tlapusan avatar May 31 '22 07:05 tlapusan

Thank you, I'll have a look

rstz avatar May 31 '22 08:05 rstz

Minor update: This looks like a bug in the way integerized values are treated by TF-DF. In your example Sex_label == 1 corresponds to "male" in the dataset, but it corresponds to "female" when inspecting (and drawing) the graph. We are working on a fix.

Small aside: TF-DF supports string categories, so there is no need to convert strings to integers.

rstz avatar May 31 '22 13:05 rstz
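
The kind of mismatch described above can be illustrated with a small sketch: the same integer decodes to different labels when the preprocessing step and the plotting step use different dictionaries. Both dictionaries below are hypothetical, and neither is claimed to be the one TF-DF builds internally; the helper names `build_dictionary` and `invert` are made up for this example.

```python
def build_dictionary(labels):
    """Assign consecutive integers to labels in the order given."""
    return {label: index for index, label in enumerate(labels)}


def invert(dictionary):
    """Map integers back to labels."""
    return {index: label for label, index in dictionary.items()}


# Hypothetical dictionary used when the dataset was integerized (alphabetical).
preprocessing_dictionary = build_dictionary(["female", "male"])

# Hypothetical dictionary used when the tree is inspected or drawn.
plotting_dictionary = build_dictionary(["male", "female"])

encoded = preprocessing_dictionary["male"]      # integer stored in the dataset
decoded = invert(plotting_dictionary)[encoded]  # label shown on the plot

# Decoding with the dictionary that produced the integer recovers the label.
round_trip = invert(preprocessing_dictionary)[encoded]
```

Keeping the feature as strings avoids the problem, since no second dictionary is needed to read the plot.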

good to know about string categories, I had the data preprocessing step from other model libraries and used it as it is :)

tlapusan avatar May 31 '22 14:05 tlapusan

@rstz @tlapusan Is there any disadvantage to using TF-DF string categories compared to encoding the string categories with Target Encoding or One-Hot Encoding? When, if ever, is Target Encoding or One-Hot Encoding useful with decision trees, and does it have any benefits? What happens if I have two similar websites, www.cnn.com and cnn.com: would TF-DF map them to the same category? What happens if the number of categories is high?

How does TF-DF support string categories internally? How does Yggdrasil Random Forests handle a large number of high-cardinality categorical variables?

Arnold1 avatar Jun 18 '22 23:06 Arnold1

Is there any disadvantage to using TF-DF string categories compared to converting the string categories into an integer using Target Encoding or One-Hot Encoding?

Generally, using one-hot encoding with decision forests makes the model larger and less accurate than the other options. For this reason, it is not recommended.

From experience, and depending on the dataset, target encoding is complementary to CART or RANDOM categorical splits.

What happens if I have two similar websites, www.cnn.com and cnn.com: would TF-DF map them to the same category?

Both the CART and RANDOM splitters learn conditions of the type "attribute in mask". If this is supported by the dataset, they can learn splits such as "site in [www.cnn.com, cnn.com]". On the other hand, one-hot encoding can only check one categorical value at a time.

What happens if the number of categories is high?

There is a risk of overfitting. In this case, RANDOM categorical splits, target propagation, regularization or more advanced techniques are required.

How does TF-DF support string categories internally?

It depends on the semantic of the attribute. Here is the list of the supported semantics. For a categorical attribute, as mentioned above, three splitter algorithms are available: CART, RANDOM and ONE_HOT. For a categorical set attribute (e.g. a bag of words), another algorithm is used.

achoum avatar Jul 04 '22 16:07 achoum
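
The difference between an "attribute in mask" condition and a one-hot test, as described in the reply above, can be sketched in plain Python. This is only an illustration: the function names (`in_mask_split`, `one_hot_encode`, `one_hot_split`) and the third site, `example.org`, are hypothetical, and the code does not show how TF-DF learns or stores its splits.

```python
def in_mask_split(value, mask):
    """Set-membership condition: true when the value is any member of the mask."""
    return value in mask


def one_hot_encode(value, vocabulary):
    """Expand one categorical value into one 0/1 column per vocabulary entry."""
    return {f"site={v}": int(value == v) for v in vocabulary}


def one_hot_split(encoded, column):
    """A split on a one-hot column tests exactly one category."""
    return encoded[column] == 1


vocabulary = ["www.cnn.com", "cnn.com", "example.org"]
mask = frozenset({"www.cnn.com", "cnn.com"})
sites = ["www.cnn.com", "cnn.com", "example.org"]

# One mask condition groups both spellings of the site.
mask_results = [in_mask_split(s, mask) for s in sites]

# One one-hot condition only isolates a single spelling.
one_hot_results = [
    one_hot_split(one_hot_encode(s, vocabulary), "site=cnn.com") for s in sites
]
```

In this sketch the two strings remain distinct values; they are grouped only because the mask contains both. Reaching the same grouping with one-hot columns would take two separate tests, one per column.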

Hi @rstz, do you know when this issue will be fixed?

tlapusan avatar Aug 03 '22 07:08 tlapusan

@rstz Could you point me to how I can find the integer values that TF-DF associates with a categorical feature? I'm working on integrating TF-DF into the https://github.com/parrt/dtreeviz library for visualisations and I need the integer values.

Thanks.

tlapusan avatar Aug 12 '22 07:08 tlapusan

Hi, apologies for not responding, I missed that email.

We don't have a fix yet, but this is very high on our Todo list.

rstz avatar Aug 18 '22 11:08 rstz

sounds good @rstz , thanks

tlapusan avatar Aug 18 '22 18:08 tlapusan

I believe this is solved since dtreeviz support has now landed :)

rstz avatar Apr 06 '23 10:04 rstz

@rstz there is dtreeviz now?

Arnold1 avatar Apr 07 '23 16:04 Arnold1

https://www.tensorflow.org/decision_forests/tutorials/dtreeviz_colab is a tutorial for using dtreeviz with TF-DF

rstz avatar Apr 07 '23 16:04 rstz