decision-forests
Model Showing Extra Class Label
Hello,
When training my model, for some reason it comes up with an additional class. For example, I currently have the following classes: [1, 2, 3, 4, 5], but when analyzing the tree using plot_model it shows the following:
What is class 0, and why does it show up? Obviously 0% of the dataset has it, but why is it here in the first place? I also believe it shows up when outputting the summary of the model. Even when turning my y label array into a set, it gives the following:
Which comes from this snippet:
temp = list(zip(self.x_train, self.y_train))
random.shuffle(temp)
x_train, y_train = zip(*temp)
my_set = set(y_train)
print(my_set)
train_data = self.random_forest.make_tf_dataset(
np.array(x_train), np.array(y_train)
)
# print(len(list(train_data.as_numpy_iterator())))
self.model_6.fit(train_data, verbose=1)
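For what it's worth, the zip/shuffle/unzip pattern itself keeps each x row paired with its label, so the extra class shouldn't come from the shuffle. A minimal stand-alone sketch of that pattern, using made-up toy data (not your dataset):

```python
import random

# Toy data standing in for x_train / y_train (hypothetical values:
# x value 0.1 * k is paired with label k).
x_train = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y_train = [1, 2, 3, 4, 5]

# Pair, shuffle, and unpair -- same pattern as in the snippet above.
temp = list(zip(x_train, y_train))
random.shuffle(temp)
x_shuf, y_shuf = zip(*temp)

# Each shuffled x still carries its original label.
for x, y in zip(x_shuf, y_shuf):
    assert round(x[0] * 10) == y

# The set of labels is unchanged by shuffling -- no class 0 appears here.
print(sorted(set(y_shuf)))  # [1, 2, 3, 4, 5]
```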
Any idea what's going on?
@laneciar can you please tell me what your classes are and what loss function and metric you are using?
@Cheril311 My classes are what is in the picture above, {1, 2, 3, 4, 5}, each associated with an x row. As for the loss function and metric, it's just the default that the Random Forest model uses; I don't specify one.
@Cheril311 Here is some source code:
Random Forest: I use the rf_model for training and the second returned model for evaluating and predicting.
def create_single_model(self):
input_features = tf.keras.Input(shape=(self.num_features,))
# preprocessor = tf.keras.layers.Dense(self.num_features, activation=tf.nn.relu6)
# preprocess_features = preprocessor(input_features)
rf_model_1 = tfdf.keras.RandomForestModel(
verbose=1,
task=tfdf.keras.Task.CLASSIFICATION,
num_trees=self.num_of_trees,
max_depth=32,
# hyperparameter_template="benchmark_rank1@v1",
# bootstrap_size_ratio=1.0, # Optimal at 1 0.6470000147819519
categorical_algorithm="CART", # CART and RANDOM provide same accuracy 0.6470000147819519
growing_strategy="LOCAL", # LOCAL significantly better 0.6470000147819519
# honest=False, # honest True is slightly better 0.6470000147819519
# max_depth=5, # Caps at 32 slightly better 0.6480000019073486
# min_examples=5, # Best at 5 0.6480000019073486
# missing_value_policy="LOCAL_IMPUTATION", # No change .6480000019073486
sorting_strategy="PRESORT", # No change .6480000019073486
sparse_oblique_normalization="MIN_MAX", # Significantly helps 0.6850000023841858
# sparse_oblique_num_projections_exponent=2.0, # Crashes when above 2
# sparse_oblique_weights="BINARY", # Slightly better
split_axis="SPARSE_OBLIQUE", # Slightly better
# winner_take_all=True, # Slightly better
)
out = rf_model_1(input_features)
model = tf.keras.models.Model(input_features, out)
return rf_model_1, model
Training: I shuffle the x and y data so the labels are mixed up and not in order.
tf.keras.utils.plot_model(
self.single_model,
to_file="./info/single_arch/model_test.png",
show_shapes=True,
show_layer_names=True,
)
temp = list(zip(self.x_train, self.y_train))
random.shuffle(temp)
x_train, y_train = zip(*temp)
train_data = self.random_forest.make_tf_dataset(
np.array(x_train), np.array(y_train)
)
# print(len(list(train_data.as_numpy_iterator())))
self.model_6.fit(train_data, verbose=1)
self.model_6.compile(["accuracy"])
validation_data = self.random_forest.make_tf_dataset(self.x_test, self.y_test)
evaluation_df6_only = self.model_6.evaluate(validation_data, return_dict=True)
with open("./info/single_arch/model_6.html", "w") as f:
f.write(
tfdf.model_plotter.plot_model(self.model_6, tree_idx=0, max_depth=10)
)
print("Accuracy (D6 only): ", evaluation_df6_only["accuracy"])
Hope this helps, let me know if you want anything else.
Hi @laneciar,
This class 0 is an artifact of the way classes are handled internally: it represents the out-of-vocabulary values. However, since out-of-vocabulary values are not permitted for labels, its probability is always 0.
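To illustrate the idea (a simplified sketch of reserving an out-of-vocabulary slot, not TF-DF's actual implementation), a label vocabulary that keeps index 0 for OOV makes a 5-class problem report 6 internal classes, with class 0 never receiving any examples:

```python
# Simplified sketch: reserve internal index 0 for out-of-vocabulary (OOV)
# labels. This mimics the behavior described above; it is not TF-DF code.
OOV_INDEX = 0

def build_label_vocab(labels):
    """Map each distinct label to an internal index, reserving 0 for OOV."""
    return {label: i + 1 for i, label in enumerate(sorted(set(labels)))}

labels = [1, 2, 3, 4, 5]
vocab = build_label_vocab(labels)
print(vocab)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}

# The internal class count includes the reserved OOV slot...
num_internal_classes = len(vocab) + 1
print(num_internal_classes)  # 6

# ...but no training example ever maps to it, so class 0 stays at 0%.
counts = [0] * num_internal_classes
for label in labels:
    counts[vocab[label]] += 1
print(counts[OOV_INDEX])  # 0
```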
Thanks for the heads-up. We will resolve it :).