
Model Showing Extra Class Label

Open laneciar opened this issue 2 years ago • 4 comments

Hello,

When training my model, for some reason it is coming up with an additional class. For example, I currently have the following classes: [1, 2, 3, 4, 5], but when analyzing the tree using plot_model it shows the following:

[image: plot_model output showing the classes, including an extra class 0 covering 0% of the dataset]

What is class 0 and why does this show up? Obviously 0% of the dataset has it, but why is it here in the first place? I also believe it shows up when outputting the summary of the model. Even when turning my y label array into a set, it gives the following:

[image: printed label set {1, 2, 3, 4, 5}]

Which comes from this snippet:

        temp = list(zip(self.x_train, self.y_train))
        random.shuffle(temp)
        x_train, y_train = zip(*temp)
        my_set = set(y_train)
        print(my_set)
        train_data = self.random_forest.make_tf_dataset(
            np.array(x_train), np.array(y_train)
        )

        # print(len(list(train_data.as_numpy_iterator())))
        self.model_6.fit(train_data, verbose=1)

Any idea what's going on?

laneciar avatar Apr 23 '22 02:04 laneciar

@laneciar can you please tell me what your classes are and what loss function and metric are you using?

Cheril311 avatar Apr 25 '22 06:04 Cheril311

@Cheril311 My classes are what is shown in the picture above: {1, 2, 3, 4, 5}, each associated with an x row. As for the loss function and metric, it's just the default that the Random Forest model uses; I don't specify one.

laneciar avatar Apr 26 '22 04:04 laneciar

@Cheril311 Here is some source code:

Random Forest: I use the rf_model for training and the second returned model for evaluating and predicting.

    def create_single_model(self):
        input_features = tf.keras.Input(shape=(self.num_features,))

        # preprocessor = tf.keras.layers.Dense(self.num_features, activation=tf.nn.relu6)
        # preprocess_features = preprocessor(input_features)

        rf_model_1 = tfdf.keras.RandomForestModel(
            verbose=1,
            task=tfdf.keras.Task.CLASSIFICATION,
            num_trees=self.num_of_trees,
            max_depth=32,
            # hyperparameter_template="benchmark_rank1@v1",
            # bootstrap_size_ratio=1.0,  # Optimal at 1  0.6470000147819519
            categorical_algorithm="CART",  # CART and RANDOM provide same accuracy 0.6470000147819519
            growing_strategy="LOCAL",  # LOCAL significantly better 0.6470000147819519
            # honest=False,  # honest True is slightly better 0.6470000147819519
            # max_depth=5,  # Caps at 32 slightly better  0.6480000019073486
            # min_examples=5,  # Best at 5  0.6480000019073486
            # missing_value_policy="LOCAL_IMPUTATION",  # No change .6480000019073486
            sorting_strategy="PRESORT",  # No change .6480000019073486
            sparse_oblique_normalization="MIN_MAX",  # Significantly helps 0.6850000023841858
            # sparse_oblique_num_projections_exponent=2.0,  # Crashes when above 2
            # sparse_oblique_weights="BINARY",  # Slightly better
            split_axis="SPARSE_OBLIQUE",  # Slightly better
            # winner_take_all=True,  # Slightly better
        )
        out = rf_model_1(input_features)

        model = tf.keras.models.Model(input_features, out)

        return rf_model_1, model

Training: I shuffle the x and y data so the labels are mixed up and not in order.

        tf.keras.utils.plot_model(
            self.single_model,
            to_file="./info/single_arch/model_test.png",
            show_shapes=True,
            show_layer_names=True,
        )

        temp = list(zip(self.x_train, self.y_train))
        random.shuffle(temp)
        x_train, y_train = zip(*temp)
        train_data = self.random_forest.make_tf_dataset(
            np.array(x_train), np.array(y_train)
        )

        # print(len(list(train_data.as_numpy_iterator())))
        self.model_6.fit(train_data, verbose=1)

        self.model_6.compile(["accuracy"])
        validation_data = self.random_forest.make_tf_dataset(self.x_test, self.y_test)
        evaluation_df6_only = self.model_6.evaluate(validation_data, return_dict=True)

        with open("./info/single_arch/model_6.html", "w") as f:
            f.write(
                tfdf.model_plotter.plot_model(self.model_6, tree_idx=0, max_depth=10)
            )
        print("Accuracy (D6 only): ", evaluation_df6_only["accuracy"])

Hope this helps, let me know if you want anything else.

laneciar avatar Apr 26 '22 04:04 laneciar

Hi laneciar,

Class 0 is an artifact of the way classes are handled internally: it represents the out-of-vocabulary values. However, since out-of-vocabulary values are not permitted for labels, that class always covers 0% of the dataset.
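
For reference, here is a minimal sketch of the situation. It is not taken from this thread: the toy data, feature name, output path, and tree count below are made up. It trains a RandomForestModel on integer labels 1..5 and then prints the model summary and plots a tree, which is where the extra class shows up.

    import numpy as np
    import tensorflow as tf
    import tensorflow_decision_forests as tfdf

    # Hypothetical toy data: 4 numeric features, integer labels 1..5 as in the issue.
    x = np.random.rand(200, 4).astype(np.float32)
    y = np.random.randint(1, 6, size=200).astype(np.int32)
    train_ds = tf.data.Dataset.from_tensor_slices(({"features": x}, y)).batch(64)

    model = tfdf.keras.RandomForestModel(
        task=tfdf.keras.Task.CLASSIFICATION, num_trees=10
    )
    model.fit(train_ds)

    # The summary and the plotted tree are where the extra class 0 reported in this
    # issue appears alongside the real classes 1..5.
    model.summary()
    with open("tree.html", "w") as f:
        f.write(tfdf.model_plotter.plot_model(model, tree_idx=0, max_depth=3))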

Thanks for the heads-up. We will resolve it :).

achoum avatar Jun 22 '22 06:06 achoum