
Loading a classifier model from disk does not preserve column dtype when calling test()

Open pieths opened this issue 5 years ago • 1 comments

When a model is loaded from disk, the transforms_predictedlabelcolumnoriginalvalueconverter node is not added to the pipeline. As a result, the PredictedLabel column comes back as int32 rather than the expected int64.

Add the following test to the end of src\python\nimbusml\tests\pipeline\test_load_save.py to see the issue:

    def test_saving_loading_pipeline_model_does_not_change_dtype(self):
        model_nimbusml = Pipeline(
            steps=[
                ('cat',
                 OneHotVectorizer() << categorical_columns),
                ('linear',
                 FastLinearBinaryClassifier(
                     shuffle=False,
                     number_of_threads=1))])

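        model_nimbusml.fit(train, label)
        # Scores from the pipeline that was fitted in memory.
        metrics, score = model_nimbusml.test(test, test_label, output_scores=True)

        # Round-trip the model through disk.
        model_nimbusml.save_model('model.nimbusml.m')
        model_nimbusml_load = Pipeline()
        model_nimbusml_load.load_model('model.nimbusml.m')

        # Scores from the pipeline that was loaded from disk.
        metrics2, score2 = model_nimbusml_load.test(test,
                                                    test_label,
                                                    output_scores=True,
                                                    evaltype="binary")

        # Compare the dtype of the first score column before and after
        # the round trip.
        self.assertEqual(score.dtypes[0].name,
                         score2.dtypes[0].name)

        os.remove('model.nimbusml.m')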
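
Until the loaded pipeline adds the converter node, one possible stopgap is to cast the column on the caller's side. This is a minimal sketch with pandas, assuming the scores come back as a DataFrame with a PredictedLabel column. The helper name `align_predicted_label_dtype` and the sample values are made up for illustration and are not part of NimbusML. It only aligns the dtype; it does not reproduce anything else the converter node may do.

```python
import numpy as np
import pandas as pd


def align_predicted_label_dtype(scores, column="PredictedLabel",
                                target_dtype="int64"):
    """Return a copy of `scores` with `column` cast to `target_dtype`.

    Hypothetical helper: a caller-side cast, not a NimbusML API.
    """
    result = scores.copy()
    if column in result.columns:
        result[column] = result[column].astype(target_dtype)
    return result


# Stand-in for the scores returned by a pipeline loaded from disk
# (made-up values; only the int32 dtype matters here).
loaded_scores = pd.DataFrame({
    "PredictedLabel": np.array([0, 1, 1], dtype="int32"),
    "Score": [-0.3, 0.8, 1.2],
})

fixed_scores = align_predicted_label_dtype(loaded_scores)
print(loaded_scores["PredictedLabel"].dtype.name,
      fixed_scores["PredictedLabel"].dtype.name)
```

The helper works on a copy, so the DataFrame returned by the loaded pipeline is left untouched.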

pieths, Jul 16 '19 18:07

This affects any classifier: when the model is loaded from disk, steps is undefined or empty, so the first branch of the following if statement is skipped.

    def _predict(self, X, y=None,

        ...

        if hasattr(self, 'steps') and len(self.steps) > 0 \
                and self.last_node.type == 'classifier':
            select_node = transforms_scorecolumnselector(
                data="$scoredVectorData",
                output_data="$scoreColumnsOnlyData", score_column="Score")
            convert_label_node = \
                transforms_predictedlabelcolumnoriginalvalueconverter(
                    data="$scoreColumnsOnlyData",
                    predicted_label_column="PredictedLabel",
                    output_data="$output_data")
            all_nodes.extend([select_node, convert_label_node])
        else:
            select_node = transforms_scorecolumnselector(
                data="$scoredVectorData",
                output_data="$output_data", score_column="Score")
            all_nodes.extend([select_node])
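
The effect of that condition can be reproduced in isolation. The sketch below is a simplified stand-in, not the NimbusML implementation: `FakeNode`, `FakePipeline`, and `planned_output_nodes` are hypothetical names, and the nodes are represented by plain strings. It only mirrors the `hasattr` / `len` / `last_node.type` check to show which branch each kind of pipeline takes.

```python
class FakeNode:
    """Hypothetical stand-in for a pipeline step with a `type` attribute."""

    def __init__(self, node_type):
        self.type = node_type


class FakePipeline:
    """Hypothetical stand-in; `steps` is only set when steps are given."""

    def __init__(self, steps=None):
        if steps is not None:
            self.steps = steps

    @property
    def last_node(self):
        return self.steps[-1]


def planned_output_nodes(pipeline):
    """Mirror the branch condition and return the node names it selects."""
    if hasattr(pipeline, "steps") and len(pipeline.steps) > 0 \
            and pipeline.last_node.type == "classifier":
        return ["transforms_scorecolumnselector",
                "transforms_predictedlabelcolumnoriginalvalueconverter"]
    return ["transforms_scorecolumnselector"]


# A pipeline built and fitted in memory still has its steps.
fitted = FakePipeline(steps=[FakeNode("transform"), FakeNode("classifier")])
# A pipeline created empty and then loaded from disk has no steps.
loaded = FakePipeline()

print(planned_output_nodes(fitted))
print(planned_output_nodes(loaded))
```

In this sketch only the fitted pipeline gets the label converter. The loaded one falls through to the else branch, which matches the int32 PredictedLabel described above.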

pieths, Jul 17 '19 00:07