
Usability: Change meaning of "positive_label" from an index value to a human-understandable category value.

Open jimthompson5802 opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. Not related to a problem. This is a usability improvement.

Describe the use case Two visualizations (roc_curves and binary_threshold_vs_metric) require the positive_label parameter if the output feature is a category type. Currently positive_label is defined as "positive_label must be an integer, to find the integer label associated with a class check the ground_truth_metadata JSON file".

This requires the user to view training_set_metadata.json to determine the integer value associated with the desired category by viewing the str2idx entry. For example, if the desired positive class is "Female", the positive_label will be 2.

Here is an excerpt of a category feature from training_set_metadata.json:

    "sex": {
        "idx2str": [
            "<UNK>",
            " Male",
            " Female"
        ],
        "preprocessing": {
            "computed_fill_value": "<UNK>",
            "fill_value": "<UNK>",
            "lowercase": false,
            "missing_value_strategy": "fill_with_const",
            "most_common": 10000
        },
        "str2freq": {
            " Female": 16192,
            " Male": 32650,
            "<UNK>": 0
        },
        "str2idx": {
            " Female": 2,
            " Male": 1,
            "<UNK>": 0
        },
        "vocab_size": 3
    },
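As a rough sketch of the manual lookup this requires today, the snippet below parses a trimmed copy of the excerpt above and reads the index from `str2idx`. The inline JSON string stands in for the real file, and the variable names are only illustrative. Note that the keys keep the leading space present in the raw data.

```python
import json

# Trimmed copy of the "sex" entry from the excerpt above (illustrative stand-in
# for reading training_set_metadata.json from disk).
metadata_json = """
{
    "sex": {
        "idx2str": ["<UNK>", " Male", " Female"],
        "str2idx": {" Female": 2, " Male": 1, "<UNK>": 0}
    }
}
"""

metadata = json.loads(metadata_json)
str2idx = metadata["sex"]["str2idx"]

# The keys retain the leading space from the raw data, so the lookup
# must use " Female" rather than "Female".
positive_label = str2idx[" Female"]
print(positive_label)  # 2
```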

A further complication is that the ordering of category classes, which determines the index value, is based on the frequency of each class. This is from the category feature documentation: "Categories are mapped to integers by first collecting a dictionary of all unique category strings present in the column of the dataset, ranking them descending by frequency and assigning a sequential integer ID from the most frequent to the most rare (with 0 assigned to the special unknown placeholder token <UNK>)."

If the model is re-trained with new data and the relative frequencies of the classes differ, then the index value for a class will change. For the example shown, if "Female" is the more frequent class in the new training data, then positive_label will be 1.
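The instability can be illustrated with a small sketch of frequency-based ranking. `build_str2idx` is a hypothetical helper written from the documentation quote above, not Ludwig's actual preprocessing code, and the two toy datasets are made up for illustration.

```python
from collections import Counter

UNKNOWN = "<UNK>"


def build_str2idx(values):
    """Hypothetical helper: rank categories by descending frequency,
    reserving index 0 for the unknown placeholder token."""
    counts = Counter(values)
    str2idx = {UNKNOWN: 0}
    # most_common() returns labels from most frequent to least frequent.
    for idx, (label, _) in enumerate(counts.most_common(), start=1):
        str2idx[label] = idx
    return str2idx


# Toy data: " Male" is more frequent in the first set, " Female" in the second.
original = [" Male"] * 3 + [" Female"] * 2
retrained = [" Male"] * 2 + [" Female"] * 3

first = build_str2idx(original)
second = build_str2idx(retrained)

print(first[" Female"])   # 2
print(second[" Female"])  # 1
```

The same class string maps to a different integer after re-training, so a hard-coded integer positive_label would silently point at the other class.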

Describe the solution you'd like Instead of positive_label representing an integer, a more user-friendly approach is to use the class label string, i.e., positive_label should be specified as "Female". This is more understandable to the user and is stable regardless of the makeup of the data.
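One possible shape for this, sketched under assumptions: `resolve_positive_label` is a hypothetical helper, not an existing Ludwig function. It accepts either the current integer form or a class string and resolves the string through the feature's `str2idx` mapping. The whitespace-tolerant fallback is a design choice of this sketch, motivated by the leading spaces in the metadata excerpt.

```python
def resolve_positive_label(positive_label, feature_metadata):
    """Hypothetical helper: return the integer index for positive_label,
    given either an integer index or a class label string."""
    str2idx = feature_metadata["str2idx"]
    idx2str = feature_metadata["idx2str"]

    # Integer form: keep the current behavior, but validate the range.
    if isinstance(positive_label, int):
        if not 0 <= positive_label < len(idx2str):
            raise ValueError(f"positive_label {positive_label} is out of range")
        return positive_label

    # Exact string match first.
    if positive_label in str2idx:
        return str2idx[positive_label]

    # Fallback (an assumption of this sketch): ignore surrounding whitespace,
    # accepting the match only if it is unambiguous.
    wanted = positive_label.strip()
    matches = [idx for label, idx in str2idx.items() if label.strip() == wanted]
    if len(matches) == 1:
        return matches[0]

    raise ValueError(f"Unknown class label {positive_label!r}; known: {list(str2idx)}")


# Metadata for the "sex" feature, trimmed from the excerpt in this issue.
sex_metadata = {
    "idx2str": ["<UNK>", " Male", " Female"],
    "str2idx": {" Female": 2, " Male": 1, "<UNK>": 0},
}

print(resolve_positive_label("Female", sex_metadata))  # 2
print(resolve_positive_label(1, sex_metadata))         # 1
```

Accepting both forms would keep existing integer-based invocations working while letting new ones use the stable class string.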

Describe alternatives you've considered None

Additional context None

jimthompson5802 · Apr 09 '22 10:04