ludwig
Usability: Change meaning of "positive_label" from an index value to a human-understandable category value.
Is your feature request related to a problem? Please describe.
Not related to a problem. This is a usability improvement.
Describe the use case
Two visualizations (roc_curves and binary_threshold_vs_metric) require the positive_label parameter if the output feature is a category type. Currently positive_label is defined as "positive_label must be an integer, to find the integer label associated with a class check the ground_truth_metadata JSON file".
This requires the user to open training_set_metadata.json and look up the desired category in the str2idx entry to find its integer value. For example, if the desired positive class is "Female", the positive_label will be 2.
Here is an excerpt of a category feature from training_set_metadata.json:
"sex": {
    "idx2str": [
        "<UNK>",
        " Male",
        " Female"
    ],
    "preprocessing": {
        "computed_fill_value": "<UNK>",
        "fill_value": "<UNK>",
        "lowercase": false,
        "missing_value_strategy": "fill_with_const",
        "most_common": 10000
    },
    "str2freq": {
        " Female": 16192,
        " Male": 32650,
        "<UNK>": 0
    },
    "str2idx": {
        " Female": 2,
        " Male": 1,
        "<UNK>": 0
    },
    "vocab_size": 3
},
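For illustration, the manual lookup described above could be scripted roughly as follows. This is only a sketch: lookup_index is a hypothetical helper, not part of Ludwig, and the metadata is an inline, trimmed copy of the excerpt rather than the real file.

```python
import json

# Inline, trimmed copy of the excerpt; in practice this would come from
# json.load() on training_set_metadata.json.
metadata = json.loads("""
{
  "sex": {
    "idx2str": ["<UNK>", " Male", " Female"],
    "str2idx": {" Female": 2, " Male": 1, "<UNK>": 0}
  }
}
""")


def lookup_index(metadata, feature_name, class_label):
    """Hypothetical helper: return the integer stored for class_label in str2idx."""
    return metadata[feature_name]["str2idx"][class_label]


# The keys in this excerpt carry a leading space (" Female", " Male"),
# so the lookup has to use the string exactly as it appears in the file.
print(lookup_index(metadata, "sex", " Female"))  # 2
```

Note that the class strings in this particular excerpt start with a space, which is one more detail the user has to get right when doing the lookup by hand.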
A further complication is that the ordering of category classes, which determines the index value, is based on the frequency of each class. This is from the category feature documentation: "Categories are mapped to integers by first collecting a dictionary of all unique category strings present in the column of the dataset, ranking them descending by frequency and assigning a sequential integer ID from the most frequent to the most rare (with 0 assigned to the special unknown placeholder token <UNK>)."
If the model is retrained with new data and the relative frequencies of the classes differ, the index value for a class will change. For the example shown, if "Female" is the more frequent class in the new training data, then positive_label will be 1.
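The index shift can be reproduced with a small sketch of the frequency-ranking rule quoted from the documentation. build_str2idx is a hypothetical stand-in, not Ludwig's actual implementation; the alphabetical tie-break and the retrained counts are assumptions made for the example.

```python
from collections import Counter


def build_str2idx(values, unknown_token="<UNK>"):
    """Hypothetical sketch: 0 for the unknown token, then 1..N by descending frequency."""
    counts = Counter(values)
    # Rank by descending count; ties are broken alphabetically here
    # (an assumption, only to keep the result deterministic).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    str2idx = {unknown_token: 0}
    for idx, (label, _count) in enumerate(ranked, start=1):
        str2idx[label] = idx
    return str2idx


# Class counts taken from the str2freq entry in the excerpt.
original = ["Male"] * 32650 + ["Female"] * 16192
# Hypothetical retraining data in which "Female" is the more frequent class.
retrained = ["Male"] * 10 + ["Female"] * 20

print(build_str2idx(original)["Female"])   # 2
print(build_str2idx(retrained)["Female"])  # 1
```

The same class string maps to a different integer in the two runs, so a positive_label value hard-coded after the first training would silently point at the other class after the second.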
Describe the solution you'd like
Instead of positive_label representing an integer, a more user-friendly approach is to use the class label string, i.e., positive_label should be specified as "Female". This is more understandable to the user and is stable regardless of the makeup of the data.
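One possible shape for this, as a rough sketch only: the visualization code could translate the string into the index internally using the feature's str2idx mapping. resolve_positive_label is a hypothetical name, and still accepting integers for backward compatibility is an assumption, not something this request requires.

```python
def resolve_positive_label(positive_label, feature_metadata):
    """Hypothetical helper: map a class label string to its integer index.

    feature_metadata is assumed to be the per-feature dictionary from
    training_set_metadata.json, containing a "str2idx" mapping.
    """
    str2idx = feature_metadata["str2idx"]
    # Assumption: integers are passed through so existing calls keep working.
    if isinstance(positive_label, int):
        return positive_label
    if positive_label in str2idx:
        return str2idx[positive_label]
    known = sorted(str2idx)
    raise ValueError(
        f"Unknown class label {positive_label!r}; expected one of {known}"
    )


# Trimmed copy of the "sex" feature metadata used as the example in this issue.
sex_metadata = {"str2idx": {" Female": 2, " Male": 1, "<UNK>": 0}}

print(resolve_positive_label(" Female", sex_metadata))  # 2
print(resolve_positive_label(2, sex_metadata))          # 2
```

With a lookup like this, the user-facing value stays the same across retrainings even when the underlying index changes.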
Describe alternatives you've considered
None
Additional context
None