Define input format for categorical forecasts
As mentioned in #604, we would like to be able to score categorical/multiclass forecasts (or whatever the name ends up being, see #607). What should the expected format be?
Data.frame
Are we fine with the following? Columns `predicted`, `observed`, and `predicted_class`, with both `observed` and `predicted_class` being factors with the same levels.

Alternative names instead of `predicted_class`:
- `predicted_category`
- `category`
- `predicted_outcome`
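For illustration, here is a purely hypothetical sketch of what a single forecast could look like in that format, assuming three outcome categories and the `predicted_class` naming (none of this is settled):

```r
# illustrative only: one categorical forecast with three possible outcomes
forecast <- data.frame(
  observed        = factor("dog", levels = c("cat", "dog", "bird")),
  predicted_class = factor(c("cat", "dog", "bird"), levels = c("cat", "dog", "bird")),
  predicted       = c(0.2, 0.7, 0.1)
)
forecast
```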
Vector format
All seems reasonable. I have only come across these discussed as categorical forecasts but I don't have a strong feeling that that is the right way to go.
I had a chat with ChatGPT. The point about "categorical" being a good overall name and "multiclass" being useful to distinguish between binary and non-binary outcomes seems like a good one.
Prompt
Multiclass forecasts and categorical forecasts are related concepts but not identical.
Answer
- Multiclass Forecasts: These refer to prediction tasks where the outcome can belong to one of multiple classes. The key point is that each instance (or observation) is classified into exactly one out of several possible categories. Common in machine learning, examples include image classification where each image is categorized into one class (like 'cat', 'dog', 'bird').
- Categorical Forecasts: This term is more general. It encompasses any forecasting where outcomes are categories rather than numerical values. This includes multiclass situations but can also refer to binary classification (where there are only two categories) or even scenarios with unordered categories (like predicting types of cuisine).
As for which is more commonly used, it depends on the context:
- In general machine learning applications, multiclass classification is very common, especially in areas like computer vision and natural language processing.
- In statistical modeling and some business applications, simpler categorical forecasts (like binary classification) are often more prevalent.
The choice between them typically depends on the specific requirements of the task at hand. In some scenarios, the distinction might not be significant, while in others, the specific nature of the categories can greatly influence the modeling approach and evaluation metrics used.
OK, sounds good. Then I suggest the following.
n: number of observations, N: number of possible categories of the outcome
The data.frame input format will be
- `observed`: factor with N levels
- `predicted`: numeric between 0 and 1
- `somename`: factor with N unordered factor levels

One forecast comprises N rows; each possible factor level must have a prediction, and predictions must sum up to 1.
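As a sketch of the constraints (N rows per forecast, probabilities summing to 1), assuming N = 3 and using `predicted_label` as a placeholder for the still-to-be-named `somename` column:

```r
# one forecast in the proposed data.frame format (N = 3 categories)
forecast_df <- data.frame(
  observed        = factor("dog", levels = c("cat", "dog", "bird")),
  predicted       = c(0.2, 0.7, 0.1),
  predicted_label = factor(c("cat", "dog", "bird"), levels = c("cat", "dog", "bird"))
)
stopifnot(nrow(forecast_df) == nlevels(forecast_df$observed))  # one row per possible level
stopifnot(abs(sum(forecast_df$predicted) - 1) < 1e-8)          # probabilities sum to 1
```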
The vector/matrix format will be
- `observed`: vector (factor) of length n with N unordered factor levels
- `predicted`: n x N matrix, rows are observations, columns are categories. If n = 1 this can also be a vector of length N.
- `somename`: factor of length N with N levels, representing the columns of `predicted`.
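For concreteness, a sketch of the vector/matrix format with n = 2 observations and N = 3 categories (again with `predicted_label` as a placeholder name):

```r
# illustrative only: vector/matrix input for two forecasts over three categories
observed <- factor(c("dog", "cat"), levels = c("cat", "dog", "bird"))  # length n

predicted <- matrix(
  c(0.2, 0.7, 0.1,
    0.6, 0.3, 0.1),
  nrow = 2, byrow = TRUE
)                                                                       # n x N matrix

predicted_label <- factor(c("cat", "dog", "bird"))  # identifies the columns of `predicted`
colnames(predicted) <- as.character(predicted_label)

stopifnot(all(abs(rowSums(predicted) - 1) < 1e-8))  # each forecast sums to 1
```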
I also suggest moving the naming of `somename` to #607.
Pinging @nickreich and @sbfnk in case you want to weigh in
@nickreich just raised a good point: Do we want to enforce N rows for every forecast? Say you're predicting who wins the US presidency. You have 30 candidates, but you only assign a probability > 0 to 6 of them. Do you then have to have 24 rows with zeros?
I can see several options:
- A: we enforce this strictly
  - makes the format very clear
  - we should probably give users a helper function that expands their data.frame and creates rows with a `predicted` value of zero for every missing category label (a rough sketch follows below this list)
- B: we don't enforce this
  - allows users to save storage space + interact with the function without having to do any additional formatting
  - can affect scoring in undesirable ways. Let's say there were initially 30 candidates, but 10 dropped out and only 20 were left at the time you made a forecast. But you prepared your data much earlier, so your factor has 30 levels, even though in reality there were only 20 options. If you use the Brier score, it will make a difference whether you had 20 or 30 levels to begin with.
  - we could potentially address this by printing helpful messages / running some checks on whether the data makes sense
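A rough sketch of what such a helper could look like, assuming a data.frame holding a single forecast and the hypothetical column names `predicted` and `predicted_label` (nothing here is existing scoringutils API):

```r
# hypothetical helper: expand a single-forecast data.frame so that every level
# of the category column gets a row, filling missing categories with predicted = 0
expand_categories <- function(df, label_col = "predicted_label") {
  all_levels <- levels(df[[label_col]])
  missing_levels <- setdiff(all_levels, as.character(df[[label_col]]))
  if (length(missing_levels) == 0) {
    return(df)
  }
  # copy an existing row for each missing category, then overwrite label and prediction
  filler <- df[rep(1, length(missing_levels)), , drop = FALSE]
  filler[[label_col]] <- factor(missing_levels, levels = all_levels)
  filler$predicted <- 0
  rbind(df, filler)
}
```

A real implementation would of course have to apply this per forecast unit (e.g. per model and target) rather than to the whole data.frame at once.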
Noting that in the vector/matrix paradigm we have some kind of implicit enforcement anyway: the prediction matrix has to have a rectangular shape. (Though in the above example, you'd end up with an n x 20 matrix even though your factor had 30 levels, and then the function would have to decide whether to take its N from the number of factor levels or from the dimensions of the prediction.)
I'm personally leaning slightly towards strict enforcement + helper function to get there from a more liberal format that omits rows with a predicted probability of 0. What do others think? Also pinging @elray1 in case you have thoughts
> I'm personally leaning slightly towards strict enforcement + helper function to get there from a more liberal format that omits rows with a predicted probability of 0.
Yes, I think this makes sense. We could potentially run this for people within the `as_forecast` method, but maybe not if it's overly complicated.