Group metrics by labels
Context:
We want to support class-specific results from our metrics. As part of that, we're essentially opening up the metric interface so that it can return multiple values, with a label per value. For our standard functions, the polygon metrics will return a value per label.
Before:
We'd get a single number per DatasetItem and then a single aggregate_score per evaluation.
After:
We get a {"key_1": number, ..., "key_N": number} per DatasetItem and then another {"another_key_1": number, ..., "another_key_N": number} from aggregate_score.
As part of this we also add extra_info, a string -> string dictionary that can be attached to each dataset item in the metric, and an error field that we populate if evaluation of a single DatasetItem fails. That way a single failure doesn't fail the whole evaluation.
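A minimal sketch of what a per-item result could hold under this shape (the class and field names here are illustrative, not necessarily the ones in this PR):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PerItemResult:
    # One value per label instead of a single float,
    # e.g. {"car": 0.91, "person": 0.78}.
    values: Dict[str, float]
    # Free-form string -> string metadata attached to this dataset item.
    extra_info: Dict[str, str] = field(default_factory=dict)
    # Set when evaluating this single DatasetItem fails, so the rest of
    # the evaluation can still complete instead of failing as a whole.
    error: Optional[str] = None
```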
This PR
This changes the metrics to allow returning multiple values per metric and changes the default metrics to return results grouped by label.
This also adds an extra method on results that allows us to pass more data than floats to the frontend via extra_info; for example, we currently send the weight of each dataset item along with the ScalarResult.
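As a rough usage sketch (the stand-in class below is hypothetical; the real ScalarResult may differ):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ScalarResultSketch:
    """Hypothetical stand-in for ScalarResult; the real class may differ."""

    value: float
    weight: float = 1.0
    extra_info: Dict[str, str] = field(default_factory=dict)


# Send the item weight to the frontend alongside the score.
result = ScalarResultSketch(value=0.85, weight=3.0)
result.extra_info["weight"] = str(result.weight)
```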
Overall looks pretty clean to me - all comments are nits. My biggest concern with this change is that I don't want to overcomplicate the case where a user doesn't care about class-specific results and just wants to aggregate results. As such, I wonder if it would be beneficial to change the interface of the aggregate_results method to return a scalar and to have all metric classes implement it?
Yeah, that is a valid concern. It definitely warrants taking a closer look at the interface for how and where we choose to group_by and aggregate. I'll try to figure out a suggestion today.
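For illustration only, the kind of interface being discussed might look roughly like this (all names and signatures here are hypothetical, not a commitment):

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Metric(ABC):
    """Hypothetical interface; method names are for discussion only."""

    @abstractmethod
    def eval_item(self, item) -> Dict[str, float]:
        """Per-label scores for a single dataset item."""

    @abstractmethod
    def aggregate_score(self, results: List[Dict[str, float]]) -> Dict[str, float]:
        """Per-label aggregate across all items."""

    @abstractmethod
    def aggregate_results(self, results: List[Dict[str, float]]) -> float:
        """Single scalar for users who don't care about class-specific results."""
```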