cvms
Multi-class confusion matrix with 0 values
Hej L, I'm using your functions to visualize a multi-class confusion matrix, and I run into issues when some of the categories are never predicted for some of the targets. Example data: https://www.dropbox.com/s/wc1ytv1ro9kyxow/predictions.csv?dl=0
"My" code (straight from the vignette):
conf_mat <- confusion_matrix(targets = Predictions$Reference,
                             predictions = Predictions$Prediction)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]])
The warnings:
1: In plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]]) :
  'ggimage' is missing. Will not plot arrows and zero-shading.
2: In plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]]) :
  'rsvg' is missing. Will not plot arrows and zero-shading.
The plot: https://www.dropbox.com/s/cr6n0c7rcsv1ik6/confmat.jpeg?dl=0
Hi Ric,
There are two things at play here:

- You need to install `ggimage` and `rsvg`. These will plot the arrows and stripes ("zero-shading"). The reason they are not installed by default with `cvms` is that they don't exist on all platforms.
- The defaults have changed in the latest version, such that tiles with 0 values do not show the numbers. Together with the zero-shading this makes the plot cleaner. It is not for everyone though, so you can get the numbers back by setting (in `plot_confusion_matrix()`):

  rm_zero_text = FALSE,
  rm_zero_percentages = FALSE

  (`rm_zero_percentages` is of course optional.)

If you think this is a bad default, do let me know. But have a look after installing those packages first, as I think it is easier to look at than a lot of 0s that don't really add much info anyway.
Also check out the new `add_sums = TRUE` option. Might be useful to you some day :)
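For completeness, a minimal sketch of the full call with those settings. This assumes `cvms`, `ggimage`, and `rsvg` are installed; the `Predictions` data frame here is hypothetical stand-in data, not the file from the question:

```r
# install.packages(c("cvms", "ggimage", "rsvg"))  # one-time setup
library(cvms)

# Hypothetical example data: class "D" appears in the targets
# but is never predicted, mimicking the situation in the question.
set.seed(1)
Predictions <- data.frame(
  Reference  = sample(c("A", "B", "C", "D"), 60, replace = TRUE),
  Prediction = sample(c("A", "B", "C"), 60, replace = TRUE)
)

conf_mat <- confusion_matrix(targets = Predictions$Reference,
                             predictions = Predictions$Prediction)

plot_confusion_matrix(
  conf_mat$`Confusion Matrix`[[1]],
  rm_zero_text        = FALSE,  # show counts again in 0-count tiles
  rm_zero_percentages = FALSE,  # show percentages again in 0-count tiles (optional)
  add_sums            = TRUE    # add row/column sum tiles
)
```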
Now it all works, so this could be closed. But a quick comment: I guess the convention for confusion matrices is to put the focus on the performance. So, e.g. here: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0196836.g006, you see how accurate the classification was. However, the cvms plot instead emphasizes performance baselined by how many of the total data points are within that class. So here (https://www.dropbox.com/s/hcy8509w7q6ul7y/ConfusionMatrix.jpeg?dl=0) a perfect classification gives a 25% score. That might be confusing to people reading the confusion matrix who are used to the other convention.
The information is there though. What is shown in the PLOS plot are the column percentages, i.e. how big a percentage each tile is of its column's total. That's the small percentage at the bottom of the tiles in my plot. The percentage to the right is the row percentage.
In the "correct prediction" diagonal, column percentages are the class-level recall/sensitivity scores, while the row percentages are the class-level precision/positive predictive values.
I'm not completely sure right now why they call it accuracy. It doesn't make sense to have accuracy scores outside the diagonal, and inside they are saying: given that the target is class A, how often was it predicted to be class A? And that is the recall score.
Often (I don't know about the literature) people make multiple plots with 1) the row-normalized, 2) the column-normalized, and 3) the count / overall-normalized values. The cvms plot has them all. Furthermore, I would say it is a huge benefit to have the overall-normalized values (and/or counts), so you can see any class imbalances in the dataset (which often explain impressive metrics in the results section).
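A quick base-R sketch of the relationship above, with hypothetical counts (rows = predictions, columns = targets, as in the cvms plot): on the diagonal, the column percentage equals the class-level recall, and the row percentage equals the class-level precision.

```r
# Hypothetical 2-class confusion matrix.
# Rows = predictions, columns = targets (the cvms plot orientation).
cm <- matrix(c(40, 10,
                5, 45),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B"),
                             Target     = c("A", "B")))

col_pct <- sweep(cm, 2, colSums(cm), "/")  # normalize each column (target)
row_pct <- sweep(cm, 1, rowSums(cm), "/")  # normalize each row (prediction)

recall_A    <- cm["A", "A"] / sum(cm[, "A"])  # TP / (TP + FN)
precision_A <- cm["A", "A"] / sum(cm["A", ])  # TP / (TP + FP)

# On the diagonal: column percentage = recall, row percentage = precision
stopifnot(isTRUE(all.equal(col_pct["A", "A"], recall_A)))
stopifnot(isTRUE(all.equal(row_pct["A", "A"], precision_A)))
```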
Let me know what you think. I would love to make things more clear, if needed. :)
Yes, I also love the information provided; this is more about what gets the emphasis. I'll think a bit more about the cognitive ergonomics of this (aka share the plots with co-authors and let them think for me by misunderstanding things) and comment back.
Great :)
@fusaroli closing this for now. Feel free to reopen any time!