Clustering.jl
Clustering.jl copied to clipboard
Added pair counting fmeasure metric
I often use this metric, I think it's worth having.
refs: https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.214.7233&rep=rep1&type=pdf
Included also is precision and recall for clustering, I was not sure about the proper name (e.g. precision is already in use by Julia base).
The _pair_confusion_matrix is translated from sklearn's https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/metrics/cluster/_supervised.py#L154
there is a small duplication with the rand index, which also require this matrix, but as I did not want to modify it to use my new function right now, but in a separated or (if at all).
The _pair_confusion_matrix is translated from sklearn's https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/metrics/cluster/_supervised.py#L154
If it's a direct translation, you'll have to include the license here (or ask the scikit-learn folks if a translation of their code can be MIT licensed. But I'm guessing that would be difficult.).
I don't do much with this package, but I can review this. However, we'll need to figure out the license stuff first (i.e., do we really want to include BSD licensed code here.)
If _pair_confusion_matrix
is simple (and it sounds like it should be), you could just include a description of the code, remove it from here, and someone else can implement it. Maybe me, but it can be anyone that hasn't seen the scikit-learn code. That way, we wouldn't have to worry about the license.
Given how short and simple the code is, it probably won't have to be considered as derived from NumPy if you adapt it to make it more Julian and more efficient, as in the end the only think that will remain from NumPy is the algorithm. For example, sum(c.*c)
should we written as sum(abs2, c)
, sum(c,dims=1)[:]
as vec(sum(c, dims=1))
and so on.
Thanks, I intend to rewrite it, maybe extract common functionalities from the ARI metric, hopefully, will get to it soon enough.
I've just re-implemented this functionality in #227 to fix ARI calculations.
Great!, I will wait for it to get pushed and update this commit accordingly.