Clustering.jl

Added pair counting fmeasure metric

Open dinarior opened this issue 3 years ago • 5 comments

I often use this metric, and I think it's worth having.

refs: https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.214.7233&rep=rep1&type=pdf

Precision and recall for clustering are also included. I was not sure about the proper names (e.g. precision is already in use by Julia Base).

The _pair_confusion_matrix is translated from sklearn's https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/metrics/cluster/_supervised.py#L154

There is a small duplication with the Rand index, which also requires this matrix. I did not want to modify it to use my new function right now; I would rather do that in a separate PR (if at all).
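For reference, here is a minimal sketch of the pair-counting definitions, written from the formulas in the first reference above rather than from the code in this PR. The function names (pair_counts, pair_precision, pair_recall, pair_fmeasure) and the beta keyword are hypothetical and are not part of the Clustering.jl API. It enumerates all pairs directly, so it is O(n^2) and only meant to illustrate what is being counted.

```julia
# Hypothetical helper names for illustration; not Clustering.jl API.

# Classify every unordered pair of points (i < j) by whether the two points
# share a cluster in `a` (the clustering) and in `b` (the reference labels).
# Assumes ordinary 1-based vectors of equal length.
function pair_counts(a::AbstractVector{<:Integer}, b::AbstractVector{<:Integer})
    length(a) == length(b) ||
        throw(DimensionMismatch("label vectors must have equal length"))
    tp = fp = fn = tn = 0
    n = length(a)
    for i in 1:n-1, j in i+1:n
        same_a = a[i] == a[j]   # pair is together in the clustering
        same_b = b[i] == b[j]   # pair is together in the reference
        if same_a && same_b
            tp += 1
        elseif same_a
            fp += 1
        elseif same_b
            fn += 1
        else
            tn += 1
        end
    end
    return (tp = tp, fp = fp, fn = fn, tn = tn)
end

pair_precision(c) = c.tp / (c.tp + c.fp)
pair_recall(c)    = c.tp / (c.tp + c.fn)

# F-measure with weight `beta` on recall: (1 + beta^2) P R / (beta^2 P + R).
function pair_fmeasure(a, b; beta::Real = 1)
    c = pair_counts(a, b)
    p, r = pair_precision(c), pair_recall(c)
    return (1 + beta^2) * p * r / (beta^2 * p + r)
end

# The 17-point example from the Stanford IR book reference:
clusters = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
classes  = [1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
pair_counts(clusters, classes)    # (tp = 20, fp = 20, fn = 24, tn = 72)
pair_fmeasure(clusters, classes)  # 40/84, about 0.476
```

The same four counts also give the Rand index as (tp + tn) / (tp + fp + fn + tn), which is where the duplication comes from.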

dinarior, Aug 22 '21 17:08

The _pair_confusion_matrix is translated from sklearn's https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/metrics/cluster/_supervised.py#L154

If it's a direct translation, you'll have to include the license here, or ask the scikit-learn folks whether a translation of their code can be MIT licensed (though I'm guessing that would be difficult).

I don't do much with this package, but I can review this. However, we'll need to figure out the license question first (i.e., do we really want to include BSD-licensed code here?).

If _pair_confusion_matrix is simple (and it sounds like it should be), you could just include a description of the code, remove it from here, and let someone else implement it. Maybe me, but it can be anyone who hasn't seen the scikit-learn code. That way, we wouldn't have to worry about the license.

kmsquire, Oct 26 '21 17:10

Given how short and simple the code is, it probably won't have to be considered as derived from NumPy if you adapt it to make it more Julian and more efficient, as in the end the only thing that will remain from NumPy is the algorithm. For example, sum(c.*c) should be written as sum(abs2, c), sum(c,dims=1)[:] as vec(sum(c, dims=1)), and so on.
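A quick check that the two spellings agree, using a made-up 2x2 table (the values are arbitrary, not from the PR):

```julia
c = [1 2; 3 4]   # made-up contingency table, for illustration only

# Sum of squares: `c .* c` allocates a temporary array first,
# while `sum(abs2, c)` applies abs2 element by element as it sums.
@assert sum(c .* c) == sum(abs2, c) == 30

# Column sums as a flat vector: `[:]` copies the 1x2 result,
# while `vec` reshapes it without copying.
@assert sum(c, dims=1)[:] == vec(sum(c, dims=1)) == [4, 6]
```

Both forms give the same numbers for real-valued tables; the second spelling in each pair just avoids an extra allocation.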

nalimilan, Nov 27 '21 17:11

Thanks, I intend to rewrite it and maybe extract common functionality from the ARI metric. Hopefully I will get to it soon.

dinarior, Nov 27 '21 18:11

I've just re-implemented this functionality in #227 to fix ARI calculations.

wildart, Dec 25 '21 20:12

Great! I will wait for it to get pushed and then update this commit accordingly.

dinarior, Dec 26 '21 07:12