Have you tried using a bayesian classifier?

Open c4milo opened this issue 11 years ago • 4 comments

This package is great, I have a similar need and I was wondering if you tried using a Bayesian classifier for this.

c4milo avatar Sep 07 '14 07:09 c4milo

@c4milo I have not tried Bayesian classifiers yet, but that is an interesting idea! One other thing I did try was the Jaro-Winkler distance, but that proved to be extremely expensive for what go-license is doing. Bayes seems like a much better fit for this sort of thing.

From my understanding, the functionality is similar to a naive Bayes classifier in that go-license will just look for certain "features" in license text, and regardless of what else is contained in the body or how it is formatted, make an optimistic assumption about what the license type is. I would be interested to see what the code and performance would look like using a Bayesian classifier, though.
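To make the "look for certain features" idea concrete, here is a minimal sketch of phrase-based detection. It is not go-license's actual code: the `features` table, the phrases in it, and the `guess` helper are all assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// feature pairs a license label with phrases expected somewhere in its text.
// Hypothetical table for illustration only, not go-license's real data.
type feature struct {
	label   string
	phrases []string
}

var features = []feature{
	{"MIT", []string{"permission is hereby granted, free of charge"}},
	{"Apache-2.0", []string{"apache license", "version 2.0"}},
}

// normalize lowercases the text and collapses all whitespace runs, so that
// line wrapping and indentation do not affect matching.
func normalize(text string) string {
	return strings.Join(strings.Fields(strings.ToLower(text)), " ")
}

// guess returns the first label whose phrases all occur in the text,
// ignoring whatever else the text contains (the "optimistic assumption").
func guess(text string) (string, bool) {
	norm := normalize(text)
	for _, f := range features {
		matched := true
		for _, p := range f.phrases {
			if !strings.Contains(norm, p) {
				matched = false
				break
			}
		}
		if matched {
			return f.label, true
		}
	}
	return "", false
}

func main() {
	label, ok := guess("Copyright (c) someone\n\nPermission is hereby granted,\n  free of charge, to any person")
	fmt.Println(label, ok)
}
```

A scan like this is a hard yes/no per phrase, whereas a Bayesian classifier would weigh every token and pick the most probable label.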

ryanuber avatar Sep 07 '14 18:09 ryanuber

Performance is supposedly good with Bayesian classifiers compared to k-NN. The key thing is to normalize the data as much as possible, for example by using a stemmer and removing stop words. I think it is worth trying.
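As a rough sketch of what that could look like, here is a tiny multinomial naive Bayes classifier with add-one (Laplace) smoothing and stop-word removal. Everything in it is an assumption for illustration: the `Classifier` type, the short stop-word list, and the training snippets are invented, and stemming is left out to keep it short.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// stopWords is a deliberately tiny, illustrative stop-word list.
var stopWords = map[string]bool{
	"the": true, "a": true, "an": true, "of": true, "to": true,
	"and": true, "or": true, "in": true, "is": true,
}

// tokenize lowercases text, splits on non-alphanumerics, and drops stop words.
func tokenize(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	out := words[:0]
	for _, w := range words {
		if !stopWords[w] {
			out = append(out, w)
		}
	}
	return out
}

// Classifier holds the per-label token counts needed for naive Bayes.
type Classifier struct {
	wordCounts map[string]map[string]int // label -> token -> count
	totalWords map[string]int            // label -> total tokens seen
	docCounts  map[string]int            // label -> training documents
	vocab      map[string]bool           // every token seen in training
	docs       int
}

func NewClassifier() *Classifier {
	return &Classifier{
		wordCounts: map[string]map[string]int{},
		totalWords: map[string]int{},
		docCounts:  map[string]int{},
		vocab:      map[string]bool{},
	}
}

// Train adds one labeled document to the model.
func (c *Classifier) Train(label, text string) {
	if c.wordCounts[label] == nil {
		c.wordCounts[label] = map[string]int{}
	}
	for _, w := range tokenize(text) {
		c.wordCounts[label][w]++
		c.totalWords[label]++
		c.vocab[w] = true
	}
	c.docCounts[label]++
	c.docs++
}

// Classify returns the label with the highest log-posterior:
// log P(label) + sum over tokens of log P(token | label), add-one smoothed.
func (c *Classifier) Classify(text string) string {
	best, bestScore := "", math.Inf(-1)
	tokens := tokenize(text)
	for label, n := range c.docCounts {
		score := math.Log(float64(n) / float64(c.docs))
		denom := float64(c.totalWords[label] + len(c.vocab))
		for _, w := range tokens {
			score += math.Log(float64(c.wordCounts[label][w]+1) / denom)
		}
		if score > bestScore {
			best, bestScore = label, score
		}
	}
	return best
}

func main() {
	c := NewClassifier()
	c.Train("MIT", "Permission is hereby granted, free of charge, to any person obtaining a copy of this software")
	c.Train("GPL", "This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License")
	fmt.Println(c.Classify("you may redistribute this program under the GNU General Public License"))
}
```

In practice the model would be trained on full license texts, and a stemmer could be applied inside `tokenize` before counting.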

c4milo avatar Sep 07 '14 18:09 c4milo

Hi @c4milo, were you encountering problems with the existing code? Perhaps you could open a pull request adding the license files that weren't detected, maybe under fixtures/variants. Thanks all!

client9 avatar Oct 05 '15 16:10 client9

@client9 I don't think this is really an issue, but more of a thought ticket about a possibly better way to do license scanning than the naive full-text scan we do currently. I explored this a bit, but got into the weeds when trying to distinguish similar licenses, like the *GPL or *BSD variants. I might revisit this at some point, and I think there are probably some easy performance wins we can get even with the current code.

ryanuber avatar Oct 05 '15 18:10 ryanuber