atarashi Problem with identifying the short license text

Generally, the license contained in a source code file is either a short license notice or a large block of license text. This makes it difficult for information retrieval and similarity-finding algorithms to classify it efficiently.

Please suggest how this should be resolved before implementing other IR (Information retrieval) algorithms.

Jun 07 '18 08:06 amanjain97

From the discussion: let us first get working code for the large block of license text; then we can fine-tune the algorithm or find a workaround.

Jun 07 '18 10:06 ag4ums

Please check with https://github.com/siemens/atarashi/commit/ca157e98caaafdf4001ab57f5ca735853bed1353

Jun 07 '18 17:06 amanjain97

It looks like the bigram cosine similarity returns a high number of BitTorrent results. With the SPDX test files, BitTorrent-1.{0|1} repeatedly score high. For example, for the 0BSD text, the bigram cosine similarity returns BitTorrent-1.0 with the highest score.

My rough idea of the cause: the BitTorrent license texts are very long and cover a lot of different areas, so they contain a high number of bigrams that also occur in many other licenses. The score computation already takes into account the number of bigrams matching between the reference text and the scanned text; however, adding an additional weight to the temp value when computing the score could be an approach to start testing with.
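To make the weighting idea concrete, here is a minimal sketch in Python. It is not the atarashi implementation: the function names (`bigrams`, `cosine`, `length_weight`, `weighted_score`) and the choice of a length-ratio weight are assumptions made for illustration only. It computes a plain cosine similarity over word-bigram counts and then multiplies it by a hypothetical weight that penalizes a large size mismatch between the scanned text and the reference text.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return a Counter of word bigrams from the lower-cased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def length_weight(a, b):
    """Hypothetical extra weight: smaller bigram total divided by the larger."""
    size_a, size_b = sum(a.values()), sum(b.values())
    if size_a == 0 or size_b == 0:
        return 0.0
    return min(size_a, size_b) / max(size_a, size_b)


def weighted_score(scanned, reference):
    """Cosine similarity scaled by the (assumed) length-ratio weight."""
    a, b = bigrams(scanned), bigrams(reference)
    return cosine(a, b) * length_weight(a, b)
```

With a weight like this, a very long reference text that happens to share a few bigrams with a short scanned text is scored lower than a reference of similar length. Whether this helps on the SPDX test files would still need to be checked.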

Jul 19 '18 08:07 mcjaeger