
Improve TF-IDF agent by tuning match threshold


Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search and filtering leaves fewer of them to sort. See the piece of code I am talking about (especially lines 126 and 133):

https://github.com/fossology/atarashi/blob/6cdd4104a278b6d993363d5989c859ab78e5e21c/atarashi/agents/tfidf.py#L124-L136
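For context, here is a minimal sketch of what that block does, assuming the standard scikit-learn API; the function and variable names are illustrative, not the exact identifiers used in tfidf.py:

```python
import operator

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.30  # the filter applied around line 126 of the linked code

def match_licenses(input_text, license_texts, license_names):
    # Vectorize the license corpus together with the scanned file.
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(license_texts + [input_text])
    input_vector, license_vectors = matrix[-1], matrix[:-1]
    matches = []
    for idx, name in enumerate(license_names):
        score = cosine_similarity(license_vectors[idx], input_vector)[0][0]
        if score >= THRESHOLD:  # dropping this filter is what improves accuracy
            matches.append({'shortname': name, 'sim_score': score})
    # The sort happens at the end of the search (around line 133), so the
    # threshold mainly limits how many candidates reach this point.
    matches.sort(key=operator.itemgetter('sim_score'), reverse=True)
    return matches
```

With the threshold at 0.00 every license survives the filter, which lines up with row 7 below being about twice as slow as row 1.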

Using the evaluation.py script, I've carried out some experiments:

| # | Algorithm | Time elapsed | Accuracy |
|---|-----------|--------------|----------|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | – | 57.0% |
| 9 | Ngram (BigramCosineSim) | – | 56.0% |
| 10 | Ngram (DiceSim) | – | 55.0% |
| 11 | wordFrequencySimilarity | – | 23.0% |
| 12 | DLD | – | 17.0% |
| 13 | tfidf (ScoreSim) | – | 13.0% |
- Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
- Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, removing the threshold alone makes the agent 2x slower, so I kept tuning the threshold and held on to the largest value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
- To keep decreasing the execution time while keeping the accuracy up, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (the default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, shown in row 3 (see the toy sketch after this list).
  - Why does decreasing the max_df value increase the speed? Because the vectorizer ignores every term that appears in more than max_df (as a fraction) of the documents (see the docs), i.e., it drops the most frequent terms, so each document vector has fewer non-zero entries and the cosine similarity is cheaper to compute.
  - Why does decreasing the max_df value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish between them; the rare terms are what make licenses differ from each other, so they are enough for the algorithm to do a good job.
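Here is a toy sketch of the max_df pruning; the three-document "corpus" is made up, and max_df is scaled up to 0.67 because pruning at 0.10 only makes sense on a corpus of many licenses:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up "license" snippets sharing some boilerplate terms.
corpus = [
    'permission is hereby granted free of charge',
    'permission hereby granted under the apache license',
    'permission hereby granted under the gnu general public license',
]

default_vocab = TfidfVectorizer().fit(corpus).vocabulary_
pruned_vocab = TfidfVectorizer(max_df=0.67).fit(corpus).vocabulary_

# Terms present in more than 67% of the documents ('granted', 'hereby',
# 'permission') are dropped; the rarer, more discriminative terms survive.
print(sorted(set(default_vocab) - set(pruned_vocab)))
```

On the real license corpus, max_df=0.10 applies the same pruning to any term that appears in more than 10% of the licenses.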

I will be opening a PR so you can reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

- I've left out the elapsed times for all the other algorithms because I ran those experiments in a different environment, so comparing the times wouldn't be fair.
- All the results differ from the latest report I could find. I do not fully understand why some of them are so different; it is probably due to changes in the test files or in the algorithms. Either way, 62.0% is the new best result in both reports.
- My findings may help improve other agents that use thresholds, such as Ngram.
- This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would become the new baseline.

xavierfigueroav avatar Mar 22 '22 09:03 xavierfigueroav

That's a very detailed evaluation, @xavierfigueroav. Thank you for providing the info.

Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).

GMishx avatar Mar 28 '22 09:03 GMishx