Improve TF-IDF agent by tuning matches threshold
Hello.
I've been playing around with some parameters of the TF-IDF agent.
I've found that if we stop using a threshold (`cosine similarity >= 0.30`) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133):
https://github.com/fossology/atarashi/blob/6cdd4104a278b6d993363d5989c859ab78e5e21c/atarashi/agents/tfidf.py#L124-L136
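For context, here is a minimal, self-contained sketch of the thresholded matching pattern being discussed; the function and variable names are illustrative, not the actual identifiers in `tfidf.py`:

```python
# Illustrative sketch only, not the actual tfidf.py implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MATCH_THRESHOLD = 0.30  # the value this issue proposes tuning

def best_matches(input_text, license_names, license_texts):
    # Vectorize every license text plus the scanned file in one vocabulary.
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(license_texts + [input_text])
    input_vector, license_matrix = matrix[-1], matrix[:-1]

    # Cosine similarity between the scanned file and every license text.
    scores = cosine_similarity(input_vector, license_matrix)[0]

    # Keep only the matches above the threshold, then sort them best-first.
    matches = [(name, score) for name, score in zip(license_names, scores)
               if score >= MATCH_THRESHOLD]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

Removing the threshold means every license survives the filter and is carried into the final sort, which is where the extra time in row 7 below comes from.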
Using the `evaluation.py` script, I've carried out some experiments:
| Row | Algorithm | Time elapsed | Accuracy |
|---|---|---|---|
| 1 | tfidf (CosineSim) (thr=0.30) | 30.19 | 59.0% |
| 2 | tfidf (CosineSim) (thr=0.17) | 35.29 | 61.0% |
| 3 | tfidf (CosineSim) (thr=0.16, max_df=0.10) | 27.34 | 62.0% |
| 4 | tfidf (CosineSim) (thr=0.16) | 36.42 | 62.0% |
| 5 | tfidf (CosineSim) (thr=0.15) | 38.45 | 62.0% |
| 6 | tfidf (CosineSim) (thr=0.10) | 39.91 | 62.0% |
| 7 | tfidf (CosineSim) (thr=0.00) | 61.49 | 62.0% |
| 8 | Ngram (CosineSim) | - | 57.0% |
| 9 | Ngram (BigramCosineSim) | - | 56.0% |
| 10 | Ngram (DiceSim) | - | 55.0% |
| 11 | wordFrequencySimilarity | - | 23.0% |
| 12 | DLD | - | 17.0% |
| 13 | tfidf (ScoreSim) | - | 13.0% |
- Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent, using CosineSim as the similarity measure.
- Row 7 shows that we can reach an accuracy of 62.0% just by removing the threshold (`cosine similarity >= 0.00`). However, simply removing the threshold makes the agent 2x slower, so I continued tuning the threshold, keeping the last value that still produces 62.0% accuracy, which is `0.16`, shown in row 4.
- In order to continue decreasing the execution time and increasing the accuracy, I tuned some parameters of the `TfidfVectorizer`. Setting `max_df` to `0.10` (the default is `1.0`) keeps the accuracy at 62.0% but makes the agent 1.1x faster, shown in row 3 (see the sketch after this list).
- Why does decreasing the `max_df` value increase the speed? Because the vectorizer ignores all terms that appear in more than the `max_df` fraction of the documents (see the `TfidfVectorizer` docs), i.e., it ignores the most frequent terms, so each document vector is shorter, making the cosine similarity cheaper to compute.
- Why does decreasing the `max_df` value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish between licenses; the rare terms are what make licenses different from each other, so they are enough for the algorithm to do a good job.
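To make the `max_df` effect concrete, here is a toy sketch (not atarashi code; the four-document corpus is invented, so it uses `max_df=0.5` instead of the `0.10` that works on the real license corpus) showing how the parameter prunes ubiquitous terms from the vocabulary:

```python
# Toy demonstration of the max_df effect; the corpus below is invented, and
# max_df=0.5 is used only because this corpus has just four documents.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "this license grants permission to use the software free of charge",
    "this license permits redistribution in source and binary forms",
    "this license is provided without warranty of any kind",
    "this license allows modification and distribution of copies",
]

default_vec = TfidfVectorizer()          # max_df defaults to 1.0: keep every term
tuned_vec = TfidfVectorizer(max_df=0.5)  # ignore terms present in >50% of documents

default_vec.fit(corpus)
tuned_vec.fit(corpus)

dropped = set(default_vec.vocabulary_) - set(tuned_vec.vocabulary_)
print(sorted(dropped))  # the terms shared by most documents, e.g. ['license', 'of', 'this']
print(len(default_vec.vocabulary_), "->", len(tuned_vec.vocabulary_))
```

The same pruning applied to the full license corpus is what shortens the document vectors and buys the 1.1x speedup in row 3.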
I will be opening a PR so you can reproduce the results in row 3 and merge the changes if you consider them relevant.
Important notes:
- I've left out the elapsed times for all the other algorithms, because I ran those experiments in a different context, so comparing the times wouldn't be fair.
- All the results differ from the last report I could find. I do not fully understand why some of them are so different; it is probably due to changes in the test files or in the algorithms. In any case, 62.0% is the new best result in both reports.
- My findings may help improve other agents that use thresholds, such as Ngram.
- This new state-of-atarashi performance 😅 may also raise the bar for future agent implementations, since it would be the new baseline.
That's a very detailed evaluation, @xavierfigueroav. Thank you for providing the info.
Maybe, if you can provide a good overview of the baseline, we can put it on our wiki and use it to compare with different solutions (as you mentioned).