
Separate/joined spellings and synonyms aren't recognized

Open e-orlov opened this issue 2 years ago • 3 comments

Keywords "trinkwasser test", "trinkwassertest" and "analyse trinkwasser" aren't clustered at all.

e-orlov avatar Nov 15 '21 10:11 e-orlov

Which version of PolyFuzz are you using? Also, could you create a reproducible example? Since PolyFuzz can use many models, without any code it is difficult to see what is happening in your use case.

MaartenGr avatar Nov 15 '21 11:11 MaartenGr

I'm using TF-IDF, as implemented at https://share.streamlit.io/charlywargnier/keyword-clustering-app/main/app.py / https://github.com/searchsolved/search-solved-public-seo/blob/main/Keyword_Clustering_Tool/Keyword_Clustering_Tool_V2.ipynb (code block 12)

Keywords are here: https://docs.google.com/spreadsheets/d/1nkiFNO8JadbaFcL7BvYKCLNPYPB5ILJwk2K__2DOzdc/edit?usp=sharing

Maybe PolyFuzz is not the right tool for this. To put "trinkwasser test" and "trinkwassertest" into the same cluster, the words of each keyword would have to be permuted and the minimal Levenshtein distance between the permutations searched for. But for "trinkwasser test" and "analyse trinkwasser" there should probably be some "real" synonym search, maybe even based on a synonym vocabulary...
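To make the permutation idea concrete, here is a rough sketch in plain Python. It is not part of PolyFuzz; the function names (`levenshtein`, `joined_permutations`, `min_permuted_distance`) and the `max_tokens` cap are made up for illustration, and joining the words without spaces is just one possible way to compare permutations.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def joined_permutations(keyword: str, max_tokens: int = 6) -> set:
    """All word orders of a keyword, joined without spaces.

    max_tokens is a hypothetical safety cap: the number of permutations
    grows factorially with the number of words.
    """
    tokens = keyword.lower().split()
    if len(tokens) > max_tokens:
        raise ValueError("too many words to permute")
    return {"".join(order) for order in permutations(tokens)}


def min_permuted_distance(a: str, b: str) -> int:
    """Smallest edit distance over all word orders of both keywords."""
    return min(
        levenshtein(x, y)
        for x in joined_permutations(a)
        for y in joined_permutations(b)
    )


print(min_permuted_distance("trinkwasser test", "trinkwassertest"))  # 0
print(min_permuted_distance("analyse trinkwasser", "trinkwassertest"))
```

The first pair collapses to distance 0, but the second pair still ends up far apart because "analyse" and "test" share almost no characters, which is why a string-distance approach alone would not cover the synonym case.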

e-orlov avatar Nov 15 '21 11:11 e-orlov

Let me start by saying that I cannot give much support for that tool specifically as I did not create it. Having said that, I did try it out with PolyFuzz directly and it seems that "trinkwasser test" gets grouped with "trinkwassertest" but not with "analyse trinkwasser". Most likely, using TF-IDF they are simply not similar enough to each other. You can try to circumvent this issue by using a different technique than TF-IDF, since TF-IDF tries to mirror Levenshtein distance.
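As a rough illustration of why the scores differ, the sketch below compares the keywords by cosine similarity over character trigram counts. This is a simplification written for this thread, not PolyFuzz's actual TF-IDF model: it assumes a trigram representation, leaves out the IDF weighting entirely, and the helper names `char_ngrams` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (spaces included)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    norm_a = sqrt(sum(count * count for count in grams_a.values()))
    norm_b = sqrt(sum(count * count for count in grams_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than n have no n-grams
    return dot / (norm_a * norm_b)


query = "trinkwasser test"
for candidate in ("trinkwassertest", "analyse trinkwasser"):
    print(candidate, round(cosine_similarity(query, candidate), 3))
```

"trinkwassertest" shares nearly all of its trigrams with "trinkwasser test", while "analyse trinkwasser" only shares the ones inside "trinkwasser", so it scores noticeably lower and can fall below whatever similarity threshold is used for grouping.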

You can implement or use any distance measure in PolyFuzz that you would like. However, if you are looking for semantic similarity and not so much string similarity, then I would advise going for embedding-based methods such as BERT models, sentence-transformers, Hugging Face, or Flair.

You can find more information about that here and here.

MaartenGr avatar Nov 15 '21 12:11 MaartenGr