PolyFuzz
add fuzz transformer
Hi @MaartenGr, I have fine-tuned a transformer for character-level similarity to do fuzzy matching. You can read about how I did it here: LinkedIn post explanation: https://www.linkedin.com/feed/update/urn:li:activity:6819456033992253440/ Model on the Hugging Face Hub: https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher
Would you like me to create a pull request if it fits PolyFuzz?
Thanks, Shahrukh
You already can! PolyFuzz supports Flair, which in turn supports sentence-transformers, on which your model is based. If you run the following code, you can use the model:
from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)
Thanks for your response. I was able to run the model; however, it produces substandard results compared to the actual model. This is because, in my implementation, I break the input string into characters before tokenization, which really helps the model optimize for the distance objective. For instance, "hello" would be preprocessed as "h e l l o". Please let me know how to proceed with this. Also, would you like me to document this model in the README?
Please see the results below as well
Hmmm, in that case, would it not be a matter of preprocessing the words before passing them to PolyFuzz? Something like this:
from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings
embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]
# Split each word into space-separated characters, e.g. "apple" -> "a p p l e"
from_list = [" ".join(word) for word in from_list]
to_list = [" ".join(word) for word in to_list]
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)
Then you would only need to transform them back into words. I am a bit hesitant to add support for a specific model that I currently have no benchmark for. Do you have a paper related to this model?
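The "transform them back into words" step mentioned above could look roughly like the sketch below. The helper names `to_chars` and `from_chars` are hypothetical and not part of PolyFuzz; the sketch only assumes that the spacing was produced with a single space between characters, as in the snippet above.

```python
def to_chars(word: str) -> str:
    """Insert a single space between characters: "hello" -> "h e l l o"."""
    return " ".join(word)


def from_chars(spaced: str) -> str:
    """Undo to_chars by keeping every second character.

    Because to_chars puts exactly one separator between characters,
    slicing with a step of 2 also restores words that contain spaces.
    """
    return spaced[::2]


words = ["apple", "appl", "ice cream"]
spaced = [to_chars(w) for w in words]        # what the matcher would see
restored = [from_chars(s) for s in spaced]   # back to the original words

print(spaced)
print(restored)
```

Assuming the table returned by `model.get_matches()` has its usual `From` and `To` columns, the same `from_chars` helper could be applied to both columns (for example with `.map(from_chars)`) to get readable matches back.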
@MaartenGr I plan to write an arXiv paper on this; however, it could take some time. In the meantime, would you be okay if I do a direct comparative analysis of this model against the BERT-based embedding model in PolyFuzz? I already have the dataset for fuzzy-matching benchmarking.
The issue with the dataset that you shared is that the values it contains are not ground truth, since they are computed with Levenshtein. A model that focuses on char-level embeddings is therefore likely to outperform a model that does not, regardless of its actual accuracy. It would be nice if you could test on a dataset that is often used for string-matching research.
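To make the concern concrete, here is a minimal sketch of how such labels might be produced. The thread only says the values were "computed with Levenshtein"; the normalization by the longer string's length and the function names below are assumptions for illustration, not the dataset's actual recipe.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed labeling function: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


for pair in [("apple", "appl"), ("house", "mouse"), ("recal", "apple")]:
    print(pair, levenshtein_similarity(*pair))
```

Labels of this kind depend only on character edits, so a model trained on character-level input is effectively being scored on how well it imitates Levenshtein rather than on whether two strings refer to the same thing.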
Could you point me to a dataset that could be used here? Also, is there any chance we could collaborate on writing something formal (an arXiv paper or similar) about different neural approaches to string matching?
Apologies for the late response. I believe it would take several datasets and evaluation measures to thoroughly validate the model that you created. Although I would be interested in collaborating, I am afraid I currently do not have the time to write an extensive paper on the subject.
That won't be a problem. I'm willing to do the write-ups and experimentation, since I will have the summer break from school. It would be really great if you could help with ideas and review what I do. Please let me know if that's possible for you :)
I cannot make any promises, but perhaps I can make some time to review ideas and experiments. It would be interesting to have a nice overview of string-similarity algorithms.