add fuzz transformer

Open shahrukhx01 opened this issue 2 years ago • 9 comments

Hi @MaartenGr, I have fine-tuned a fuzzy transformer for character-level similarity to do fuzzy matching. You can read about how I did it here:
LinkedIn post explanation: https://www.linkedin.com/feed/update/urn:li:activity:6819456033992253440/
Model on the Hugging Face Hub: https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher

Would you like me to create a pull request if it fits PolyFuzz?

Thanks, Shahrukh

shahrukhx01, Jul 10 '21 21:07

You already can! PolyFuzz supports Flair, which in turn supports sentence-transformers, on which your model is based. The following code lets you use the model:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

MaartenGr, Jul 11 '21 07:07

Thanks for your response. I was able to run the model; however, it produces substandard results compared to the actual model. This is because, in my implementation, I break the input string into characters before tokenization, which really helps the model optimize for the distance objective. For instance, "hello" would be preprocessed as "h e l l o". Please let me know how to proceed with this. Also, would you like me to document this model in the README? Please see the results below as well: 2416004

shahrukhx01, Jul 11 '21 08:07

Hmmm, in that case, would it not be a matter of preprocessing the words before passing them to PolyFuzz? Something like this:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Split each word into space-separated characters, e.g. "apple" -> "a p p l e"
from_list = [" ".join(word) for word in from_list]
to_list = [" ".join(word) for word in to_list]

matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

Then you would only need to transform them back into words. I am a bit hesitant to add support for a specific model that I currently have no benchmark for. Do you have a paper related to this model?
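For the "transform them back into words" step, a minimal sketch could look like the following. The helper names (`split_chars`, `build_lookup`, `restore_pairs`) are hypothetical and not part of PolyFuzz, and the matched pair at the end is a hand-written placeholder rather than real model output.

```python
def split_chars(word: str) -> str:
    """Turn "hello" into "h e l l o" so each character becomes its own token."""
    return " ".join(word)


def build_lookup(words):
    """Map every character-split string back to the word it came from."""
    return {split_chars(word): word for word in words}


def restore_pairs(pairs, from_lookup, to_lookup):
    """Replace character-split strings in (from, to) pairs with the original words."""
    return [(from_lookup[src], to_lookup[dst]) for src, dst in pairs]


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

from_lookup = build_lookup(from_list)
to_lookup = build_lookup(to_list)

# These character-split lists are what would be handed to the matcher.
spaced_from = list(from_lookup)
spaced_to = list(to_lookup)

# Placeholder pair standing in for matcher output (not a real result).
placeholder_pairs = [("a p p l", "a p p l e")]
restored = restore_pairs(placeholder_pairs, from_lookup, to_lookup)
print(restored)
```

A lookup table is used here instead of simply deleting spaces, so that inputs which already contain spaces (multi-word strings, for example) are restored exactly.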

MaartenGr, Jul 15 '21 06:07

@MaartenGr I plan to write an arXiv paper on this; however, it could take some time. In the meantime, would you be okay if I do a direct comparative analysis of this model with the BERT-based embedding model in PolyFuzz? I already have the dataset for fuzzy benchmarking.

shahrukhx01, Jul 15 '21 06:07

The issue with the dataset that you shared is that the values it contains are not ground truth, since they are computed with Levenshtein. A model that focuses on char-level embeddings is therefore likely to outperform a model that does not, regardless of its actual accuracy. It would be nice if you could test on a dataset that is often used for string-matching research.
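To make the concern concrete, here is a rough sketch of how a Levenshtein-derived label might be computed. The normalization by the longer string's length is an assumption for illustration; the thread does not say how the shared dataset scaled its values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


for pair in [("apple", "appl"), ("house", "mouse"), ("kitten", "sitting")]:
    print(pair, levenshtein(*pair), levenshtein_similarity(*pair))
```

Labels produced this way depend only on character edits, so a benchmark built on them mostly measures how well a model reproduces Levenshtein rather than whether its matches are correct.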

MaartenGr, Jul 20 '21 08:07

Could you point me to a dataset that could be used here? Also, is there any chance we can collaborate on writing something formal (an arXiv paper or similar) about different neural approaches to string matching?

shahrukhx01, Jul 20 '21 09:07

Apologies for the late response. I believe it would take several datasets and evaluation measures to thoroughly validate the model that you created. Although I would be interested in collaborating, I am afraid I currently do not have the time to write an extensive paper on the subject.

MaartenGr, Jul 27 '21 11:07

That won't be a problem. I'm willing to do the write-ups and experimentation, since I will be on summer break from school. It would be really great if you could help with ideas and review what I do. Please let me know if that's possible for you :)

shahrukhx01, Jul 27 '21 11:07

I cannot make any promises, but perhaps I can make some time to review ideas and experiments. It would be interesting to have a nice overview of string-similarity algorithms.

MaartenGr, Aug 03 '21 05:08