
Phonetic vectorization?

Open dkbarn opened this issue 1 year ago • 2 comments

This is more of a question than a bug report:

My use case is somewhat different from the ones covered in the library's documentation. I want to search for similar-sounding syllables rather than match text character by character. My plan is to apply some sort of phonetic encoding to my corpus (e.g. Soundex or Metaphone), but I am not certain how to do this in a way that is compatible with neofuzz's Process. It doesn't look like scikit-learn provides an out-of-the-box vectorizer for phonetic encoding of text, and I'm not sure whether the SubWordVectorizer could be leveraged for this.

Any pointers on how to achieve this with neofuzz?

dkbarn avatar Nov 04 '24 05:11 dkbarn

I'd say the easiest way is to pass a custom preprocessor to a vectorizer:

from neofuzz import Process
from pyphonetics import Metaphone
from sklearn.feature_extraction.text import CountVectorizer

metaphone = Metaphone()

def phonetic_preprocessor(text: str) -> str:
    # Replace the raw text with its Metaphone code before n-grams are extracted.
    return metaphone.phonetics(text)

# Character n-gram sizes to extract; (1, 3) is an example value, tune it for your data.
ngram_range = (1, 3)
vectorizer = CountVectorizer(
    ngram_range=ngram_range, analyzer="char", preprocessor=phonetic_preprocessor
)
process = Process(vectorizer, metric="cosine")
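
To get a feel for what a phonetic preprocessor does to the text before the character n-grams are counted, here is a rough, dependency-free sketch. It uses a simplified American Soundex written by hand instead of pyphonetics, and it encodes each whitespace-separated word separately. The names `soundex`, `phonetic_preprocessor` and `char_ngrams` are made up for this illustration and are not part of neofuzz or scikit-learn.

```python
# Illustrative sketch only: a hand-written, simplified American Soundex.
# None of these helpers come from neofuzz, scikit-learn or pyphonetics.

_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or "" if the word has no ASCII letters."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded sound.
        if digit and digit != prev:
            code += digit
        # H and W do not separate identical codes; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


def phonetic_preprocessor(text: str) -> str:
    """Encode each whitespace-separated word, dropping words with no letters."""
    codes = (soundex(token) for token in text.split())
    return " ".join(code for code in codes if code)


def char_ngrams(text: str, n: int) -> list[str]:
    """Character n-grams of the kind a char-level vectorizer would count."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Smith", "Smyth"):
        print(name, "->", phonetic_preprocessor(name))
    print(char_ngrams(phonetic_preprocessor("Robert"), 2))
```

A function with this shape (str in, str out) should be usable as the `preprocessor` argument in the snippet above. Whether per-word encoding or encoding the whole string at once works better probably depends on the corpus.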

x-tabdeveloping avatar Nov 04 '24 13:11 x-tabdeveloping

Now that you mention it, this would probably make a great addition to the docs.

x-tabdeveloping avatar Nov 04 '24 13:11 x-tabdeveloping