English-to-IPA

Get homophones?

Open youssefavx opened this issue 5 years ago • 3 comments

Hi, is there a function similar to the rhyming function but for homophones?

youssefavx, Sep 13 '20 14:09

Never mind! Sorry for the bother. I wrote this function:

import eng_to_ipa as ipa


def get_homophones(word):
    """Return [spelling, ipa] pairs for other words pronounced exactly like `word`."""
    words_that_sound_the_same = []
    the_way_this_word_looks = word
    the_way_this_word_sounds = ipa.convert(word)
    # ipa.contains gives [spelling, ipa] pairs whose IPA contains this sound.
    words_that_contain_that_sound = ipa.contains(the_way_this_word_sounds)
    for every_word in words_that_contain_that_sound:
        the_way_that_word_looks = every_word[0]
        the_way_that_word_sounds = every_word[1]

        # Keep words that sound identical but are spelled differently.
        if the_way_this_word_sounds == the_way_that_word_sounds:
            if the_way_that_word_looks != the_way_this_word_looks:
                words_that_sound_the_same.append(every_word)

    return words_that_sound_the_same


get_homophones('their')  # [['there', 'ðɛr'], ["they're", 'ðɛr']]

youssefavx, Sep 13 '20 15:09

Just reopening to ask for tips on a related problem: I'm trying to write a function that gives you a homophone made up of multiple words.

For instance, given the word "breakthrough", give me the words "break" and "through", in order.

I made a function that chunks that into:

[['b', 'reɪkθru'],
 ['b', 'r', 'eɪkθru'],
 ['br', 'e', 'ɪkθru'],
 ['bre', 'ɪ', 'kθru'],
 ['breɪ', 'k', 'θru'],
 ['breɪk', 'θ', 'ru'],
 ['breɪkθ', 'r', 'u'],
 ['breɪkθr', 'u']]

(I've already written the function that does this.)
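The chunking function itself isn't shown in this thread. A hypothetical reconstruction that reproduces the list above (the name `single_char_chunks` is made up for illustration) might isolate each character in turn and keep whatever comes before and after it:

```python
def single_char_chunks(ipa_string):
    """Hypothetical helper: isolate each character, keeping the text on either side."""
    chunks = []
    for i in range(len(ipa_string)):
        parts = [ipa_string[:i], ipa_string[i], ipa_string[i + 1:]]
        # At either end of the string one side is empty, so drop it.
        chunks.append([part for part in parts if part])
    return chunks


for chunk in single_char_chunks('breɪkθru'):
    print(chunk)
```

This is only a guess at the behaviour from the output shown, not the original code.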

Then it takes each segment (for example, br) and looks for words that have exactly that sound and no other, which I can do with my homophone function above.

The problem is dividing these arrays up into every possible way a word could be broken up. My poor function can only split off one character at a time, but ideally you'd have something that splits off multiple characters, depending on how many IPA characters the word has. For breakthrough, that's 8 characters (ignoring stress marks for now, since I'm not sure how to deal with them). So the algorithm would divide breakthrough up like so:

1, 7:
['b', 'reɪkθru']
['breɪkθr', 'u']

2, 6:
['br', 'eɪkθru']
['breɪkθ', 'ru']

3, 5:
['bre', 'ɪkθru']

4, 4:
['breɪ', 'kθru']

5, 3:
['breɪk', 'θru']

and so on...
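A minimal sketch of how these splits could be enumerated, assuming the IPA string is handled one character at a time and stress marks have already been removed. Both function names are made up for illustration. `two_way_splits` covers the two-piece cuts listed above, and `all_segmentations` generalizes to any number of pieces:

```python
def two_way_splits(s):
    """Every way to cut s into two non-empty pieces."""
    return [[s[:i], s[i:]] for i in range(1, len(s))]


def all_segmentations(s):
    """Yield every way to cut s into one or more contiguous, non-empty pieces."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        # Pair this first piece with every segmentation of the remainder.
        for rest in all_segmentations(s[i:]):
            yield [head] + rest


for split in two_way_splits('breɪkθru'):
    print(split)
```

A string of n characters has 2^(n-1) segmentations, so an 8-character word has 128. That is small enough to enumerate directly, though it grows quickly for longer words.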

But I'm posting this mostly to get any suggestions on how to go about this.

What I want to end up with is a function that I can call and get something like this:

multi_homophone('breakthrough')
[['break', 'breɪk'], ['through', 'θru']],
... (other options of combinations)
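One possible way to get there, sketched under some assumptions: rather than generating every split and then looking each piece up, walk the IPA string left to right and only extend pieces that match a known sound (the classic "word break" recursion). The `SOUND_TO_WORDS` table below is a hypothetical stand-in for a real pronunciation dictionary, and this version takes the IPA string rather than the spelling:

```python
from functools import lru_cache

# Hypothetical lookup table from IPA string to spellings; illustrative only.
# In practice it would have to be built from a pronunciation dictionary.
SOUND_TO_WORDS = {
    'breɪk': ['break', 'brake'],
    'θru': ['through', 'threw'],
}


def multi_homophone(ipa_string, sound_to_words=SOUND_TO_WORDS):
    """Return every sequence of [word, ipa] pairs whose sounds join up to ipa_string."""

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(ipa_string):
            return ((),)  # exactly one way to spell nothing
        results = []
        for end in range(start + 1, len(ipa_string) + 1):
            sound = ipa_string[start:end]
            for word in sound_to_words.get(sound, ()):
                # Combine this word with every way of spelling the remainder.
                for tail in solve(end):
                    results.append(((word, sound),) + tail)
        return tuple(results)

    return [[list(pair) for pair in combo] for combo in solve(0)]


for combo in multi_homophone('breɪkθru'):
    print(combo)
```

With this toy table the first combination is `[['break', 'breɪk'], ['through', 'θru']]`, followed by the variants using "threw" and "brake". Memoizing on the start position means each suffix is only solved once.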

youssefavx, Sep 14 '20 17:09

Take a look at https://github.com/Kyubyong/g2p. Basically, they part-of-speech tag the whole sentence and then use whether or not the word is a verb to decide which pronunciation to use. For others, like "bowing" (bending at the waist) and "bowing" (playing with a bow), you'd have to use context clues. That would probably require a machine learning model.

nikitalita, Mar 05 '21 19:03