cracking-the-da-vinci-code-with-google-interview-problems-and-nlp-in-python icon indicating copy to clipboard operation
cracking-the-da-vinci-code-with-google-interview-problems-and-nlp-in-python copied to clipboard

Possible speedup when using n-grams

Open VitamintK opened this issue 8 years ago • 1 comments
trafficstars

Hey lucas! If you're going to use a list of frequent bigrams/trigrams anyways, you don't need to do the first part where you generate all sane permutations of the input string, right?

That is, if you have a list common_ngrams_dict, you could just do

for trigram in common_ngrams_dict:
    if sorted(trigram) == sorted(input_string):
        filtered_solution.append(trigram)

Which would avoid the O(n!) stuff in the first part. I still enjoyed the writeup, and I see how the first part still makes sense cause you'd still want to generate all sane permutations if you use the NLP approach and the anagram isn't a frequent n-gram. :+1:

VitamintK avatar Apr 07 '17 00:04 VitamintK

Good find @VitamintK! Your workaround is great and I should probably add it into the README.

However for these scenarios:

  • final anagram is not in trigram set
  • final anagram with 3 words can result in anagram with != 3 words (we'd also need to change the recursive code to perform space insertions / no space insertions in each recursive call)

In practice, a good way to solve this puzzle would be (like you said) to run bi-gram/trigram/quadgram matching with the input string to first quickly remove those candidates. But! If that fails, we still need to fallback on the valid-word permutation generation with NLP analysis on the results.

💯 👍 🍷

codelucas avatar Apr 07 '17 03:04 codelucas