
Feature request: Scorer dealing with OOV

Open bernardohenz opened this issue 2 years ago • 2 comments

Hi,

My team and I use STT for Brazilian Portuguese, and we ran into problems with consecutive OOV (out-of-vocabulary) words. When the decoder receives two or more OOV words in a row, it enters a state in which it stops accepting any further words.

After some experimentation, I removed the early return of OOV_SCORE (in https://github.com/coqui-ai/STT/blob/main/native_client/ctcdecode/scorer.cpp#L247) and instead applied a penalty on top of the BaseScore, as follows:

    // encounter OOV
    // if (word_index == lm::kUNK) {
    //   return OOV_SCORE;
    // }

    cond_prob = language_model_->BaseScore(in_state, word_index, out_state);
    if (word_index == lm::kUNK) {
      cond_prob -= 10;
    }

I believe there could be a better solution, so I am opening this issue to discuss one.

As your LM is built over a huge corpus, I suppose your models do not suffer from OOV words, but I believe many people may run into OOV problems with LMs built over smaller corpora.
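To make the effect of the scoring change above concrete, here is a minimal, self-contained sketch that compares the two strategies on a toy model. It does not use KenLM or the STT scorer: `toy_base_score`, `score_sentence`, the toy vocabulary, and the value of `kUnkBaseScore` are all hypothetical, and the hard OOV score of -1000 is an assumption about the value of `OOV_SCORE`.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical constants for illustration only.
const double kOovScore = -1000.0;   // assumed value of the hard OOV score
const double kUnkBaseScore = -5.0;  // assumed score the toy LM gives to <unk>

// Toy stand-in for a language model lookup: returns the score of a known
// word, or the <unk> score (and sets is_oov) for an unknown one.
double toy_base_score(const std::unordered_map<std::string, double>& lm,
                      const std::string& word, bool& is_oov) {
  auto it = lm.find(word);
  is_oov = (it == lm.end());
  return is_oov ? kUnkBaseScore : it->second;
}

// Sums word scores. With hard_oov, every OOV word gets kOovScore; otherwise
// the <unk> score is kept and oov_penalty is subtracted from it.
double score_sentence(const std::unordered_map<std::string, double>& lm,
                      const std::vector<std::string>& words,
                      bool hard_oov, double oov_penalty = 10.0) {
  assert(oov_penalty >= 0.0);
  double total = 0.0;
  for (const std::string& word : words) {
    bool is_oov = false;
    double score = toy_base_score(lm, word, is_oov);
    if (is_oov) {
      score = hard_oov ? kOovScore : score - oov_penalty;
    }
    total += score;
  }
  return total;
}
```

With a vocabulary of `{"bom": -1.0, "dia": -1.5}` and the input `bom xyz qwe dia`, the hard variant totals -2002.5 while the penalized variant totals -32.5, so under the penalized variant a hypothesis containing two consecutive OOV words still has a score comparable to that of other beams.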

bernardohenz avatar Aug 26 '21 18:08 bernardohenz

Thanks for opening! Did you also make parallel changes to the PathTrie to go with this scoring change here? Could you share them as well so we can have the same starting point?

reuben avatar Aug 26 '21 19:08 reuben

I experimented with some changes, but as soon as I changed the scorer, I undid the changes to PathTrie.

If I am not mistaken, though, the only change was to return a path even when the character is not found in the dictionary. This code is inside get_path_trie:

    if (has_dictionary_) {
      matcher_->SetState(dictionary_state_);
      bool found = matcher_->Find(new_char + 1);
      PathTrie* new_path = new PathTrie;
      new_path->character = new_char;
      new_path->timestep = new_timestep;
      new_path->parent = this;
      new_path->dictionary_ = dictionary_;
      new_path->has_dictionary_ = true;
      new_path->matcher_ = matcher_;
      new_path->log_prob_c = cur_log_prob_c;

      // set spell checker state
      // check to see if next state is final
      auto FSTZERO = fst::TropicalWeight::Zero();
      auto final_weight = dictionary_->Final(dictionary_state_);
      if (found)
        final_weight = dictionary_->Final(matcher_->Value().nextstate);
      bool is_final = (final_weight != FSTZERO);
      if ((is_final && reset) || (!found)) {
        // restart spell checker at the start state
        new_path->dictionary_state_ = dictionary_->Start();
      } else {
        // go to next state
        new_path->dictionary_state_ = matcher_->Value().nextstate;
      }
      children_.push_back(std::make_pair(new_char, new_path));
      return new_path;
    } else { .....
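The dictionary-state logic in the snippet above can be illustrated without OpenFst. The sketch below uses a hypothetical character trie (`ToyDictionary`) in place of the FST and compares two policies when a character has no arc from the current state: pruning the path (assumed here to be the stock behavior) and restarting at the start state (the change described above). All names are made up for illustration, and the toy ignores word separators.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical toy dictionary: a character trie standing in for the FST.
struct ToyDictionary {
  std::map<std::pair<int, char>, int> arcs;  // (state, char) -> next state
  std::set<int> finals;                      // states that complete a word
  int next_id = 1;                           // state 0 is the start state

  void add_word(const std::string& word) {
    int state = 0;
    for (char c : word) {
      auto key = std::make_pair(state, c);
      auto it = arcs.find(key);
      if (it == arcs.end()) it = arcs.emplace(key, next_id++).first;
      state = it->second;
    }
    finals.insert(state);
  }
};

const int kPruned = -1;  // marker for "no path returned"

// Returns the dictionary state after consuming c from the given state.
// On a miss, either prunes the path or restarts at the start state.
int advance(const ToyDictionary& dict, int state, char c,
            bool reset, bool restart_on_miss) {
  auto it = dict.arcs.find(std::make_pair(state, c));
  if (it == dict.arcs.end()) return restart_on_miss ? 0 : kPruned;
  bool is_final = dict.finals.count(it->second) > 0;
  return (is_final && reset) ? 0 : it->second;
}

// True if the whole text can be consumed without the path being pruned.
bool survives(const ToyDictionary& dict, const std::string& text,
              bool restart_on_miss) {
  int state = 0;
  for (char c : text) {
    state = advance(dict, state, c, /*reset=*/true, restart_on_miss);
    if (state == kPruned) return false;
  }
  return true;
}
```

With a dictionary containing only `dia`, the text `dxa` is pruned when misses are not tolerated and survives when the state restarts on a miss, which mirrors the `(is_final && reset) || (!found)` condition in the snippet.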

bernardohenz avatar Aug 26 '21 19:08 bernardohenz