
Discovering synonyms using pre-trained Bert models

Open TemuujinE opened this issue 1 year ago • 1 comments

I have also posted this request as a topic in the Rasa forum; if you are interested, the topic can be found here.

What problem are you trying to solve?

The general idea is to implement a custom component that uses BERT to discover new synonyms.

We have created separate files for storing the synonyms of each slot and entity, and populated those files with the synonyms users might utter when talking to our chatbot. Adding new synonyms to those files by hand will become increasingly difficult over time, so we are wondering whether this process could be automated. We want to use synonyms because each slot and entity we have defined takes roughly 20-40 possible values. The chatbot we are developing is a closed-domain, retrieval-based chatbot for a bank, and mapping every variation users could utter to its canonical synonym would make querying our database with exact matches very simple. For testing purposes we are using an SQLite database to store our answers. The full-text search capability of MongoDB has crossed our minds, but because there are completely different ways of referring to the same thing, we have given up on that for now.

What's your suggested solution?

  1. Create a custom component that loads a pre-trained BERT model:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model = 'bert-base-uncased')

This model would work the following way:

>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]

The component we are trying to implement takes the user message and the entities extracted by the DIET classifier. Using the start and end indexes of each extracted entity, we replace the corresponding entity value with [MASK]. We then pass each masked sentence to the unmasker() function above to predict the masked word.
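The masking step can be sketched in plain Python. This is a minimal illustration, not the component itself: the helper name `mask_entities`, the example sentence, and the `product` entity are hypothetical, and the entity dictionaries are assumed to carry the `entity`, `start`, `end`, and `value` keys that Rasa attaches to extracted entities.

```python
from typing import Dict, List

MASK_TOKEN = "[MASK]"  # mask token expected by bert-base-uncased


def mask_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Return one masked copy of `text` per entity, using its start/end offsets."""
    masked = []
    for entity in entities:
        start, end = entity["start"], entity["end"]
        masked.append(
            {
                "entity": entity["entity"],
                "value": text[start:end],
                # Replace only this entity's character span with the mask token.
                "masked_text": text[:start] + MASK_TOKEN + text[end:],
            }
        )
    return masked


# Hypothetical user message and entity (offsets computed rather than hard-coded).
text = "What is the interest rate on a mortgage"
start = text.index("mortgage")
entities = [
    {"entity": "product", "start": start, "end": start + len("mortgage"), "value": "mortgage"}
]
masked = mask_entities(text, entities)
print(masked[0]["masked_text"])
```

Each `masked_text` could then be passed to `unmasker()`. Note that a single [MASK] stands for one token, so a multi-word entity value replaced by one [MASK] would only yield single-token predictions.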

And here is the important part, or so we believe. The model predicts words that could fill the place of the [MASK] such that the context of the original sentence is preserved. We therefore assume that the predicted words are synonyms of the word the user originally entered. Could we then write these predicted words to the synonyms.yml file of each slot and entity?
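One way this write-back could look is sketched here, under stated assumptions. The helper names `candidate_synonyms` and `render_synonym_entry`, the score threshold of 0.05, and the canonical value `language` are all hypothetical; the rendered text assumes the `synonym` / `examples` layout of Rasa's YAML training data. The predictions reuse the scores from the unmasker output shown earlier.

```python
from typing import Dict, List


def candidate_synonyms(predictions: List[Dict], original: str, min_score: float = 0.05) -> List[str]:
    """Keep predicted tokens above a (hypothetical) score threshold, minus the original word."""
    original = original.strip().lower()
    seen = set()
    candidates = []
    for prediction in predictions:
        word = prediction["token_str"].strip().lower()
        if prediction["score"] < min_score or word == original or word in seen:
            continue
        seen.add(word)
        candidates.append(word)
    return candidates


def render_synonym_entry(canonical: str, candidates: List[str]) -> str:
    """Render one synonym entry as text (assumed Rasa YAML training-data layout)."""
    lines = [f"- synonym: {canonical}", "  examples: |"]
    lines += [f"    - {word}" for word in candidates]
    return "\n".join(lines)


# Scores and tokens taken from the unmasker output above.
predictions = [
    {"score": 0.1073106899857521, "token_str": "fashion"},
    {"score": 0.08774490654468536, "token_str": "role"},
    {"score": 0.05338378623127937, "token_str": "new"},
    {"score": 0.04667217284440994, "token_str": "super"},
    {"score": 0.027095865458250046, "token_str": "fine"},
]
candidates = candidate_synonyms(predictions, original="language")  # hypothetical original word
entry = render_synonym_entry("language", candidates)
print(entry)
```

In this sample the surviving candidates (fashion, role, new) fit the sentence but are not synonyms of one another, so a threshold alone may not be enough, and some review step before writing to the file is probably worth considering.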

After this, we map the entity values extracted by the DIET classifier to their respective synonyms, then proceed to the ResponseSelector.

  2. In the config.yml file, add this component after the EntitySynonymMapper component. The reason is:
  • If the slot/entity values we extracted already exist in the synonyms.yml file of the corresponding slot/entity, we proceed to the ResponseSelector without activating the BERT component above.
  • Otherwise, we apply what we described above.
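The branching described in these two bullets can be sketched as a simple check. This is an illustration only: `split_known_unknown` and the sample synonym map are hypothetical, and the map is assumed to go from surface form to canonical value, which is how the values would look once the EntitySynonymMapper has run.

```python
from typing import Dict, List, Tuple


def split_known_unknown(
    entities: List[Dict], synonym_map: Dict[str, str]
) -> Tuple[List[Dict], List[Dict]]:
    """Split entities into those already covered by the synonym map and those that are not."""
    # A value counts as known if it is a listed synonym or a canonical value.
    known_values = {k.lower() for k in synonym_map} | {v.lower() for v in synonym_map.values()}
    known, unknown = [], []
    for entity in entities:
        if str(entity["value"]).lower() in known_values:
            known.append(entity)    # skip the BERT step, go on to the ResponseSelector
        else:
            unknown.append(entity)  # candidate for masking and synonym discovery
    return known, unknown


# Hypothetical synonym map and extracted entities.
synonym_map = {"home loan": "mortgage", "housing loan": "mortgage"}
entities = [
    {"entity": "product", "value": "mortgage"},
    {"entity": "product", "value": "house credit"},
]
known, unknown = split_known_unknown(entities, synonym_map)
print([e["value"] for e in unknown])
```

Only the entities in `unknown` would be masked and sent to the fill-mask model, which keeps the BERT call off the path for values that are already mapped.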

Examples (if relevant)

No response

Is anything blocking this from being implemented? (if relevant)

No response

Definition of Done

No response

TemuujinE avatar Aug 05 '22 09:08 TemuujinE

Fine-tuning the BERT model on our domain data might make the synonym discovery process focus more on domain-specific words. We haven't tried fine-tuning yet, as it requires a TPU for training. What do you think about this?

TemuujinE avatar Aug 05 '22 09:08 TemuujinE

➤ Maxime Verger commented:

:bulb: Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.

From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!

:arrow_right: More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.

sync-by-unito[bot] avatar Dec 16 '22 10:12 sync-by-unito[bot]