Trying to make it work for user-inputted sentences

Open LifeIsStrange opened this issue 3 years ago • 8 comments

Where the dataset-paths that you provide to the model must be in a format that follows the one introduced by Raganato et al. (2017). For reference, all the datasets in the directory data/WSD_Evaluation_Framework follow this format.

There is no folder WSD_Evaluation_Framework in data... Is there any tool to automatically convert a sentence into the expected format, or any example? @edobobo friendly ping

LifeIsStrange avatar Apr 07 '22 18:04 LifeIsStrange

Hey, as you can read in the README, you can download the datasets (or find the URLs to download them from) in the setup.sh script.

edobobo avatar Apr 07 '22 19:04 edobobo

@edobobo Yes, I have downloaded a dataset and looked at the format. It is unclear to me how to transform a given sentence into the expected format: when should a word get a wf tag vs. an instance tag? How do I generate those precise ids?

e.g. how to generate

<corpus lang="en" source="semeval2015">
<text id="d000">
<sentence id="d000.s000">
<wf lemma="this" pos="DET">This</wf>
<instance id="d000.s000.t000" lemma="document" pos="NOUN">document</instance>
<wf lemma="be" pos="VERB">is</wf>
<wf lemma="a" pos="DET">a</wf>
<instance id="d000.s000.t001" lemma="summary" pos="NOUN">summary</instance>
<wf lemma="of" pos="ADP">of</wf>
<wf lemma="the" pos="DET">the</wf>
<instance id="d000.s000.t002" lemma="european" pos="ADJ">European</instance>
<instance id="d000.s000.t003" lemma="public" pos="ADJ">Public</instance>
<instance id="d000.s000.t004" lemma="assessment" pos="NOUN">Assessment</instance>
<instance id="d000.s000.t005" lemma="report" pos="NOUN">Report</instance>
<wf lemma="(" pos=".">(</wf>
<wf lemma="epar" pos="NOUN">EPAR</wf>
<wf lemma=")" pos=".">)</wf>
<wf lemma="." pos=".">.</wf>
</sentence>
.....

LifeIsStrange avatar Apr 07 '22 19:04 LifeIsStrange

You don't need to generate the precise ids. If you want a lemma to be predicted, give it the tag "instance"; otherwise give it the tag "wf". The only requirement is that each instance id must be unique.
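As an illustration of the rule above, here is a minimal sketch that builds one sentence in the layout shown in the excerpt. It assumes you have already tokenized, lemmatized, and POS-tagged the sentence yourself; the function name `sentence_to_xml`, the `Token` tuple layout, and the `source="user_input"` value are hypothetical and not part of this repository. The ids simply follow the `d000.s000.t000` pattern from the excerpt, which is one easy way to keep them unique.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical token layout: (surface form, lemma, POS tag, disambiguate?)
Token = Tuple[str, str, str, bool]


def sentence_to_xml(
    tokens: List[Token],
    doc_id: str = "d000",
    sent_idx: int = 0,
    lang: str = "en",
    source: str = "user_input",  # assumption: any label for your own data
) -> ET.Element:
    """Build a <corpus> element holding one sentence.

    Tokens flagged for disambiguation become <instance> elements with a
    unique id; all other tokens become <wf> elements.
    """
    corpus = ET.Element("corpus", {"lang": lang, "source": source})
    text = ET.SubElement(corpus, "text", {"id": doc_id})
    sent_id = f"{doc_id}.s{sent_idx:03d}"
    sentence = ET.SubElement(text, "sentence", {"id": sent_id})

    instance_count = 0
    for surface, lemma, pos, disambiguate in tokens:
        if disambiguate:
            # A running counter per sentence keeps every instance id unique.
            attrs = {
                "id": f"{sent_id}.t{instance_count:03d}",
                "lemma": lemma,
                "pos": pos,
            }
            element = ET.SubElement(sentence, "instance", attrs)
            instance_count += 1
        else:
            element = ET.SubElement(sentence, "wf", {"lemma": lemma, "pos": pos})
        element.text = surface
    return corpus


if __name__ == "__main__":
    example = [
        ("This", "this", "DET", False),
        ("document", "document", "NOUN", True),
        ("is", "be", "VERB", False),
        ("a", "a", "DET", False),
        ("summary", "summary", "NOUN", True),
    ]
    print(ET.tostring(sentence_to_xml(example), encoding="unicode"))
```

The output is a single unindented line; the excerpt's one-element-per-line layout is only a formatting difference.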

edobobo avatar Apr 07 '22 19:04 edobobo

Thanks a lot. Is there a downside to making every word an instance?

LifeIsStrange avatar Apr 07 '22 19:04 LifeIsStrange

Nope, but you must be sure that the word (actually the lemma), together with its corresponding POS, is in the WordNet inventory.
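One possible way to apply this check before tagging, sketched under several assumptions: it uses NLTK's WordNet interface (`nltk.corpus.wordnet.synsets`) rather than anything from this repository, it requires `nltk` and its WordNet corpus to be installed, and the mapping from the coarse POS tags seen in the dataset excerpt to WordNet POS letters is a guess. The names `UPOS_TO_WN`, `in_wordnet`, and `choose_tags` are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumed mapping from the coarse tags in the excerpt to WordNet POS letters.
UPOS_TO_WN = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def in_wordnet(lemma: str, pos: str) -> bool:
    """Return True if NLTK's WordNet has at least one synset for (lemma, pos)."""
    # Imported lazily so the rest of the sketch works without NLTK installed.
    from nltk.corpus import wordnet as wn

    wn_pos = UPOS_TO_WN.get(pos)
    if wn_pos is None:
        # Determiners, adpositions, punctuation, etc. have no WordNet entries.
        return False
    return len(wn.synsets(lemma, pos=wn_pos)) > 0


def choose_tags(
    tokens: List[Tuple[str, str, str]],
    has_senses: Callable[[str, str], bool] = in_wordnet,
) -> List[Tuple[str, str, str, bool]]:
    """Flag each (surface, lemma, pos) token as an instance only if it has senses."""
    return [
        (surface, lemma, pos, has_senses(lemma, pos))
        for surface, lemma, pos in tokens
    ]
```

Tokens flagged `True` would be written as instance tags and the rest as wf tags. The `has_senses` parameter lets you swap in a different inventory lookup if the one used by the model differs from NLTK's.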

edobobo avatar Apr 07 '22 20:04 edobobo

Excellent! A few remaining questions: how does it work for multi-word expressions? E.g. if I want to disambiguate "living room". It's a bad example since it has only one definition, but it's still a valid token in WordNet. How should it be expressed? Not with two separate instances, right?

LifeIsStrange avatar Apr 07 '22 20:04 LifeIsStrange

Yes, just one instance. There are examples in the datasets if you want to be sure.

edobobo avatar Apr 07 '22 20:04 edobobo

Nice! What scheme is used for the POS tag notation? Is it Universal POS tagging, the Penn Treebank scheme, or something else?

Lower-priority question: is there a way to retrain this network using wordnet-english instead? It has received much more human effort than Princeton WordNet and is therefore more correct and complete.

LifeIsStrange avatar Apr 07 '22 20:04 LifeIsStrange