
Add support for synonyms?

Open flesler opened this issue 6 years ago • 11 comments

It'd be great, when normalizing text, to include synonyms (and even antonyms!), so that all synonyms are normalized to the same word (and not ${antonym} too).

It shouldn't be incredibly complex; it's equivalent to using replace(synonym, normalized) for each case (but much more optimized, I hope).

flesler avatar Aug 03 '18 18:08 flesler

This could make stuff like #Currency more useful: being able to normalize to either the name or the symbol would be great (unless it can already be done and I missed it).

flesler avatar Aug 03 '18 18:08 flesler

love this idea

spencermountain avatar Aug 03 '18 18:08 spencermountain

I temporarily implemented this myself, adding the following to the "plugin":

synonyms: {
	u: 'you',
	ya: 'you',
	bc: 'because',
	r: 'are',
	sth: 'something',
	pls: 'please',
	sry: 'sorry',
	'&': 'and',
	okay: 'ok',
	congrats: 'congratulations',
	congratz: 'congratulations',
}

Then I iterate and replace(key, synonyms[key])
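
For anyone trying the same workaround, here is a minimal sketch of that iterate-and-replace loop in plain JavaScript. It does not use any compromise API; `normalizeSynonyms` and `escapeRegExp` are hypothetical helper names, and the table is a subset of the one above. It assumes tokens are separated by whitespace, so a key followed directly by punctuation (like "u,") is left alone.

```javascript
// Synonym table copied (in part) from the comment above.
const synonyms = {
  u: 'you',
  ya: 'you',
  bc: 'because',
  r: 'are',
  pls: 'please',
  '&': 'and',
  congrats: 'congratulations',
};

// Escape regex metacharacters so keys such as '&' are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: replace every key that appears as a whole token.
// Whitespace (or string start/end) is used as the boundary instead of \b,
// because \b does not work around non-word keys such as '&'.
function normalizeSynonyms(text, table) {
  let result = text;
  for (const key of Object.keys(table)) {
    const pattern = new RegExp(`(^|\\s)${escapeRegExp(key)}(?=\\s|$)`, 'gi');
    result = result.replace(pattern, (match, lead) => lead + table[key]);
  }
  return result;
}

console.log(normalizeSynonyms('congrats u & ya friend', synonyms));
// congratulations you and you friend
```

Matching whole tokens matters here: a bare replace('u', 'you') would also rewrite the "u" inside "hurry". One regex pass per key is also the slow part, which is why something more optimized would be nice.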

flesler avatar Aug 03 '18 22:08 flesler

If this helps anyone get started: I pulled synonyms out of WordNet a while back. They are in WordNet format, which I don't particularly like. The Elasticsearch docs have a good discussion of synonyms here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html

https://github.com/buildbreakdo/elasticsearch-wordnet-synonyms

I think Solr-style synonyms are much more legible:

Solr:

"synonyms" : [
  "i-pod, i pod => ipod",
  "universe, cosmos"
]

WordNet:

"synonyms" : [
  "s(100000001,1,'abstain',v,1,0).",
  "s(100000001,2,'refrain',v,1,0).",
  "s(100000001,3,'desist',v,1,0)."
]
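
If you only have the WordNet-style lines, a rough sketch of converting them into the Solr-style comma-separated form could look like this. `wordnetToSolr` is a hypothetical helper, not part of compromise or Elasticsearch. It assumes each line is an s(synsetId, wordNumber, 'word', ...) fact like the three above, and that a literal apostrophe inside a word is written as two apostrophes.

```javascript
// The three sample lines from the comment above (WordNet "s" facts).
const wordnetLines = [
  "s(100000001,1,'abstain',v,1,0).",
  "s(100000001,2,'refrain',v,1,0).",
  "s(100000001,3,'desist',v,1,0).",
];

// Matches s(synsetId, wordNumber, 'word', ...).
// Assumption: an apostrophe inside the quoted word is doubled ('').
const FACT = /^s\((\d+),\d+,'((?:[^']|'')*)',/;

// Hypothetical helper: group words by synset id and emit one
// comma-separated line per synset, as in the Solr example.
function wordnetToSolr(lines) {
  const groups = new Map();
  for (const line of lines) {
    const match = FACT.exec(line.trim());
    if (!match) continue; // skip anything that is not an s(...) fact
    const [, synsetId, rawWord] = match;
    const word = rawWord.replace(/''/g, "'");
    if (!groups.has(synsetId)) groups.set(synsetId, []);
    groups.get(synsetId).push(word);
  }
  return [...groups.values()].map((words) => words.join(', '));
}

console.log(wordnetToSolr(wordnetLines));
// [ 'abstain, refrain, desist' ]
```

This drops the part-of-speech field (the v in each line), so it is only a legibility conversion.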

You could skip all that and do something clever with getters, like:

const synonyms = {
  "ipod": ["ipod", "i pod", "i-pod"],
  get "i pod"() { return this["ipod"] },
  get "i-pod"() { return this["ipod"] },
  // ...
}

Every word variant maps to the root word which holds all variants including itself. Used on text like cool i-pod:

"cool i-pod".split(' ').map(word => synonyms[word] || word) would output:

["cool", ["ipod", "i pod", "i-pod"]]
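
The getters don't have to be written by hand. Here is a small sketch that builds the same variant-to-root lookup from a list of groups. `buildLookup` is a hypothetical helper, and treating the first entry of each group as the root is my assumption.

```javascript
// Each group lists the root word first, followed by its variants.
const groups = [
  ['ipod', 'i pod', 'i-pod'],
  ['universe', 'cosmos'],
];

// Hypothetical helper: point every variant (and the root) at its group.
function buildLookup(synonymGroups) {
  const lookup = new Map();
  for (const group of synonymGroups) {
    for (const variant of group) lookup.set(variant, group);
  }
  return lookup;
}

const lookup = buildLookup(groups);
const expanded = 'cool i-pod'.split(' ').map((word) => lookup.get(word) || word);
console.log(expanded);
// [ 'cool', [ 'ipod', 'i pod', 'i-pod' ] ]
```

A Map is used so that ordinary words like "constructor" don't accidentally hit inherited object properties. Note that splitting on spaces means a two-word variant such as "i pod" is never looked up as one unit, so multi-word synonyms would need some kind of phrase matching on top of this.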

Back-of-the-napkin architecture here. :) The issue I see with this, though (at least client side), is that the synonyms file is 8 MB. It'd be cool if this lived in the compromise repo as a separate package that you can include and pass to compromise.

buildbreakdo avatar Aug 07 '18 10:08 buildbreakdo

yeah! very cool @buildbreakdo. Another thing that would be cool about using WordNet is that you can ensure the part of speech matches on the term before making a swap. That will prevent errors like "when i'm really bored, i pod the .." 😕

because compromise can reliably conjugate verbs to infinitive, and swap plurals back to singular, a synonym switcher could do this:

nlp('i walked ecstatically').replace({ecstatic:'happy'}).out()
//i walked happily

you know? I haven't done this for perf reasons, but you could imagine building a clever method to do this - happy to help

spencermountain avatar Aug 07 '18 16:08 spencermountain

If it could do that, it'd be INCREDIBLY cool. Imagine something like nlp('...').paraphrase(). 💯

flesler avatar Aug 07 '18 16:08 flesler

👍

owendall avatar Sep 04 '18 19:09 owendall

@spencermountain Not sure how best to help do what you mentioned above...

[image]

Obviously not implemented. :-)

owendall avatar Sep 04 '18 19:09 owendall

hey owen, it would involve looping through each word and conjugating all the verbs to infinitive, and all the plural nouns to singular.

If you save that string on each term, the replace method could just loop around and look at that string.

I don't wanna do that at tag-time. It would make everything slow.

spencermountain avatar Sep 04 '18 19:09 spencermountain

but it would be a wicked plugin: one method, doc.cache(), to create this 'super-normalized' word, and another method to doc.replace({ecstatic:'happy'})
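
To make the two-step idea concrete, here is a rough sketch in plain JavaScript. It is not compromise's API: `cacheRoots`, `replaceByRoot`, and the `ROOTS` table are hypothetical stand-ins, with the table playing the role of the verb-to-infinitive and plural-to-singular conversion described above.

```javascript
// Hand-written stand-in for real lemmatization (hypothetical data).
// In the plugin described above, this is where verbs would be conjugated
// to the infinitive and plural nouns swapped to singular.
const ROOTS = new Map([
  ['walked', 'walk'],
  ['dogs', 'dog'],
  ['strolled', 'stroll'],
]);

// Step 1: compute the 'super-normalized' form once and store it on each term.
function cacheRoots(text) {
  return text.split(/\s+/).filter(Boolean).map((word) => {
    const lower = word.toLowerCase();
    return { text: word, root: ROOTS.get(lower) || lower };
  });
}

// Step 2: replace by looking only at the cached root string.
// The replacement is inserted as-is; re-inflecting it is left out.
function replaceByRoot(terms, swaps) {
  const has = (key) => Object.prototype.hasOwnProperty.call(swaps, key);
  return terms.map((term) => (has(term.root) ? swaps[term.root] : term.text)).join(' ');
}

const terms = cacheRoots('the dogs strolled home');
console.log(replaceByRoot(terms, { stroll: 'walk', dog: 'cat' }));
// the cat walk home
```

The output shows what is still missing: the swap lands in root form, so getting "the cats walked home" (or "happily" from "ecstatically") would need a step that re-applies the original inflection. Because the roots are only computed when cacheRoots is called, nothing extra happens at tag-time.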

spencermountain avatar Sep 04 '18 19:09 spencermountain

I would not call "bc" a synonym for "because", but more a slang version, or a contracted form. Just my 2 cents.

giorgio79 avatar Oct 04 '18 13:10 giorgio79