gramex-nlg

Better support for detecting inflections

Open jaidevd opened this issue 5 years ago • 0 comments

Currently, an inflection is defined as one of the five following modifications to a word:

  • uppercase
  • lowercase
  • capitalization
  • singularization
  • pluralization.

The detection mechanism compares the lemmas of two words and tries to find which of the above modifications, applied to the first argument, converts it into the second. It detects only one inflection at a time, although more than one may apply. For example:

# Say `df` is the actors dataset
# (https://github.com/gramener/gramex-nlg/blob/dev/nlg/tests/data/actors.csv)
from nlg.utils import load_spacy_model
from nlg.grammar import _token_inflections

nlp = load_spacy_model()
text = "James Stewart is the highest rated actor."
doc = nlp(text)
X = nlp('Actors')[0]  # This is df['category'].iloc[0]
Y = doc[-2]  # The token "actor"; doc[-1] is the trailing "."
print(_token_inflections(X, Y))
# <function singular>

whereas it should report that the inflection is both a singularization and a lowercasing.

Also, the API is inconsistent. The first three modifications are Python string methods, but the other two are NLG functions. So the detector may return either a callable or a string representing a string method :man_facepalming:
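As a rough illustration of what a consistent return type could look like, here is a minimal sketch that always returns a list of callables and can report two inflections at once. It is not the gramex-nlg implementation: `singular`, `plural` and `detect_inflections` are hypothetical stand-ins written for this example, they work on plain strings rather than spaCy tokens, and the naive suffix rules ignore irregular nouns.

```python
from itertools import product


# Hypothetical stand-ins for the NLG singular/plural helpers.
# They only strip or append a trailing "s".
def singular(word):
    return word[:-1] if word.lower().endswith("s") else word


def plural(word):
    return word if word.lower().endswith("s") else word + "s"


# Every candidate is a callable, so the return type is uniform:
# string methods are referenced as str.upper etc., not by name.
CASE_INFLECTIONS = [str.upper, str.lower, str.capitalize]
NUMBER_INFLECTIONS = [singular, plural]


def detect_inflections(source, target):
    """Return the list of callables that turn `source` into `target`."""
    if source == target:
        return []
    # Try a single inflection first.
    for func in CASE_INFLECTIONS + NUMBER_INFLECTIONS:
        if func(source) == target:
            return [func]
    # Then try one number inflection followed by one case inflection.
    for number, case in product(NUMBER_INFLECTIONS, CASE_INFLECTIONS):
        if case(number(source)) == target:
            return [number, case]
    return []


found = detect_inflections("Actors", "actor")
print([f.__name__ for f in found])  # ['singular', 'lower']
```

Because the result is always a list of callables, a caller can replay the inflections with a simple loop instead of checking whether each item is a string or a function.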

ToDo

  • [x] Detect multiple inflections (for the five above, we can at best detect two inflections: one of first three, one of last two)
  • [x] See if order matters (for the five above, ~~it should not~~) - it does
  • [ ] Support more inflections, like PoS tag changes
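One way in which order can matter, shown with a toy example. This assumes a hypothetical `plural` helper that appends a lowercase "s"; the issue does not say why order matters in gramex-nlg, so treat this only as a plausible illustration.

```python
def plural(word):
    # Hypothetical naive pluralizer: always appends a lowercase "s".
    return word + "s"


word = "actor"

# Pluralize first, then uppercase the whole word.
plural_then_upper = plural(word).upper()

# Uppercase first, then pluralize: the suffix stays lowercase.
upper_then_plural = plural(word.upper())

print(plural_then_upper)  # ACTORS
print(upper_then_plural)  # ACTORs
```

Under this assumption the two compositions give different strings, so a detector that reports two inflections would also need to report the order in which to apply them.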

jaidevd, Jan 08 '20 05:01