
SparseSpacyFeaturizer

Open koaning opened this issue 5 years ago • 2 comments

spaCy attaches many attributes to each token, and some of them could be useful features in machine learning pipelines. To name a few:

  • is_oov: is the token out of vocabulary, i.e. does it lack a vector?
  • is_stop: is the token a stopword?
  • lemma_: what is the lemma of the token?
  • pos/tag: coarse/fine-grained part-of-speech information
  • morphological features
  • grammatical dependency

All of these have a discrete representation, so they could be added to a Rasa pipeline as sparse features.
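
As a rough sketch of what such a discrete representation could look like, the snippet below one-hot encodes a few token attributes into sparse (row, column) positions. The `Tok` stand-in and the `sparse_token_features` helper are hypothetical names used only for illustration; a real spaCy `Token` exposes attributes with the same names (`is_oov`, `is_stop`, `lemma_`, `pos_`, `dep_`). This is not how Rasa or this repository implements featurizers.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token; only the attribute names matter here.
Tok = namedtuple("Tok", ["text", "is_oov", "is_stop", "lemma_", "pos_", "dep_"])

# Attributes to encode; each distinct "attr=value" pair gets its own column.
ATTRS = ("is_oov", "is_stop", "lemma_", "pos_", "dep_")


def sparse_token_features(tokens, vocab=None):
    """One-hot encode token attributes.

    Returns (entries, vocab), where entries is a list of (row, col)
    positions that hold a 1 and vocab maps "attr=value" strings to columns.
    """
    vocab = {} if vocab is None else vocab
    entries = []
    for row, tok in enumerate(tokens):
        for attr in ATTRS:
            key = f"{attr}={getattr(tok, attr)}"
            # Assign the next free column the first time a value is seen.
            col = vocab.setdefault(key, len(vocab))
            entries.append((row, col))
    return entries, vocab


tokens = [
    Tok("the", False, True, "the", "DET", "det"),
    Tok("cats", False, False, "cat", "NOUN", "nsubj"),
]
entries, vocab = sparse_token_features(tokens)
```

The (row, col) pairs could then be handed to a sparse matrix constructor such as `scipy.sparse.coo_matrix`. In a real component the vocabulary would presumably be fixed after training so that unseen values do not add columns at inference time.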

koaning, Sep 02 '20 07:09

It's probably best to wait until spaCy 3.0 before adding this one.

koaning, Oct 21 '20 14:10

We might also just start with is_oov, is_stop and is_numeric.
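
A minimal sketch of that smaller starting point, under a couple of assumptions: as far as I know spaCy has no `is_numeric` token attribute, so `like_num` is used here as the closest match, and the `flag_features` helper plus the `SimpleNamespace` tokens are hypothetical stand-ins rather than anything from this repository.

```python
from types import SimpleNamespace

# Assumption: "is_numeric" is mapped to spaCy's `like_num` token attribute.
FLAGS = ("is_oov", "is_stop", "like_num")


def flag_features(tokens):
    """Return one row per token holding the column indices of the flags that are set."""
    return [
        [col for col, flag in enumerate(FLAGS) if getattr(tok, flag)]
        for tok in tokens
    ]


# Hypothetical stand-ins for spaCy tokens.
tokens = [
    SimpleNamespace(text="the", is_oov=False, is_stop=True, like_num=False),
    SimpleNamespace(text="42", is_oov=True, is_stop=False, like_num=True),
]
rows = flag_features(tokens)
```

With only three boolean columns the feature matrix has a fixed width, so no vocabulary needs to be learned during training.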

koaning, Jan 21 '21 09:01