Try weighting the importance of target classes
The annotated entities in a given corpus roughly follow a Zipfian distribution. This means that some entities are repeated many times (e.g. Human, Mouse, p53, glucose), while most entities appear only a handful of times.
During training, the model therefore sees many examples of some entities and very few of others. It would be useful to weight the cost of a wrong prediction on these rare entities more heavily, so that the model "pays more attention" to them.
Keras provides a convenient way to do this (see the `class_weight` argument to the `fit()` function). The only challenge is coming up with the weighting scheme!
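
Keras accepts a dict mapping each class index to a weight. A minimal sketch of the plumbing, assuming a compiled `model` and training arrays `X_train`/`y_train` (all hypothetical names here), with placeholder weights rather than a tuned scheme:

```python
# Hypothetical weighting: mistakes on rare classes cost more.
class_weight = {
    0: 1.0,   # frequent class, e.g. the "O" (outside) tag
    1: 5.0,   # rarer entity class: a wrong prediction costs 5x more
    2: 10.0,  # very rare entity class
}

# Keras scales each example's contribution to the loss by its class's
# weight, nudging the model to "pay more attention" to rare classes.
model.fit(X_train, y_train, epochs=10, class_weight=class_weight)
```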
todo
- [ ] Try a super simple strategy, like weighting words according to their inverse relative frequency (see the sketch after this list).
- [ ] Does that improve the model's recall (I think it will) and hurt its precision (I think it will)?
- [ ] Does F1 get a boost overall (I think it will)?
- [ ] If this looks promising, swap inverse relative frequency for TF-IDF or SoCal.
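
For the first item, one reading of "inverse relative frequency" is to weight each tag class by `total_count / class_count`, computed over the training labels. A sketch under that assumption, where `y_train` is again a hypothetical array of integer-encoded tag labels:

```python
from collections import Counter

import numpy as np

def inverse_frequency_weights(y_train):
    """Weight each class by (total label count / its own count),
    so rare classes receive proportionally larger weights."""
    counts = Counter(np.asarray(y_train).ravel().tolist())
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}

# Feed the result straight into Keras:
# model.fit(X_train, y_train, class_weight=inverse_frequency_weights(y_train))
```

A common variant normalizes the weights so the most frequent class keeps weight 1.0, which keeps the effective learning rate comparable to the unweighted baseline.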