snips-nlu
[Slot Filling] Improve data augmentation to make sure possible tag transitions are well represented
Problem description
Short description
Under certain conditions, some CRF tag transitions can be missing or underrepresented after data augmentation. We must ensure that all possible tag transitions appear in the augmented dataset so that inference does not fail systematically on those examples.
Example
Given a dataset with 1 intent and 3 slots: slot_1, slot_2, slot_3
If only 5% of the utterances in the dataset follow the pattern bla bla [slot_1] [slot_2] bla bla, and slot_1 has only 5% of length-1 entity values and 95% of length-2 entity values, then after augmentation the probability of getting the pattern B-slot-1 B-slot-2 in a training utterance becomes 0.05 × 0.05 = 0.0025, so the pattern will probably be missing from your training data.
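To make the arithmetic concrete, here is a minimal sketch of the computation. It assumes the utterance pattern and the entity value are sampled independently during augmentation, and the dataset size of 200 utterances is a hypothetical figure chosen only for illustration.

```python
# Proportions taken from the example above.
P_PATTERN = 0.05  # utterances where slot_1 is immediately followed by slot_2
P_LEN_1 = 0.05    # slot_1 entity values that are a single token

# Probability that one augmented utterance contains "B-slot-1 B-slot-2",
# assuming pattern and entity value are drawn independently (an assumption).
p_transition = P_PATTERN * P_LEN_1


def prob_transition_missing(n_utterances: int) -> float:
    """Probability that none of n augmented utterances contains the transition."""
    return (1 - p_transition) ** n_utterances


# Hypothetical augmented dataset size, for illustration only.
p_missing = prob_transition_missing(200)
print(f"per-utterance probability: {p_transition:.4f}")
print(f"probability the transition is absent from 200 utterances: {p_missing:.2f}")
```

Under these assumptions, the transition is more likely to be absent than present in a 200-utterance augmented set.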
If slot_1 has the value word_1 and slot_2 has the value word_2 word_3, and the CRF sees "word_1 word_2 word_3", then it will tag it as "B-slot-1 I-slot-1 B-slot-2" instead of "B-slot-1 B-slot-2 I-slot-2", because it has never seen the B-slot-1 B-slot-2 transition in the training data.
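For reference, the expected tagging can be derived mechanically from the slot chunks. The helper below is a hypothetical illustration of BIO tagging with whitespace tokenization, not the snips-nlu implementation.

```python
def bio_tags(chunks):
    """Build BIO tags from (text, slot_name) chunks; slot_name is None for plain text.

    Hypothetical helper: tokens are obtained by splitting on whitespace.
    """
    tags = []
    for text, slot in chunks:
        tokens = text.split()
        if slot is None:
            tags.extend("O" for _ in tokens)
        else:
            # First token of a slot gets B-, the remaining tokens get I-.
            tags.append(f"B-{slot}")
            tags.extend(f"I-{slot}" for _ in tokens[1:])
    return tags


expected = bio_tags([("word_1", "slot-1"), ("word_2 word_3", "slot-2")])
print(" ".join(expected))
```

The correct sequence contains the B-slot-1 → B-slot-2 transition, which is exactly the one the CRF has never observed.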
Now suppose that, unluckily, people use the length-1 value of slot_1 95% of the time. The CRF will then fail systematically in 95% × 5% = 4.75% of cases, which is pretty high.
Potential solutions
- Make sure that all possible tag transitions are in the augmented dataset
- Boost the proportion of rare tag transitions (this might have a negative impact on performance, since the CRF transition weights might be affected :s)
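The first option can be sketched as a coverage check run after augmentation. This is only an illustration under stated assumptions: the function names are hypothetical, tags follow the BIO scheme used in the example, and only transitions between two adjacent slots are considered.

```python
from collections import Counter
from itertools import chain


def count_transitions(tag_sequences):
    """Count every (previous_tag, next_tag) pair in the tagged utterances."""
    return Counter(
        chain.from_iterable(zip(tags, tags[1:]) for tags in tag_sequences)
    )


def expected_slot_to_slot_transitions(first_slot, second_slot, first_slot_lengths):
    """Transitions possible when first_slot is immediately followed by second_slot.

    first_slot_lengths: set of token lengths found among first_slot's values.
    """
    expected = set()
    if 1 in first_slot_lengths:
        # A one-token value ends on its B- tag.
        expected.add((f"B-{first_slot}", f"B-{second_slot}"))
    if any(length > 1 for length in first_slot_lengths):
        # A longer value ends on an I- tag.
        expected.add((f"I-{first_slot}", f"B-{second_slot}"))
    return expected


def missing_transitions(expected, tag_sequences):
    """Expected transitions that never occur in the augmented tag sequences."""
    observed = count_transitions(tag_sequences)
    return {transition for transition in expected if observed[transition] == 0}


# Toy augmented dataset in which slot_1 was only ever sampled with 2 tokens.
augmented = [
    ["O", "O", "B-slot-1", "I-slot-1", "B-slot-2", "I-slot-2", "O", "O"],
    ["O", "B-slot-3", "O"],
]
expected = expected_slot_to_slot_transitions("slot-1", "slot-2", {1, 2})
missing = missing_transitions(expected, augmented)
print(missing)
```

Any transition reported as missing could then be covered by generating at least one extra utterance that contains it. The counts returned by count_transitions could also be used to spot underrepresented transitions for the second option.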