ingredient-phrase-tagger
BIO tagging/chunking bug
The first entry for the test set looks like this:
1 I1 L12 NoCAP NoPAREN B-QTY
boneless I2 L12 NoCAP NoPAREN I-COMMENT
pork I3 L12 NoCAP NoPAREN B-NAME
tenderloin I4 L12 NoCAP NoPAREN I-NAME
, I5 L12 NoCAP NoPAREN B-COMMENT
about I6 L12 NoCAP NoPAREN I-COMMENT
1 I7 L12 NoCAP NoPAREN B-QTY
pound I8 L12 NoCAP NoPAREN I-COMMENT
The corresponding CSV entry is:
20000,"1 boneless pork tenderloin, about 1 pound",pork tenderloin,1.0,0.0,,"boneless, about 1 pound"
The second token should be labelled "B-COMMENT" because there's no comment token preceding it.
The issue is with addPrefixes and bestTag. addPrefixes determines that '1' is both the QTY and part of the entry's comment, so it says the possible tags are ['B-COMMENT', 'B-QTY']. It then goes to the next token, determines that it is a COMMENT, and tags it as I-COMMENT because the previous token has B-COMMENT as a possible tag. bestTag then picks anything over a COMMENT, so it assigns B-QTY to the '1', and 'boneless' is left incorrectly tagged with I-COMMENT.
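To make the failure concrete, here is a simplified reconstruction of the two-pass behaviour described above. It is not the repository's code: add_prefixes_two_pass and best_tag are hypothetical stand-ins for addPrefixes and bestTag, and the candidate lists are assumed from the example entry.

```python
def add_prefixes_two_pass(candidates):
    """Prefix each candidate tag by looking at the previous token's
    *candidate* tags, before any final tag has been chosen."""
    prefixed = []
    prev = []
    for tags in candidates:
        # A tag continues a chunk ("I-") if the previous token could have had it.
        prefixed.append([("I-" if tag in prev else "B-") + tag for tag in tags])
        prev = tags
    return prefixed


def best_tag(tags):
    """Prefer any tag over COMMENT (assumed priority rule)."""
    non_comment = [t for t in tags if not t.endswith("COMMENT")]
    return (non_comment or tags)[0]


# Candidate tags for "1 boneless pork tenderloin" (assumed for illustration).
candidates = [["COMMENT", "QTY"], ["COMMENT"], ["NAME"], ["NAME"]]
labels = [best_tag(tags) for tags in add_prefixes_two_pass(candidates)]
print(labels)
```

Because the prefix for 'boneless' is fixed before the tag for '1' is resolved, the sketch ends up with I-COMMENT following B-QTY, which matches the test-set entry shown earlier.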
Essentially, I think addPrefixes and bestTag should be combined into a single function, since BIO chunking really needs to know what the previous tag is actually going to be.
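A minimal sketch of what such a combined function might look like, assuming each token arrives with a non-empty list of bare candidate tags and that the priority rule is simply "anything over COMMENT". The name tag_with_bio and the input shape are hypothetical, not taken from the repository.

```python
def tag_with_bio(candidates):
    """Choose the final tag for each token first, then add the BIO prefix
    based on the tag actually chosen for the previous token."""
    labels = []
    prev = None  # final (unprefixed) tag of the previous token
    for tags in candidates:
        # Assumed priority rule: prefer any tag over COMMENT.
        non_comment = [t for t in tags if t != "COMMENT"]
        chosen = (non_comment or tags)[0]
        # "I-" only when this token continues the previous token's chunk.
        prefix = "I-" if chosen == prev else "B-"
        labels.append(prefix + chosen)
        prev = chosen
    return labels


# Candidate tags for "1 boneless pork tenderloin" (assumed for illustration).
print(tag_with_bio([["COMMENT", "QTY"], ["COMMENT"], ["NAME"], ["NAME"]]))
```

With the prefix decided after the tag is resolved, 'boneless' comes out as B-COMMENT, since the token before it was finally tagged QTY rather than COMMENT.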
Additionally, it may be reasonable to label the second '1' as COMMENT when the first instance of '1' is already labelled QTY, but that is a separate issue from the BIO chunking.