Inconsistent handling of whitespace tokens in Scorer.score_token_attr
I've been training a tagger and parser on heavily augmented data and was surprised by the poor performance (compared to what I calculated manually on the test dataset). I narrowed it down to `Scorer.score_token_attr`.
How to reproduce the behaviour
Sorry, I don't have a code example, but it's pretty straightforward. In this function, all tokens in the gold dataset (except those with a missing attribute) are included in the evaluation, while whitespace tokens are excluded from the predicted dataset. If the gold dataset contains whitespace tokens (as mine does), we aren't comparing apples to apples, and the error rate is inflated.
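To make the mismatch concrete, here is a toy illustration (plain Python, not the spaCy API; the tags and indices are made up). The gold side keeps a whitespace token, the predicted side drops it, so a perfectly tagged document still loses recall:

```python
# Gold side keeps the whitespace token; predicted side filters it out,
# mimicking the asymmetry in Scorer.score_token_attr.
gold = [(0, "NOUN"), (1, "_SP"), (2, "VERB")]   # gold keeps the space token
pred = [(0, "NOUN"), (2, "VERB")]               # space token excluded

gold_tags = set(gold)
pred_tags = set(pred)

tp = len(gold_tags & pred_tags)   # 2 true positives
fp = len(pred_tags - gold_tags)   # 0 false positives
fn = len(gold_tags - pred_tags)   # 1 spurious "miss": the space token
recall = tp / (tp + fn)
print(recall)  # 0.666... instead of 1.0
```

Filtering whitespace on both sides (or on neither) would give the expected recall of 1.0 here.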
I just created my own scorer for now, but this behavior is rather unexpected.
Let me know if you'd like me to change the behavior, and I'll open a PR. My own fix was to add an `exclude_spaces` parameter, defaulting to the current behavior (i.e. `True`), and to either include or exclude whitespace tokens in both datasets.
```python
# line 240 of scorer.py
for gold_i, token in enumerate(gold_doc):
    value = getter(token, attr)
    if value not in missing_values:
        gold_tags.add((gold_i, getter(token, attr)))
    else:
        missing_indices.add(gold_i)
pred_tags = set()
for token in pred_doc:
    if token.orth_.isspace():  # HERE: excluding whitespace tokens
        continue
    if align.x2y.lengths[token.i] == 1:
        gold_i = align.x2y[token.i][0]
```
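A minimal sketch of what the proposed fix would look like, on a simplified, self-contained version of the loop (the function, the `(text, tag)` input format, and the assumption of pre-aligned 1:1 tokens are all illustrative, not the actual spaCy implementation):

```python
def score_token_attr_sketch(gold_tokens, pred_tokens, exclude_spaces=True):
    """Toy scoring loop: gold_tokens and pred_tokens are lists of
    (text, tag) pairs assumed to be aligned 1:1 by index. The key point
    is that the whitespace filter is applied to BOTH sides."""
    def keep(text):
        return not (exclude_spaces and text.isspace())

    gold_tags = {(i, tag) for i, (text, tag) in enumerate(gold_tokens) if keep(text)}
    pred_tags = {(i, tag) for i, (text, tag) in enumerate(pred_tokens) if keep(text)}

    tp = len(gold_tags & pred_tags)
    fp = len(pred_tags - gold_tags)
    fn = len(gold_tags - pred_tags)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

tokens_gold = [("The", "DET"), (" ", "_SP"), ("dog", "NOUN")]
tokens_pred = [("The", "DET"), (" ", "_SP"), ("dog", "NOUN")]
print(score_token_attr_sketch(tokens_gold, tokens_pred))  # (1.0, 1.0)
```

With the same filter on both sides, a correctly tagged document scores 1.0 regardless of whether `exclude_spaces` is `True` or `False`; the asymmetric version penalizes gold whitespace tokens that the predicted side never gets a chance to match.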
Thanks, I'll look at this!