Inconsistent handling of whitespace tokens in Scorer.score_token_attr
I've been training a tagger and parser on heavily augmented data and was surprised by the poor performance (compared to what I calculated manually on the test dataset). I narrowed it down to `Scorer.score_token_attr`.
How to reproduce the behaviour
Sorry, I don't have a code example, but it's pretty straightforward. In this function, all tokens in the gold dataset (except those with a missing attribute) are included in the evaluation, while whitespace tokens are excluded from the predicted dataset. If the gold dataset contains whitespace tokens (as mine does), we aren't comparing apples to apples, and the error rate is inflated.
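To make the mismatch concrete, here is a toy illustration (plain Python, not the spaCy API; the tags and indices are made up). The gold side keeps a whitespace token, the predicted side drops it, so a perfectly tagged document still loses recall:

```python
# Gold side keeps the whitespace token; predicted side filters it out,
# mimicking the asymmetry in Scorer.score_token_attr.
gold = [(0, "NOUN"), (1, "_SP"), (2, "VERB")]   # gold keeps the space token
pred = [(0, "NOUN"), (2, "VERB")]               # space token excluded

gold_tags = set(gold)
pred_tags = set(pred)

tp = len(gold_tags & pred_tags)   # 2 true positives
fp = len(pred_tags - gold_tags)   # 0 false positives
fn = len(gold_tags - pred_tags)   # 1 spurious "miss": the space token
recall = tp / (tp + fn)
print(recall)  # 0.666... instead of 1.0
```

Filtering whitespace on both sides (or on neither) would give the expected recall of 1.0 here.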
I just created my own scorer for now, but this behavior is rather unexpected.
Let me know if you'd like me to change the behavior, and I'll open a PR. My own fix was to add an `exclude_spaces` parameter, defaulting to the current behavior (i.e. `True`), and to either include or exclude whitespace tokens in both datasets.
```python
# line 240 of scorer.py
for gold_i, token in enumerate(gold_doc):
    value = getter(token, attr)
    if value not in missing_values:
        gold_tags.add((gold_i, getter(token, attr)))
    else:
        missing_indices.add(gold_i)
pred_tags = set()
for token in pred_doc:
    if token.orth_.isspace():  # HERE: excluding whitespace tokens
        continue
    if align.x2y.lengths[token.i] == 1:
        gold_i = align.x2y[token.i][0]
```
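A minimal sketch of what the proposed fix would look like, on a simplified, self-contained version of the loop (the function, the `(text, tag)` input format, and the assumption of pre-aligned 1:1 tokens are all illustrative, not the actual spaCy implementation):

```python
def score_token_attr_sketch(gold_tokens, pred_tokens, exclude_spaces=True):
    """Toy scoring loop: gold_tokens and pred_tokens are lists of
    (text, tag) pairs assumed to be aligned 1:1 by index. The key point
    is that the whitespace filter is applied to BOTH sides."""
    def keep(text):
        return not (exclude_spaces and text.isspace())

    gold_tags = {(i, tag) for i, (text, tag) in enumerate(gold_tokens) if keep(text)}
    pred_tags = {(i, tag) for i, (text, tag) in enumerate(pred_tokens) if keep(text)}

    tp = len(gold_tags & pred_tags)
    fp = len(pred_tags - gold_tags)
    fn = len(gold_tags - pred_tags)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

tokens_gold = [("The", "DET"), (" ", "_SP"), ("dog", "NOUN")]
tokens_pred = [("The", "DET"), (" ", "_SP"), ("dog", "NOUN")]
print(score_token_attr_sketch(tokens_gold, tokens_pred))  # (1.0, 1.0)
```

With the same filter on both sides, a correctly tagged document scores 1.0 regardless of whether `exclude_spaces` is `True` or `False`; the asymmetric version penalizes gold whitespace tokens that the predicted side never gets a chance to match.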
Thanks, I'll look at this!