[Token Classification] Need to update tokenize_and_align_labels function

Open JBAujogue opened this issue 3 years ago • 0 comments

The version of the tokenize_and_align_labels function in the notebook is behind the one used in the run_ner.py script located at

https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py (1)

Current behavior

Calling this function on an input of the form

words = [word1, word2]
labels = [B-org, I-org]

returns

words = [word1_token1, word1_token2, word2_token1, word2_token2]
labels = [B-org, B-org, I-org, I-org]

Expected behavior

The result should look like

words = [word1_token1, word1_token2, word2_token1, word2_token2]
labels = [B-org, I-org, I-org, I-org]

The BIO label corresponding to the token word1_token2 should be I-org instead of B-org.

Solution

Update the notebook with the construction of the b_to_i_label dictionary and the tokenize_and_align_labels function provided in (1).
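For illustration, here is a minimal sketch of the idea, not the exact code from (1). To stay self-contained it skips the tokenizer and takes the word index of each token directly, in the form a fast tokenizer's `word_ids()` would return. The helper name `align_labels`, the toy `label_list`, and the choice to build `b_to_i_label` as a list indexed by label id are assumptions made for this example.

```python
# Toy label set (assumption for this example).
label_list = ["O", "B-org", "I-org"]
label_to_id = {label: i for i, label in enumerate(label_list)}

# Map each B-xxx label id to the id of its I-xxx counterpart.
# Ids of all other labels map to themselves.
b_to_i_label = []
for idx, label in enumerate(label_list):
    i_label = "I-" + label[2:]
    if label.startswith("B-") and i_label in label_list:
        b_to_i_label.append(label_list.index(i_label))
    else:
        b_to_i_label.append(idx)


def align_labels(word_ids, word_labels):
    """Hypothetical helper: give every sub-token a label id.

    word_ids holds, for each token, the index of the word it came from,
    or None for special tokens.
    """
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens are ignored by the loss.
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            # First token of a word keeps the word's own label.
            label_ids.append(label_to_id[word_labels[word_idx]])
        else:
            # Later tokens of the same word get the I- variant.
            label_ids.append(b_to_i_label[label_to_id[word_labels[word_idx]]])
        previous_word_idx = word_idx
    return label_ids


# Two words, each split into two tokens, wrapped in special tokens.
word_ids = [None, 0, 0, 1, 1, None]
aligned = align_labels(word_ids, ["B-org", "I-org"])
print([label_list[i] if i != -100 else None for i in aligned])
# [None, 'B-org', 'I-org', 'I-org', 'I-org', None]
```

With this mapping, the second token of word1 is labeled I-org rather than B-org, which matches the expected behavior described above.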

JBAujogue, Oct 13 '22 18:10