
LayoutLMv3 Processor - subword does not get assigned -100 with unusual words


System Info

  • transformers version: 4.23.1
  • Platform: Linux-5.4.0-1060-aws-x86_64-with-glibc2.10
  • Python version: 3.8.12
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.10.1+cu113 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@NielsRogge

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = (np.random.rand(100, 100, 3) * 255).astype(np.uint8)  # dummy image
words = ['pencil', '0000000000000000', 'phone']
boxes = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]
word_labels = [0, 0, 0]

encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

print(encoding['input_ids'])
print(processor.tokenizer.convert_ids_to_tokens(encoding['input_ids'].flatten()))
print(encoding['labels'])

# Output:
# tensor([[    0, 21451,  1437, 49393,  1028,     2]])
# ['<s>', 'Ġpencil', 'Ġ', '0000000000000000', 'Ġphone', '</s>']
# tensor([[-100,    0,    0,    0,    0, -100]])

Expected behavior

Since we are passing only 3 words (words = ['pencil', '0000000000000000', 'phone']), I expect encoding['labels'] to contain only 3 labels different from -100, i.e. (encoding['labels'] != -100).sum() == 3. However, the output is tensor([[-100, 0, 0, 0, 0, -100]]), which contains 4 labels different from -100, so there is a mismatch between the input words and the labels after processing. The same thing happens with the word '**********' and probably with other unusual "words".
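
For reference, here is the check I would expect to pass, reusing encoding and words from the snippet above:

# Expected: exactly one label different from -100 per input word.
# With the current behaviour this assertion fails (4 labels instead of 3).
assert (encoding["labels"] != -100).sum().item() == len(words)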

a-ozbek avatar Oct 31 '22 11:10 a-ozbek

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 30 '22 15:11 github-actions[bot]

@NielsRogge @sgugger Could this be a problem that affects other users as well, or am I doing something wrong? (word_ids() works fine in this case, by the way.)
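
As a workaround sketch (not an official fix), the labels can be rebuilt from word_ids(), reusing encoding, words and word_labels from the reproduction above:

# Label only the first token of each word; mask special tokens and
# continuation tokens with -100.
word_ids = encoding.word_ids(batch_index=0)
labels, previous = [], None
for word_id in word_ids:
    if word_id is None or word_id == previous:
        labels.append(-100)
    else:
        labels.append(word_labels[word_id])
    previous = word_id
# labels now contains exactly len(words) entries different from -100.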

a-ozbek avatar Dec 01 '22 19:12 a-ozbek

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 26 '22 15:12 github-actions[bot]

I've seen other people reporting wrong behaviour with unusual characters as well.

The logic to go from word-level labels to token-level labels is here; it might be worth looking at it in more depth.

I'll mark this issue as a good first issue, since I currently don't have the bandwidth to look into it.

NielsRogge avatar Jan 04 '23 18:01 NielsRogge

The problem appears to be that for certain words (like "0000000000000000"), the first word piece is the single character 'Ġ', which is not counted as part of the word. As a result, the offset of the following word piece is 0, so both pieces are treated as word starts and receive a label. The issue apparently originates in encode_batch and, from there, in encode_char_offsets (which is in Rust).
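
A quick way to see this (a sketch that queries the tokenizer directly, using the same words and boxes as the reproduction):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlmv3-base")
words = ["pencil", "0000000000000000", "phone"]
boxes = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]

enc = tokenizer(words, boxes=boxes, return_offsets_mapping=True)
# The lone 'Ġ' piece emitted for "0000000000000000" gets offset (0, 0), so the
# next piece also starts at character offset 0 and is treated as a word start.
for token, offset in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), enc["offset_mapping"]):
    print(token, offset)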

This is my first attempt to contribute here, so I may be completely wrong... What can I do from here to help? @NielsRogge

RoyiRa avatar Jan 23 '23 08:01 RoyiRa

Hello, may I ask if there is anything left for me and my friends to contribute to this issue?

JuheonChu avatar Feb 10 '23 19:02 JuheonChu

The same problem arises with all BPE-based tokenizers. Example with LayoutXLM:

from transformers import LayoutXLMTokenizerFast

tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")

words = ["pencil", "0000000000000000", "phone"]
boxes = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]
word_labels = [1, 2, 3]

encoding = tokenizer(
    text=words, boxes=boxes, word_labels=word_labels, return_tensors="pt"
)

print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"].flatten()))
print(encoding["labels"])

# Output:
# tensor([[     0,   5551,  13003,      6,  28568, 197094, 197094,  24089,      2]])
# ['<s>', '▁pen', 'cil', '▁', '0000', '000000', '000000', '▁phone', '</s>']
# tensor([[-100,    1, -100,    2,    2, -100, -100,    3, -100]])

The main issue is that BPE can produce an "empty" token at the beginning of a word, with offset_mapping = (0, 0). This causes the following non-empty token (the continuation of the word) to have an offset_mapping of (0, X).

A dirty solution is to add a guard at the place @NielsRogge pointed to, skipping the label when the previous token was empty. The problem is that this needs to be done for all BPE-based tokenizers, and only checking whether the offset_mapping starts at 0 is not sufficient when an empty token exists.
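
To illustrate the guard idea, here is a rough post-processing sketch in Python (not the actual fix inside the tokenizers library; the helper name is made up, and it assumes the encoding was built with word_labels, return_offsets_mapping=True and no return_tensors):

def remask_after_empty_tokens(encoding, tokenizer):
    """Reset to -100 the label of any piece that follows an "empty" piece
    (offset start == end, e.g. a lone 'Ġ' or '▁'), since it continues the
    previous word rather than starting a new one."""
    labels = list(encoding["labels"])
    special = tokenizer.get_special_tokens_mask(
        encoding["input_ids"], already_has_special_tokens=True
    )
    previous_was_empty = False
    for i, (start, end) in enumerate(encoding["offset_mapping"]):
        if special[i]:
            previous_was_empty = False
            continue
        if previous_was_empty:
            labels[i] = -100  # continuation of the previous word, not a new word
        previous_was_empty = start == end
    return labels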

The other solution is to fix BPE in the Rust source (should it even be able to produce empty tokens?).

The problem is NOT present in the slow (non-fast) tokenizer provided by sentencepiece, because it operates at the word level instead of the token level.

thibaultdouzon avatar Feb 19 '23 17:02 thibaultdouzon

Hi! First time open sourcing! Is this still an issue? I can try to take a crack at it! @a-ozbek

pbaner16 avatar Dec 07 '23 21:12 pbaner16

Hi, thanks for replying. This issue has been fixed, so I'll close it. Feel free to take a look at the other good first issues.

NielsRogge avatar Dec 08 '23 07:12 NielsRogge