LayoutLMv3 Processor - subword does not get assigned -100 with unusual words
System Info
- `transformers` version: 4.23.1
- Platform: Linux-5.4.0-1060-aws-x86_64-with-glibc2.10
- Python version: 3.8.12
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.10.1+cu113 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@NielsRogge
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
import numpy as np
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
image = (np.random.rand(100, 100, 3) * 255).astype(np.uint8) # dummy image
words = ['pencil', '0000000000000000', 'phone']
boxes = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]
word_labels = [0, 0, 0]
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
print(encoding['input_ids'])
print(processor.tokenizer.convert_ids_to_tokens(encoding['input_ids'].flatten()))
print(encoding['labels'])
# Output:
# tensor([[ 0, 21451, 1437, 49393, 1028, 2]])
# ['<s>', 'Ġpencil', 'Ġ', '0000000000000000', 'Ġphone', '</s>']
# tensor([[-100, 0, 0, 0, 0, -100]])
Expected behavior
Since we are passing only 3 words (`words = ['pencil', '0000000000000000', 'phone']`), I expect `encoding['labels']` to contain only 3 non -100 labels, i.e. `(encoding['labels'] != -100).sum() == 3`.
However, the output is `tensor([[-100, 0, 0, 0, 0, -100]])`, which contains 4 non -100 labels, so there is a mismatch between the input words and the labels after processing. The same thing happens with the word `'**********'` and probably with other unusual "words".
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@NielsRogge @sgugger Could this be a problem that affects other users as well, or am I doing something wrong? (`word_ids()` works fine in this case, by the way.)
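Since `word_ids()` is reliable here, one workaround is to build the token-level labels manually instead of relying on the processor's alignment. A minimal sketch in plain Python, where the `word_ids` list is a hypothetical stand-in for what `encoding.word_ids()` would return for the example above:

```python
def align_labels(word_ids, word_labels):
    """Label only the first sub-token of each word; everything else gets -100."""
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special tokens (<s>, </s>, padding)
            labels.append(-100)
        elif wid != previous:      # first sub-token of a new word
            labels.append(word_labels[wid])
        else:                      # continuation sub-token
            labels.append(-100)
        previous = wid
    return labels

# Assumed word_ids for ['<s>', 'Ġpencil', 'Ġ', '0000000000000000', 'Ġphone', '</s>']
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_ids, [0, 0, 0]))
# -> [-100, 0, 0, -100, 0, -100]  (exactly 3 non -100 labels)
```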
I've seen other people reporting wrong behaviour with unusual characters as well.
The logic to go from word-level labels to token-level labels is here; it might be worth looking at this more in depth.
I'll mark this issue as a good first issue as I currently don't have the bandwidth to look into it.
The problem appears to be that for certain words (like "0000000000000000"), the first word piece is the character "Ġ", which is not counted as part of the word. As a result, the character offset of the following word piece starts at 0, causing both pieces to receive a label. The issue apparently originates in `encode_batch`, and from there in `encode_char_offsets` (which is in Rust).
This is my first attempt to contribute here, so I may be completely wrong...what can I do from here to help? @NielsRogge
Hello, may I ask you if there is anything left for me and my friends to contribute for this issue?
The same problem arises with all BPE-based tokenizers. Example with LayoutXLM:

from transformers import LayoutXLMTokenizerFast

# Note: apply_ocr is an argument of the processor/feature extractor,
# not the tokenizer, so it is omitted here.
tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")

words = ["pencil", "0000000000000000", "phone"]
boxes = [[1, 2, 3, 4], [10, 11, 12, 13], [20, 21, 22, 23]]
word_labels = [1, 2, 3]
encoding = tokenizer(
    text=words, boxes=boxes, word_labels=word_labels, return_tensors="pt"
)
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"].flatten()))
print(encoding["labels"])
# Output:
# tensor([[ 0, 5551, 13003, 6, 28568, 197094, 197094, 24089, 2]])
# ['<s>', '▁pen', 'cil', '▁', '0000', '000000', '000000', '▁phone', '</s>']
# tensor([[-100, 1, -100, 2, 2, -100, -100, 3, -100]])
The main issue is that BPE can produce an "empty" token at the beginning of a word with `offset_mapping = (0, 0)`, which leads to the following non-empty token (the continuation of the word) having an `offset_mapping = (0, X)`.
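The failure mode can be reproduced in miniature in plain Python. The offsets and word ids below are illustrative values shaped like what a fast tokenizer returns for the LayoutLMv3 example, and the heuristic "a token whose character offset starts at 0 is the first sub-token of a word" is a simplification of the library's alignment logic:

```python
def buggy_align(offsets, word_ids, word_labels):
    """Offset-based alignment as described above: any token whose
    character offset starts at 0 is treated as the start of a word."""
    labels = []
    for (start, _end), wid in zip(offsets, word_ids):
        if wid is None:
            labels.append(-100)              # special token
        elif start == 0:
            labels.append(word_labels[wid])  # "first" sub-token
        else:
            labels.append(-100)              # continuation
    return labels

# Tokens: <s>, 'Ġpencil', 'Ġ' (empty, offsets (0, 0)), '0000000000000000', 'Ġphone', </s>
offsets  = [(0, 0), (0, 6), (0, 0), (0, 16), (0, 5), (0, 0)]
word_ids = [None, 0, 1, 1, 2, None]
print(buggy_align(offsets, word_ids, [0, 0, 0]))
# -> [-100, 0, 0, 0, 0, -100]  (4 non -100 labels instead of 3)
```

Because the empty 'Ġ' token has offsets (0, 0), the continuation '0000000000000000' also starts at character 0 and is mislabeled as a word start.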
A dirty solution is to add a guard where @NielsRogge indicated, skipping a token when the previous token was empty. The problem is that this would need to be done for all BPE-based tokenizers. Only checking whether the `offset_mapping` starts with 0 is not sufficient when an empty token exists.
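A minimal sketch of such a guard, as an assumption about the shape of a fix rather than the actual patch: a token counts as a word start only if its offset starts at 0 and the previous token of the same word was not an empty (0, 0) token.

```python
def guarded_align(offsets, word_ids, word_labels):
    """Offset-based alignment with a guard for empty BPE tokens."""
    labels = []
    prev_wid = None
    prev_empty = False
    for (start, end), wid in zip(offsets, word_ids):
        if wid is None:
            labels.append(-100)              # special token
        elif start == 0 and not (wid == prev_wid and prev_empty):
            labels.append(word_labels[wid])  # genuine word start
        else:
            labels.append(-100)              # continuation
        prev_wid = wid
        prev_empty = (wid is not None and start == 0 and end == 0)
    return labels

offsets  = [(0, 0), (0, 6), (0, 0), (0, 16), (0, 5), (0, 0)]
word_ids = [None, 0, 1, 1, 2, None]
print(guarded_align(offsets, word_ids, [0, 0, 0]))
# -> [-100, 0, 0, -100, 0, -100]  (back to 3 non -100 labels)
```

Here the empty 'Ġ' token keeps the word's label, and the continuation that follows it is correctly masked with -100.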
The other solution is to fix BPE (should it even be able to produce empty tokens?) in the Rust source.
The problem is NOT present in the slow (non-fast) tokenizer provided by sentencepiece, because it operates at word level instead of token level.
Hi! First time open sourcing! Is this still an issue? I can try to take a crack at it! @a-ozbek
Hi, thanks for replying, this issue was fixed so I'll close it. Feel free to take a look at other good first issues.