
[LMV3 Bug] ValueError: You must provide corresponding bounding boxes when running the examples in run_funsd_cord.py


Describe the bug

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLMv3

The problem arises when using:

  • [ ] the official example scripts: (give details below)
  • [x] my own modified scripts: (give details below)


To Reproduce

Steps to reproduce the behavior:

  1. Follow the LayoutLMv3 guide.
  2. Run the following script:

        python ./examples/run_funsd_cord.py \
            --dataset_name funsd \
            --do_train --do_eval \
            --model_name_or_path microsoft/layoutlmv3-base \
            --output_dir ./models/layoutlmv3-base-finetuned-funsd-500 \
            --segment_level_layout 1 --visual_embed 1 --input_size 224 \
            --max_steps 500 --save_steps -20 --evaluation_strategy steps --eval_steps 20 \
            --learning_rate 1e-5 --gradient_accumulation_steps 1 \
            --load_best_model_at_end \
            --metric_for_best_model "eval_f1"

Expected behavior

A training session on FUNSD can be started.

But I got: ValueError: You must provide corresponding bounding boxes

Full Stack Trace:

    [INFO|modeling_utils.py:2275] 2023-02-14 11:04:11,403 >> loading weights file pytorch_model.bin from cache at /home/datax/.cache/huggingface/hub/models--microsoft--layoutlmv3-base/snapshots/07c9b0838ccc7b49f4c284ccc96113d1dc527ff4/pytorch_model.bin
    [INFO|modeling_utils.py:2857] 2023-02-14 11:04:12,415 >> All model checkpoint weights were used when initializing LayoutLMv3ForTokenClassification.

    [WARNING|modeling_utils.py:2859] 2023-02-14 11:04:12,416 >> Some weights of LayoutLMv3ForTokenClassification were not initialized from the model checkpoint at microsoft/layoutlmv3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
      0%|          | 0/1 [00:00<?, ?ba/s]
    Traceback (most recent call last):
      File "./examples/run_funsd_cord.py", line 525, in <module>
        main()
      File "./examples/run_funsd_cord.py", line 375, in main
        train_dataset = train_dataset.map(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2815, in map
        return self._map_single(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 546, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 513, in wrapper
        out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
        out = func(self, *args, **kwargs)
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3236, in _map_single
        batch = apply_function_on_filtered_inputs(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 3112, in apply_function_on_filtered_inputs
        processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
      File "./examples/run_funsd_cord.py", line 315, in tokenize_and_align_labels
        tokenized_inputs = tokenizer(
      File "/home/datax/projects/unilm/venv/lib/python3.8/site-packages/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py", line 310, in __call__
        raise ValueError("You must provide corresponding bounding boxes")
    ValueError: You must provide corresponding bounding boxes
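For context: LayoutLMv3's fast tokenizer always requires one bounding box per word when given pretokenized input, which is exactly the check that fires at tokenization_layoutlmv3_fast.py line 310. A minimal sketch of the failing vs. working call (the words and box coordinates below are invented for illustration; LayoutLMv3 expects boxes normalized to a 0-1000 page coordinate range):

    from transformers import LayoutLMv3TokenizerFast

    tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")

    # Pretokenized words with one made-up box per word (0-1000 scale).
    words = ["Invoice", "Total:", "42.00"]
    boxes = [[10, 10, 90, 30], [10, 40, 80, 60], [90, 40, 150, 60]]

    # Calling tokenizer(words) without boxes raises:
    #   ValueError: You must provide corresponding bounding boxes
    encoding = tokenizer(text=words, boxes=boxes, truncation=True)
    print(encoding.tokens())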

superxii avatar Feb 14 '23 03:02 superxii

same question

Forrest-ht avatar Mar 16 '23 09:03 Forrest-ht

same question!

TK12331 avatar Jul 12 '23 10:07 TK12331

You can fix this by modifying tokenize_and_align_labels in run_funsd_cord.py (tested on transformers 4.30.2).

We pass the words/boxes/word_labels directly to the tokenizer:

    def tokenize_and_align_labels(examples, augmentation=False):
        images = examples["image"]
        words = examples["tokens"]
        boxes = examples["bboxes"]
        word_labels = examples["ner_tags"]

        # Pass the pretokenized words together with their bounding boxes and
        # per-word labels; the LayoutLMv3 tokenizer requires the boxes.
        tokenized_inputs = tokenizer(
            text=words,
            boxes=boxes,
            word_labels=word_labels,
            padding=False,
            truncation=True,
            return_overflowing_tokens=True,
            # The texts in our dataset are lists of words (with a label for
            # each word), so is_split_into_words is not needed here.
            # is_split_into_words=True,
        )
    
 ....  Continued  ....
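Once tokenize_and_align_labels is patched, it can be applied with the usual datasets.map pattern. A minimal sketch, not the literal code from run_funsd_cord.py; it assumes the raw dataset has the same columns as in the snippet above:

    # Batched preprocessing; remove_columns drops the raw "image"/"tokens"/
    # "bboxes"/"ner_tags" columns so only the tokenizer outputs remain.
    train_dataset = train_dataset.map(
        tokenize_and_align_labels,
        batched=True,
        remove_columns=train_dataset.column_names,
    )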


gregbugaj avatar Jul 14 '23 08:07 gregbugaj