[Bug]: RuntimeError: shape '[8, 89, 1024]' is invalid for input of size 705536
Describe the bug
During NER model training with TransformerWordEmbeddings, I run into a RuntimeError for one of my three models.
For a project I train three NER models, and only one of them runs into this issue. This makes me think it is a data issue rather than a code issue; perhaps a fix is needed in the Corpus rather than in the training script.
The bug occurs at the following lines: https://github.com/flairNLP/flair/blob/b1a3e24ddec85ce62e007e1d44f8a9419215393d/flair/models/sequence_tagger_model.py#L366-L372
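For context, the target shape in the error asks for 8 × 89 × 1024 = 729,088 values (len(sentences) = 8, longest_token_sequence_in_batch = 89, 1024-dimensional embeddings), while the concatenated embeddings only hold 705,536 = 689 × 1024 values, so 23 token embeddings appear to be missing from the batch.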
Printing the sentences at this stage gives:
pprint.pprint([sentence.text for sentence in sentences])
['1',
'i',
'I l »',
'« I » I l',
'[DOCSEP]',
'\ufeff',
'N ° d’ entreprise : Nom ( en entier ) ; ( en abrégé ) : Forme légale : '
'Adresse complète du siège : ThaïBoxing Discovery Association Sans But '
'Lucratif Place Terdelt , 2 boîte 11 à 1030 Schaerbeek Objet de l’ acte : '
"Constitution d' une ASBL { Association Sans But Lucratif ) Texte Les "
'soussignés : 1 , \t BUNRAD Anatpong , domicilié Rue de la Roche , 1 à 1301 '
'Bîerges , né à Udon Thani ( Thaïlande ) le 14 juin 1986 ; 2 .',
'DUBREUCQ David , domicilié Place Terdelt 2 boîte 11 à 1030 Schaerbeek , né à '
'üccle le 14 décembre 1977 ; 3 .']
It seems that one "sentence" consists only of the invisible character \ufeff (a byte order mark), i.e. it is effectively an empty string. This is how it appears in the training data.
Googling suggests that reading the files with the 'utf-8-sig' encoding would remove this character: https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
Perhaps this bug is similar to https://github.com/flairNLP/flair/issues/1600, where this character also has zero width?
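A minimal sketch of that workaround, assuming the training data is a plain-text CoNLL-style file (the path below is hypothetical): reading with 'utf-8-sig' drops a leading BOM, and the explicit replace removes any stray \ufeff elsewhere in the file. Note that if the \ufeff forms a token of its own, the whole row probably needs to be dropped instead (see the sketch further down in this thread).

import pathlib

# hypothetical path to one of the CoNLL training files
path = pathlib.Path("data/train.txt")

# 'utf-8-sig' silently drops a leading byte order mark;
# the replace() also removes any \ufeff appearing mid-file
text = path.read_text(encoding="utf-8-sig").replace("\ufeff", "")
path.write_text(text, encoding="utf-8")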
To Reproduce
Train a SequenceTagger with TransformerWordEmbeddings on the sentences listed above; a minimal sketch follows.
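A rough reproduction sketch, assuming a CoNLL-style corpus on disk; the embedding model, file layout, and hyperparameters below are assumptions (only the batch size of 8 and the embedding width of 1024 can be inferred from the failing shape):

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# hypothetical data layout: one token and one NER tag per line
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("data/", columns, train_file="train.txt")
label_dict = corpus.make_label_dictionary(label_type="ner")

# 1024-dim transformer guessed from the error shape
embeddings = TransformerWordEmbeddings("xlm-roberta-large", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ner",  # hypothetical output folder
    learning_rate=5e-6,
    mini_batch_size=8,        # matches the 8 in the failing shape
)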
Expected behavior
Training proceeds as normal.
Logs and Stack traces
2023-05-31 04:28:38 : │ ❱ 62 │ │ trainer.fine_tune( │
2023-05-31 04:28:38 : │ 63 │ │ │ output_folder, │
2023-05-31 04:28:38 : │ 64 │ │ │ learning_rate=training_args.learning_rate, │
2023-05-31 04:28:38 : │ 65 │ │ │ mini_batch_size=training_args.mini_batch_size, │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/trainers/trainer.py:899 in fine_tune │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ 896 │ │ **trainer_args, │
2023-05-31 04:28:38 : │ 897 │ ): │
2023-05-31 04:28:38 : │ 898 │ │ │
2023-05-31 04:28:38 : │ ❱ 899 │ │ return self.train( │
2023-05-31 04:28:38 : │ 900 │ │ │ base_path=base_path, │
2023-05-31 04:28:38 : │ 901 │ │ │ learning_rate=learning_rate, │
2023-05-31 04:28:38 : │ 902 │ │ │ max_epochs=max_epochs, │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/trainers/trainer.py:500 in train │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ 497 │ │ │ │ │ for batch_step in batch_steps: │
2023-05-31 04:28:38 : │ 498 │ │ │ │ │ │ │
2023-05-31 04:28:38 : │ 499 │ │ │ │ │ │ # forward pass │
2023-05-31 04:28:38 : │ ❱ 500 │ │ │ │ │ │ loss = self.model.forward_loss(batch_step) │
2023-05-31 04:28:38 : │ 501 │ │ │ │ │ │ │
2023-05-31 04:28:38 : │ 502 │ │ │ │ │ │ if isinstance(loss, tuple): │
2023-05-31 04:28:38 : │ 503 │ │ │ │ │ │ │ average_over += loss[1] │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:270 in forward_loss │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ 267 │ │ │ return torch.tensor(0.0, dtype=torch.float, device=flair. │
2023-05-31 04:28:38 : │ 268 │ │ │
2023-05-31 04:28:38 : │ 269 │ │ # forward pass to get scores │
2023-05-31 04:28:38 : │ ❱ 270 │ │ scores, gold_labels = self.forward(sentences) # type: ignore │
2023-05-31 04:28:38 : │ 271 │ │ │
2023-05-31 04:28:38 : │ 272 │ │ # calculate loss given scores and labels │
2023-05-31 04:28:38 : │ 273 │ │ return self._calculate_loss(scores, gold_labels) │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:285 in forward │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ 282 │ │ self.embeddings.embed(sentences) │
2023-05-31 04:28:38 : │ 283 │ │ │
2023-05-31 04:28:38 : │ 284 │ │ # make a zero-padded tensor for the whole sentence │
2023-05-31 04:28:38 : │ ❱ 285 │ │ lengths, sentence_tensor = self._make_padded_tensor_for_batch │
2023-05-31 04:28:38 : │ 286 │ │ │
2023-05-31 04:28:38 : │ 287 │ │ # sort tensor in decreasing order based on lengths of sentenc │
2023-05-31 04:28:38 : │ 288 │ │ sorted_lengths, length_indices = lengths.sort(dim=0, descendi │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:366 in _make_padded_tensor_for_batch │
2023-05-31 04:28:38 : │ │
2023-05-31 04:28:38 : │ 363 │ │ │ │ t = pre_allocated_zero_tensor[: self.embeddings.embed │
2023-05-31 04:28:38 : │ 364 │ │ │ │ all_embs.append(t) │
2023-05-31 04:28:38 : │ 365 │ │ │
2023-05-31 04:28:38 : │ ❱ 366 │ │ sentence_tensor = torch.cat(all_embs).view( │
2023-05-31 04:28:38 : │ 367 │ │ │ [ │
2023-05-31 04:28:38 : │ 368 │ │ │ │ len(sentences), │
2023-05-31 04:28:38 : │ 369 │ │ │ │ longest_token_sequence_in_batch, │
2023-05-31 04:28:38 : ╰────────────────────────────────────────────────────────────────
Screenshots
No response
Additional Context
No response
Environment
Flair version: 0.11.3
Torch version: 1.13.1
Transformers version: 4.29.2
@alanakbik I managed to resolve it by removing this invisible \ufeff character.
Could this token also be removed automatically if it is part of a sentence?
e.g.
George B-PER
Washington E-PER
\ufeff O
went O
to O
Washington S-LOC
becomes:
George B-PER
Washington E-PER
went O
to O
Washington S-LOC
I do not know where to modify this in the repo; otherwise I would open a PR.
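This is not part of flair itself, but as a preprocessing sketch (hypothetical helper and file names, assuming whitespace-separated token/tag columns), rows whose token is only \ufeff could be dropped before building the corpus:

def strip_bom_tokens(in_path: str, out_path: str) -> None:
    # copy a CoNLL-style file, dropping rows whose token is only \ufeff
    with open(in_path, encoding="utf-8-sig") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            token = line.split()[0] if line.strip() else ""
            # keep blank lines (sentence separators) and real tokens
            if not line.strip() or token.strip("\ufeff"):
                dst.write(line)

# hypothetical file names
strip_bom_tokens("train.txt", "train_clean.txt")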
Hi @Guust-Franssens, seeing that you are not on the latest version, do you want to try this again on the master branch? There are already some improvements for similar issues, and I think yours could already be solved as well.