
KeyError: 'lemma'

Open Bachstelze opened this issue 2 years ago • 2 comments

Following the code from https://trankit.readthedocs.io/en/latest/training.html#training-a-lemmatizer, I get a KeyError: 'lemma':

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

[<ipython-input-9-a90867cc5ef3>](https://localhost:8080/#) in <module>()
     11 
     12 # start training
---> 13 trainer.train()

3 frames

[/content/trankit/trankit/tpipeline.py](https://localhost:8080/#) in train(self)
    680             self._train_posdep()
    681         elif self._task == 'lemmatize':
--> 682             self._train_lemma()
    683         elif self._task == 'ner':
    684             self._train_ner()

[/content/trankit/trankit/tpipeline.py](https://localhost:8080/#) in _train_lemma(self)
    581 
    582     def _train_lemma(self):
--> 583         self._lemma_model.train()
    584 
    585     def _train_ner(self):

[/content/trankit/trankit/models/lemma_model.py](https://localhost:8080/#) in train(self)
    379             self.config.logger.info("Training dictionary-based lemmatizer")
    380             self.trainer.train_dict(
--> 381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
    383                          type(token[ID]) == tuple and len(token[ID]) == 2)])

[/content/trankit/trankit/models/lemma_model.py](https://localhost:8080/#) in <listcomp>(.0)
    381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
--> 383                          type(token[ID]) == tuple and len(token[ID]) == 2)])
    384             dev_preds = self.trainer.predict_dict(
    385                 [[token[TEXT], token[UPOS]] for sentence in self.dev_batch.doc for token in sentence if

KeyError: 'lemma'

The most recent version of https://github.com/UniversalDependencies/UD_Thai-PUD is used as training and development data.
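One way to check whether a treebank carries lemma annotations at all is to count the tokens whose LEMMA column (the third CoNLL-U column) is the placeholder `_`. The following is a minimal sketch using only the standard library; the function name `lemma_coverage` is made up for illustration and is not part of trankit. A treebank in which nearly every token is reported as missing would be consistent with the KeyError above, although that connection is an assumption and not something the traceback confirms by itself.

```python
from collections import Counter


def lemma_coverage(conllu_text):
    """Count word tokens with and without a lemma in CoNLL-U formatted text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        line = line.strip()
        # Skip blank lines (sentence breaks) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A CoNLL-U token line has exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        tok_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in tok_id or "." in tok_id:
            continue
        # "_" marks an unspecified lemma, unless the word form itself is "_".
        if lemma == "_" and form != "_":
            counts["missing"] += 1
        else:
            counts["present"] += 1
    return counts
```

Calling it on the contents of the training file, for example `lemma_coverage(open(path, encoding="utf-8").read())` with `path` pointing at the `.conllu` file, returns a `Counter` with the keys `present` and `missing`.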

Bachstelze avatar May 26 '22 17:05 Bachstelze

There are no lemmas in the training data. So a lemmatizer can't be trained?! Can't I use the other parts of the pipeline? When I run

from trankit import Pipeline
p = Pipeline(lang='customized', cache_dir='./save_dir')

the following error occurs:

BadZipFile: File is not a zip file
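`BadZipFile` is raised by Python's `zipfile` module when a file that is opened as an archive is not a valid zip. The error message here does not say which file was being read, so the following is only a generic diagnostic sketch: it lists files under a directory that have a given suffix but fail `zipfile.is_zipfile`. The helper name `non_zip_files` and the default `.zip` suffix are assumptions for illustration; which files trankit actually expects to be archives is not stated in this thread.

```python
import zipfile
from pathlib import Path


def non_zip_files(root, suffixes=(".zip",)):
    """Return files under root with one of the suffixes that are not valid zip archives."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # is_zipfile checks the file's magic number without extracting anything.
        if not zipfile.is_zipfile(str(path)):
            bad.append(path)
    return bad
```

For example, `non_zip_files('./save_dir')` would report any `.zip` file under the cache directory that is truncated or is not an archive at all, such as an incomplete download.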

Bachstelze avatar May 26 '22 19:05 Bachstelze