awesome-nlp-polish

A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, and datasets.

Table of Contents:

  • Polish text datasets
  • Models and embeddings
  • Language processing tools and libraries
  • Papers, articles, blog posts
  • Contribution

Polish text datasets

Task-oriented datasets

Raw texts

Models and Embeddings

Polish Transformer models

  • Polish RoBERTa Model - trained on a corpus consisting of a Polish Wikipedia dump, Polish books and articles, and the Polish Parliamentary Corpus.
  • PoLitBert - Polish RoBERTa model trained on Polish Wikipedia, Polish literature, and OSCAR. The main assumption is that high-quality text yields a good model.
  • PolBert - Polish BERT model, trained with the code provided in Google's BERT GitHub repository. Merged with huggingface/Transformers.
  • Allegro HerBERT - Polish BERT model trained on Polish corpora using only the MLM objective with dynamic masking of whole words.
  • SlavicBert - multilingual BERT model (BERT, Slavic Cased): 4 languages (Bulgarian, Czech, Polish, Russian), 12-layer, 768-hidden, 12-heads, 110M parameters, 600 MB. There is also another SlavicBert model, http://docs.deeppavlov.ai/en/master/features/models/bert.html, but I had problems converting it to PyTorch.
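
The "dynamic masking of whole words" mentioned for HerBERT can be illustrated with a minimal sketch. This is not HerBERT's training code: `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer, the masking probability is arbitrary, and the usual BERT refinements (replacing some selected tokens with random tokens or leaving them unchanged) are left out. The point is only that all pieces of a word are masked together, and that a fresh mask is sampled on every pass over the data.

```python
import random

MASK = "[MASK]"


def toy_tokenize(word, size=4):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def mask_whole_words(words, tokenize, mask_prob=0.15, rng=None):
    """Return (tokens, labels) where selected words are masked piece by piece.

    labels holds the original piece at masked positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for word in words:
        pieces = tokenize(word)
        if rng.random() < mask_prob:
            # Whole-word masking: every piece of the word is hidden together.
            tokens.extend([MASK] * len(pieces))
            labels.extend(pieces)
        else:
            tokens.extend(pieces)
            labels.extend([None] * len(pieces))
    return tokens, labels


# Dynamic masking: the mask is re-sampled each epoch instead of being
# fixed once during preprocessing.
words = "Warszawa jest stolicą Polski".split()
rng = random.Random(0)
for epoch in range(3):
    tokens, _ = mask_whole_words(words, toy_tokenize, mask_prob=0.3, rng=rng)
    print(epoch, tokens)
```

With static masking the same positions would be hidden in every epoch; here each epoch can hide a different set of words.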

Other models

Language processing tools and libraries

  • Morfologik (Java) and pyMorfologik (Python wrapper) - dictionary-based morphological analyzer.

  • Morfeusz - morphological analyzer. See also the Elasticsearch plugin.

  • Stempel (Python port) - algorithmic stemmer. See also the Elasticsearch plugin.

  • spaCy for Polish - extends spaCy, a popular production-ready NLP library, to fully support the Polish language.

  • spacy-pl by IPI PAN - integrates existing Polish language tools and resources into the spaCy pipeline.

  • KRNNT Polish morphological tagger - morphological tagger for Polish based on recurrent neural networks (Paper).

  • Stanza (Python) - natural language analysis package from Stanford University. It contains tools for sentence and word tokenization, lemmatization (generating base forms of words), part-of-speech and morphological feature tagging, syntactic dependency parsing, and named entity recognition. Includes a Polish model.

  • Duckling (Haskell) - library for parsing text into structured data, with support for Polish.

  • A curated list of Polish abbreviations for the NLTK sentence tokenizer, based on Wikipedia text.
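
To show why an abbreviation list matters for sentence splitting, here is a simplified sketch. It is not NLTK's Punkt algorithm and does not use the listed resource: `split_sentences` is a hypothetical helper and `ABBREVIATIONS` is a tiny hand-picked sample. A period ends a sentence only if the word before it is not a known abbreviation.

```python
import re

# Tiny illustrative sample (assumption); a real list would be far longer.
ABBREVIATIONS = {"np", "tzw", "prof", "ul", "godz"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on ., ! or ? followed by whitespace, skipping known abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words_before = text[start:match.start()].split()
        last_word = words_before[-1].lower() if words_before else ""
        if text[match.start()] == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Spotkanie poprowadzi prof. Kowalski. Odbędzie się przy ul. Długiej."
for sentence in split_sentences(text):
    print(sentence)
```

Without the abbreviation check, the same text would be cut after "prof." and "ul." as well.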

Papers, articles, blog posts

  • Benchmarks of some Polish NLP tools - single-word lemmatization and morphological analysis, multi-word lemmatization, disambiguated POS tagging, dependency parsing, shallow parsing, named entity recognition, summarization, etc.
  • GitHub repo with a list of Polish word embeddings and language models (Word2vec, fastText, GloVe, ELMo) - https://github.com/sdadas/polish-nlp-resources
  • Polish Word Embeddings Review - evaluation of Polish word embeddings (word2vec, fastText, etc.) prepared by various research groups. The evaluation uses a word analogy task.
  • Polish Sentence Evaluation - evaluation of eight sentence representation methods (Word2Vec, GloVe, FastText, ELMo, Flair, BERT, LASER, USE) on five Polish linguistic tasks.
  • TRAINING ROBERTA FROM SCRATCH - THE MISSING GUIDE - a complete guide to training a RoBERTa model for Polish with Huggingface/Transformers.
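
For readers unfamiliar with the word analogy task used to evaluate embeddings, the following is a minimal sketch of the usual vector-offset formulation ("a is to b as c is to ?"). The vectors in `TOY_VECTORS` are made up for illustration, and the function names are hypothetical. The evaluation code in the linked review may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Made-up 3-dimensional vectors, for illustration only.
TOY_VECTORS = {
    "król": [1.0, 1.0, 0.0],
    "królowa": [1.0, 0.0, 1.0],
    "mężczyzna": [0.0, 1.0, 0.0],
    "kobieta": [0.0, 0.0, 1.0],
    "stół": [0.5, 0.5, 0.5],
}
QUESTIONS = [("mężczyzna", "kobieta", "król", "królowa")]

print(analogy_accuracy(TOY_VECTORS, QUESTIONS))
```

Real evaluations run thousands of such questions against pretrained vectors and report the accuracy per question category.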

Contribution

If you have or know of valuable materials (datasets, models, posts, articles) that are missing here, please feel free to edit the list and submit a pull request. You can also send me a note on LinkedIn or via email: [email protected].