Polish NLP resources
This repository contains pre-trained models and language resources for Natural Language Processing in Polish created during my research. Some of the models are also available on Huggingface Hub.
If you'd like to use any of these resources in your research, please cite:
@Misc{polish-nlp-resources,
author = {S{\l}awomir Dadas},
title = {A repository of Polish {NLP} resources},
howpublished = {Github},
year = {2019},
url = {https://github.com/sdadas/polish-nlp-resources/}
}
Contents
- Word embeddings
- Language models
- Sentence encoders
- Machine translation models
- Dictionaries and lexicons
- Links to external resources
Word embeddings
The following section includes pre-trained word embeddings for Polish. Each model was trained on a corpus consisting of a Polish Wikipedia dump and Polish books and articles, 1.5 billion tokens in total.
Word2Vec
Word2Vec trained with Gensim. 100 dimensions, negative sampling, contains lemmatized words with 3 or more occurrences in the corpus, and additionally a set of pre-defined punctuation symbols, all numbers from 0 to 10,000, and Polish forenames and last names. The archive contains the embedding in Gensim binary format. Example of usage:
```python
from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load("word2vec_polish.bin")
    print(word2vec.similar_by_word("bierut"))
    # [('cyrankiewicz', 0.818274736404419), ('gomułka', 0.7967918515205383), ('raczkiewicz', 0.7757788896560669), ('jaruzelski', 0.7737460732460022), ('pużak', 0.7667238712310791)]
```
Download (Google Drive) or Download (GitHub)
FastText
FastText trained with Gensim. Vocabulary and dimensionality are identical to the Word2Vec model. The archive contains the embedding in Gensim binary format. Example of usage:
```python
from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load("fasttext_100_3_polish.bin")
    print(word2vec.similar_by_word("bierut"))
    # [('bieruty', 0.9290274381637573), ('gierut', 0.8921363353729248), ('bieruta', 0.8906412124633789), ('bierutow', 0.8795544505119324), ('bierutowsko', 0.839280366897583)]
```
Download (Google Drive) (v2, trained with Gensim 3.8.0)
Download (Google Drive) (v1, trained with Gensim 3.5.0, DEPRECATED)
GloVe
Global Vectors for Word Representation (GloVe) trained using the reference implementation from Stanford NLP. 100 dimensions, contains lemmatized words with 3 or more occurrences in the corpus. Example of usage:
```python
from gensim.models import KeyedVectors

if __name__ == '__main__':
    word2vec = KeyedVectors.load_word2vec_format("glove_100_3_polish.txt")
    print(word2vec.similar_by_word("bierut"))
    # [('cyrankiewicz', 0.8335597515106201), ('gomułka', 0.7793121337890625), ('bieruta', 0.7118682861328125), ('jaruzelski', 0.6743760108947754), ('minc', 0.6692837476730347)]
```
Download (Google Drive) or Download (GitHub)
High dimensional word vectors
Pre-trained vectors using the same vocabulary as above, but with higher dimensionality. These vectors are better suited for representing larger chunks of text, such as sentences or documents, with simple word aggregation methods (averaging, max pooling etc.), since more semantic information is preserved this way. A small aggregation sketch is shown after the download links below.
GloVe - 300d: Part 1 (GitHub), 500d: Part 1 (GitHub) Part 2 (GitHub), 800d: Part 1 (GitHub) Part 2 (GitHub) Part 3 (GitHub)
Word2Vec - 300d (OneDrive), 500d (OneDrive), 800d (OneDrive)
FastText - 300d (OneDrive), 500d (OneDrive), 800d (OneDrive)
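As a quick illustration of the aggregation approach mentioned above, the sketch below averages word vectors into a single sentence vector. It assumes one of the higher-dimensional Word2Vec archives unpacked to a Gensim binary file; the file name used here is a placeholder, adjust it to the downloaded archive.

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder file name; use the actual file from the downloaded archive.
word2vec = KeyedVectors.load("word2vec_300_3_polish.bin")

def sentence_vector(words):
    # Average the vectors of in-vocabulary words; fall back to a zero vector
    # for sentences with no known words. Note that the vocabulary is lemmatized.
    vectors = [word2vec[word] for word in words if word in word2vec]
    return np.mean(vectors, axis=0) if vectors else np.zeros(word2vec.vector_size)

if __name__ == '__main__':
    print(sentence_vector(["drugi", "wojna", "światowy"]))
```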
Compressed Word2Vec
This is a compressed version of the Word2Vec embedding model described above. For compression, we used the method described in Compressing Word Embeddings via Deep Compositional Code Learning by Shu and Nakayama. Compressed embeddings are suited for deployment on storage-poor devices such as mobile phones. The model weighs 38 MB, only 4.4% of the size of the original Word2Vec embeddings. Although the authors of the article claimed that compressing with their method doesn't hurt model performance, we noticed a slight but acceptable drop in accuracy when using the compressed version of the embeddings. Sample decoder class with usage:
```python
import gzip
from typing import Dict, Callable

import numpy as np


class CompressedEmbedding(object):

    def __init__(self, vocab_path: str, embedding_path: str, to_lowercase: bool=True):
        self.vocab_path: str = vocab_path
        self.embedding_path: str = embedding_path
        self.to_lower: bool = to_lowercase
        self.vocab: Dict[str, int] = self.__load_vocab(vocab_path)
        embedding = np.load(embedding_path)
        self.codes: np.ndarray = embedding[embedding.files[0]]
        self.codebook: np.ndarray = embedding[embedding.files[1]]
        self.m = self.codes.shape[1]
        self.k = int(self.codebook.shape[0] / self.m)
        self.dim: int = self.codebook.shape[1]

    def __load_vocab(self, vocab_path: str) -> Dict[str, int]:
        open_func: Callable = gzip.open if vocab_path.endswith(".gz") else open
        with open_func(vocab_path, "rt", encoding="utf-8") as input_file:
            return {line.strip(): idx for idx, line in enumerate(input_file)}

    def vocab_vector(self, word: str):
        if word == "<pad>": return np.zeros(self.dim)
        val: str = word.lower() if self.to_lower else word
        index: int = self.vocab.get(val, self.vocab["<unk>"])
        codes = self.codes[index]
        code_indices = np.array([idx * self.k + offset for idx, offset in enumerate(np.nditer(codes))])
        return np.sum(self.codebook[code_indices], axis=0)


if __name__ == '__main__':
    word2vec = CompressedEmbedding("word2vec_100_3.vocab.gz", "word2vec_100_3.compressed.npz")
    print(word2vec.vocab_vector("bierut"))
```
Download (Google Drive) or Download (GitHub)
Language models
ELMo
Embeddings from Language Models (ELMo) is a contextual embedding model presented in Deep contextualized word representations by Peters et al. A sample usage with PyTorch is shown below; for more detailed instructions on integrating ELMo with your model, please refer to the official repositories github.com/allenai/bilm-tf (Tensorflow) and github.com/allenai/allennlp (PyTorch).
```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder("options.json", "weights.hdf5")
print(elmo.embed_sentence(["Zażółcić", "gęślą", "jaźń"]))
```
Download (Google Drive) or Download (GitHub)
RoBERTa
A language model for Polish based on the popular transformer architecture. We provide weights for the improved BERT language model introduced in RoBERTa: A Robustly Optimized BERT Pretraining Approach. We provide RoBERTa models for Polish in base and large variants. A summary of pre-training parameters for each model is shown in the table below. We release two versions of each model: one in the Fairseq format and the other in the HuggingFace Transformers format. More information about the models can be found in a separate repository.
| Model | L / H / A* | Batch size | Update steps | Corpus size | Fairseq | Transformers |
|---|---|---|---|---|---|---|
| RoBERTa (base) | 12 / 768 / 12 | 8k | 125k | ~20GB | v0.9.0 | v3.4 |
| RoBERTa‑v2 (base) | 12 / 768 / 12 | 8k | 400k | ~20GB | v0.10.1 | v4.4 |
| RoBERTa (large) | 24 / 1024 / 16 | 30k | 50k | ~135GB | v0.9.0 | v3.4 |
| RoBERTa‑v2 (large) | 24 / 1024 / 16 | 2k | 400k | ~200GB | v0.10.2 | v4.14 |
| DistilRoBERTa | 6 / 768 / 12 | 1k | 10ep. | ~20GB | n/a | v4.13 |
* L - the number of encoder blocks, H - hidden size, A - the number of attention heads
Example in Fairseq:
```python
import os
from fairseq.models.roberta import RobertaModel, RobertaHubInterface
from fairseq import hub_utils

model_path = "roberta_large_fairseq"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_path,
    data_name_or_path=model_path,
    bpe="sentencepiece",
    sentencepiece_vocab=os.path.join(model_path, "sentencepiece.bpe.model"),
    load_checkpoint_heads=True,
    archive_map=RobertaModel.hub_models(),
    cpu=True
)
roberta = RobertaHubInterface(loaded['args'], loaded['task'], loaded['models'][0])
roberta.eval()
roberta.fill_mask('Druga wojna światowa zakończyła się w <mask> roku.', topk=1)
roberta.fill_mask('Ludzie najbardziej boją się <mask>.', topk=1)
# [('Druga wojna światowa zakończyła się w 1945 roku.', 0.9345270991325378, ' 1945')]
# [('Ludzie najbardziej boją się śmierci.', 0.14140743017196655, ' śmierci')]
```
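For the HuggingFace Transformers releases, the standard fill-mask pipeline should work as well. The sketch below is an assumption rather than official usage: the directory name roberta_base_transformers stands for the unpacked Transformers archive, and depending on the archive version the tokenizer files may need to be loaded explicitly.

```python
from transformers import pipeline

# A minimal sketch, assuming the unpacked Transformers archive can be loaded
# directly; the directory name below is a placeholder.
fill_mask = pipeline("fill-mask", model="roberta_base_transformers")
print(fill_mask("Druga wojna światowa zakończyła się w <mask> roku."))
```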
It is recommended to use the above models, but it is still possible to download our old model, trained on a smaller batch size (2K) and a smaller corpus (15GB).
BART
BART is a transformer-based sequence-to-sequence model trained with a denoising objective. It can be used for fine-tuning on prediction tasks, just like regular BERT, as well as for various text generation tasks such as machine translation, summarization, paraphrasing etc. We provide a Polish version of the BART base model, trained on a large corpus of texts extracted from Common Crawl (200+ GB). More information on the BART architecture can be found in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Example in HuggingFace Transformers:
```python
import os
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

model_dir = "bart_base_transformers"
tok = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BartForConditionalGeneration.from_pretrained(model_dir)
sent = "Druga<mask>światowa zakończyła się w<mask>roku kapitulacją hitlerowskich<mask>"
batch = tok(sent, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
print(tok.batch_decode(generated_ids, skip_special_tokens=True))
# ['Druga wojna światowa zakończyła się w 1945 roku kapitulacją hitlerowskich Niemiec.']
```
Download for Fairseq v0.10 or HuggingFace Transformers v4.0.
GPT-2
GPT-2 is a unidirectional transformer-based language model trained with an auto-regressive objective, originally introduced in the Language Models are Unsupervised Multitask Learners paper. The original English GPT-2 was released in four sizes, differing in the number of parameters: small (112M), medium (345M), large (774M), xl (1.5B). We provide Polish versions of the medium and large GPT-2 models. Example in Fairseq:
```python
import os
from fairseq import hub_utils
from fairseq.models.transformer_lm import TransformerLanguageModel

model_dir = "gpt2_medium_fairseq"
loaded = hub_utils.from_pretrained(
    model_name_or_path=model_dir,
    checkpoint_file="model.pt",
    data_name_or_path=model_dir,
    bpe="hf_byte_bpe",
    bpe_merges=os.path.join(model_dir, "merges.txt"),
    bpe_vocab=os.path.join(model_dir, "vocab.json"),
    load_checkpoint_heads=True,
    archive_map=TransformerLanguageModel.hub_models()
)
model = hub_utils.GeneratorHubInterface(loaded["args"], loaded["task"], loaded["models"])
model.eval()
result = model.sample(
    ["Policja skontrolowała trzeźwość kierowców"],
    beam=5, sampling=True, sampling_topk=50, sampling_topp=0.95,
    temperature=0.95, max_len_a=1, max_len_b=100, no_repeat_ngram_size=3
)
print(result[0])
# Policja skontrolowała trzeźwość kierowców pojazdów. Wszystko działo się na drodze gminnej, między Radwanowem
# a Boguchowem. - Około godziny 12.30 do naszego komisariatu zgłosił się kierowca, którego zaniepokoiło
# zachowanie kierującego w chwili wjazdu na tą drogę. Prawdopodobnie nie miał zapiętych pasów - informuje st. asp.
# Anna Węgrzyniak z policji w Brzezinach. Okazało się, że kierujący był pod wpływem alkoholu. [...]
```
Download medium or large model for Fairseq v0.10.
Longformer
One of the main constraints of standard Transformer architectures is the limit on the number of input tokens. There are several known models that allow processing of long documents, one of the popular ones being Longformer, introduced in the paper Longformer: The Long-Document Transformer. We provide base and large versions of the Polish Longformer model. The models were initialized with Polish RoBERTa (v2) weights and then fine-tuned on a corpus of long documents, ranging from 1024 to 4096 tokens. Example in Huggingface Transformers:
```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='sdadas/polish-longformer-base-4096')
fill_mask('Stolica oraz największe miasto Francji to <mask>.')
```
Base and large models are available on the Huggingface Hub.
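To further illustrate the long-input capability mentioned above, the sketch below encodes a longer document with the base model and prints the shape of its token-level representations. The repeated example text is only an illustration.

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sdadas/polish-longformer-base-4096")
model = AutoModel.from_pretrained("sdadas/polish-longformer-base-4096")

# An artificially long document, truncated to the 4096-token limit.
long_text = " ".join(["Stolica oraz największe miasto Francji to Paryż."] * 200)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```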
Sentence encoders
Polish transformer-based sentence encoders
The purpose of sentence encoders is to produce a fixed-length vector representation for chunks of text, such as sentences or paragraphs. These models are used in semantic search, question answering, document clustering, dataset augmentation, plagiarism detection, and other tasks which involve measuring semantic similarity between sentences. We share two models based on the Sentence-Transformers library, trained using the distillation method described in the paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. A corpus of 100 million parallel Polish-English sentence pairs from the OPUS project was used to train the models. You can download them from the Huggingface Hub using the links below.
| Student model | Teacher model | Download |
|---|---|---|
| polish-roberta-base-v2 | paraphrase-distilroberta-base-v2 | st-polish-paraphrase-from-distilroberta |
| polish-roberta-base-v2 | paraphrase-mpnet-base-v2 | st-polish-paraphrase-from-mpnet |
A simple example using the Sentence-Transformers library:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ["Bardzo lubię jeść słodycze.", "Uwielbiam zajadać się słodkościami."]
model = SentenceTransformer("sdadas/st-polish-paraphrase-from-mpnet")
results = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(results[0], results[1]))
# tensor([[0.9794]], device='cuda:0')
```
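Since semantic search is mentioned above as one of the applications, here is a minimal sketch of that use case with the same model. The example corpus and the use of sentence_transformers.util.semantic_search are illustrative additions, not part of the original release.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

# Illustrative corpus and query; any collection of Polish sentences works here.
corpus = [
    "Wczoraj padał deszcz przez cały dzień.",
    "Bardzo lubię jeść słodycze.",
    "Nowy film dokumentalny opowiada o historii Warszawy."
]
query = "Uwielbiam zajadać się słodkościami."

model = SentenceTransformer("sdadas/st-polish-paraphrase-from-mpnet")
corpus_emb = model.encode(corpus, convert_to_tensor=True, show_progress_bar=False)
query_emb = model.encode(query, convert_to_tensor=True, show_progress_bar=False)

# Returns, for each query, a ranked list of {'corpus_id': ..., 'score': ...} dicts.
hits = semantic_search(query_emb, corpus_emb, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])
```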
Machine translation models
This section includes pre-trained machine translation models.
Polish-English and English-Polish convolutional models for Fairseq
We provide Polish-English and English-Polish convolutional neural machine translation models trained using the Fairseq sequence modeling toolkit. Both models were trained on a parallel corpus of more than 40 million sentence pairs taken from the OPUS collection. Example of usage (the fairseq, sacremoses and subword-nmt Python packages are required to run this example):
```python
from fairseq.models import BaseFairseqModel

model_path = "/polish-english/"
model = BaseFairseqModel.from_pretrained(
    model_name_or_path=model_path,
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=model_path,
    tokenizer="moses",
    bpe="subword_nmt",
    bpe_codes="code",
    cpu=True
)
print(model.translate(sentence="Zespół astronomów odkrył w konstelacji Panny niezwykłą planetę.", beam=5))
# A team of astronomers discovered an extraordinary planet in the constellation of Virgo.
```
Polish-English convolutional model: Download (GitHub)
English-Polish convolutional model: Download (GitHub)
Dictionaries and lexicons
Polish, English and foreign person names
This lexicon contains 346 thousand forenames and last names labeled as Polish, English or foreign (other), crawled from multiple Internet sources.
Possible labels are: P-N (Polish forename), P-L (Polish last name), E-N (English forename), E-L (English last name), F (foreign / other).
For each word, there is an additional flag indicating whether the name is also used as a common word in Polish (C for common, U for uncommon). A hedged parsing sketch is shown below the download link.
Download (GitHub)
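A minimal sketch for reading such a lexicon is shown below. The column layout (one entry per line with the name, the category label and the common/uncommon flag separated by whitespace) and the file name are assumptions made for illustration; check the downloaded file and adjust the parsing accordingly.

```python
from typing import Dict, Tuple

# Hypothetical layout: one entry per line, whitespace-separated as
# "<name> <label> <flag>", e.g. "kowalski P-L U". Adjust to the real file.
def load_names(path: str) -> Dict[str, Tuple[str, str]]:
    names = {}
    with open(path, "rt", encoding="utf-8") as input_file:
        for line in input_file:
            parts = line.strip().split()
            if len(parts) >= 3:
                name, label, flag = parts[0], parts[1], parts[2]
                names[name] = (label, flag)
    return names

if __name__ == '__main__':
    lexicon = load_names("names.txt")  # hypothetical file name
    print(lexicon.get("kowalski"))
```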
Named entities extracted from SJP.PL
This dictionary consists mostly of the names of settlements, geographical regions, countries, continents and words derived from them (relational adjectives and inhabitant names). Besides that, it also contains names of popular brands, companies and common abbreviations of institutions' names. This resource was created in a semi-automatic way, by extracting the words and their forms from SJP.PL using a set of heuristic rules and then manually filtering out words that weren't named entities.
Download (GitHub)
Links to external resources
Repositories of linguistic tools and resources
- Computational Linguistics in Poland - IPI PAN
- G4.19 Research Group, Wroclaw University of Technology
- CLARIN - repository of linguistic resources
- Gonito.net - evaluation platform with some challenges for Polish
- Awesome NLP Polish (ksopyla)
Publicly available large Polish text corpora (> 1GB)
- OSCAR Corpus (Common Crawl extract)
- CC-100 Web Crawl Data (Common Crawl extract)
- The Polish Parliamentary Corpus
- Redistributable subcorpora of the National Corpus of Polish
- Polish Wikipedia Dumps
- OPUS Parallel Corpora
- Corpus from PolEval 2018 Language Modeling Task
Models supporting Polish language
Sentence analysis (tokenization, lemmatization, POS tagging etc.)
- Stanza - A collection of neural NLP models for many languages from StanfordNLP.
- Trankit - A light-weight transformer-based python toolkit for multilingual natural language processing by the University of Oregon.
- KRNNT and KFTT - Neural morphosyntactic taggers for Polish.
- Morfeusz - A classic Polish morphosyntactic tagger.
- Language Tool - Java-based open source proofreading software for many languages with sentence analysis tools included.
- Stempel - Algorithmic stemmer for Polish.
Machine translation
- Marian-NMT - An efficient C++ based implementation of neural translation models. Many pre-trained models are available, including those supporting Polish: pl-de, pl-en, pl-es, pl-fr, pl-sv, de-pl, es-pl, fr-pl.
- M2M (2021) - A single massive machine translation architecture supporting direct translation for any pair from the list of 100 languages. Details in the paper Beyond English-Centric Multilingual Machine Translation.
- mBART-50 (2021) - A multilingual BART model fine-tuned for machine translation in 50 languages. Three machine translation models were published: many-to-many, English-to-many, and many-to-English. For more information see Multilingual Translation with Extensible Multilingual Pretraining and Finetuning.
- NLLB (2022) - NLLB (No Language Left Behind) is a project by Meta AI aiming to provide machine translation models for over 200 languages. A set of multilingual neural models ranging from 600M to 54.5B parameters is available for download. For more details see No Language Left Behind: Scaling Human-Centered Machine Translation.
Language models
- Multilingual BERT (2018) - BERT (Bidirectional Encoder Representations from Transformers) is a model for generating contextual word representations. Multilingual cased model provided by Google supports 104 languages including Polish.
- XLM-RoBERTa (2019) - Cross-lingual sentence encoder trained on 2.5 terabytes of data from CommonCrawl and Wikipedia. Supports 100 languages including Polish. See Unsupervised Cross-lingual Representation Learning at Scale for details.
- Slavic BERT (2019) - Multilingual BERT model supporting Bulgarian (bg), Czech (cs), Polish (pl) and Russian (ru) languages.
- mT5 (2020) - Google's text-to-text transformer for 101 languages based on the T5 architecture. Details in the paper mT5: A massively multilingual pre-trained text-to-text transformer.
- HerBERT (2020) - Polish BERT-based language model trained by Allegro, available for HuggingFace Transformers in base and large variants.
- plT5 (2021) - Polish version of the T5 model available in small, base and large sizes.
- XLM-RoBERTa-XL and XXL (2021) - Large-scale versions of XLM-RoBERTa models with 3.5 and 10.7 billion parameters respectively. For more information see Larger-Scale Transformers for Multilingual Masked Language Modeling.
- mLUKE (2021) - A multilingual version of LUKE, a Transformer-based language model enriched with entity metadata. The model supports 24 languages including Polish. For more information see mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models.
- XGLM (2021) - A GPT style autoregressive Transformer language model trained on a large-scale multilingual corpus. The model was published in several sizes, but only the 4.5B variant includes Polish language. For more information see Few-shot Learning with Multilingual Language Models.
- PapuGaPT2 (2021) - Polish GPT-like autoregressive models available in base and large sizes.
- mGPT (2022) - Another multilingual GPT style model with 1.3B parameters, covering 60 languages. The model has been trained by Sberbank AI. For more information see mGPT: Few-Shot Learners Go Multilingual.
Sentence encoders
- Universal Sentence Encoder (2019) - USE (Universal Sentence Encoder) generates sentence-level language representations. The pre-trained multilingual model supports 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian).
- LASER Language-Agnostic SEntence Representations (2019) - A multilingual sentence encoder by Facebook Research, supporting 93 languages.
- LaBSE (2020) - Language-agnostic BERT sentence embedding model supporting 109 languages. See Language-agnostic BERT Sentence Embedding for details.
- Sentence Transformers (2020) - Sentence-level models based on the transformer architecture. The library includes multilingual models supporting Polish. More information on multilingual knowledge distillation method used by the authors can be found in Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.
- LASER2 and LASER3 (2022) - New versions of the LASER sentence encoder by Meta AI, developed as part of the NLLB (No Language Left Behind) project. LASER2 supports the same set of languages as the first version of the encoder, which includes Polish. LASER3 adds support for less common languages, mostly low-resource African languages. See Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages for more details.
Optical character recognition (OCR)
- Easy OCR - Optical character recognition toolkit with pre-trained models for over 40 languages, including Polish.
- Tesseract - Popular OCR software developed since the 1980s, supporting over 100 languages. For integration with Python, wrappers such as PyTesseract or OCRMyPDF can be used.
Multimodal models
- Multilingual CLIP (SBert) (2021) - CLIP (Contrastive Language-Image Pre-Training) is a neural network introduced by OpenAI which enables joint vector representations for images and text. It can be used for building image search engines. This is a multilingual version of CLIP trained by the authors of the Sentence-Transformers library.
- Multilingual CLIP (M-CLIP) (2021) - This is yet another multilingual version of CLIP supporting Polish language, trained by the Swedish Institute of Computer Science (SICS).
- LayoutXLM (2021) - A multilingual version of LayoutLMv2 model, pre-trained on 30 million documents in 53 languages. The model combines visual, spatial, and textual modalities to solve prediction problems on visually-rich documents, such as PDFs or DOCs. See LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding and LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding for details.