sbnltk
Bangla NLP toolkit. Bangla NER, POStag, Stemmer, Word embedding, sentence embedding, summarization, preprocessor, sentiment analysis, etc.
Please use Colab to avoid setup problems. For the transformer models, install simpletransformers first, or use bn_nlp for the static models. The datasets and training details are uploaded to my GitHub. There is a known problem in the sentiment analyzer; I will fix it soon.
SBNLTK
SUST-Bangla Natural Language toolkit. A Python module for Bangla NLP tasks.
Demo Version : 2.0.2
Requires Python 3.6 or later. Use a virtual environment to avoid unnecessary issues.
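Before installing, it can help to confirm both requirements from inside Python. The following is a minimal sketch and not part of sbnltk; the helper name `check_environment` is made up for illustration, and the virtual environment check assumes the environment was created with the standard `venv` module.

```python
import sys


def check_environment(min_version=(3, 6)):
    """Return (python_ok, in_virtualenv) for the running interpreter."""
    python_ok = sys.version_info >= min_version
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    in_virtualenv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return python_ok, in_virtualenv


if __name__ == "__main__":
    python_ok, in_virtualenv = check_environment()
    print("Python 3.6+:", python_ok)
    print("Virtual environment active:", in_virtualenv)
```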
INSTALLATION
PYPI INSTALLATION
pip3 install sbnltk
pip3 install simpletransformers
pip3 install fasttext
pip3 install scikit-learn
MANUAL INSTALLATION FROM GITHUB
- Clone this project
- Install all the requirements
- Call the setup.py from terminal
What will you get here?
- Bangla Text Preprocessor
- Bangla word dust, punctuation, and stop word removal
- Bangla word sorting according to Bangla or English alphabet
- Bangla word normalization
- Bangla word stemmer
- Bangla Sentiment analysis (Logistic Regression, LinearSVC, Multinomial Naive Bayes, Random Forest)
- Bangla Sentiment analysis with Bert
- Bangla sentence pos tagger (static, sklearn)
- Bangla sentence pos tagger with BERT(Multilingual-cased,Multilingual uncased)
- Bangla sentence NER(Static,sklearn)
- Bangla sentence NER with BERT(Bert-Cased, Multilingual Cased/Uncased)
- Bangla word word2vec(gensim,glove,fasttext)
- Bangla sentence embedding (Contextual, Transformer/BERT)
- Bangla Document Summarization (Feature based, Contextual, Semantic based)
- Bangla Bi-lingual project (Bangla to English Google translator without IP blocking)
- Bangla document information Extraction
SEE THE CODE DOCS FOR USAGE!
TASKS, MODELS, ACCURACY, DATASET AND DOCS
| TASK | MODEL | ACCURACY | DATASET | About | Code DOCS |
|---|---|---|---|---|---|
| Preprocessor | Punctuation, Stop Word, DUST removal, Word normalization, others | ------ | ----- | docs | |
| Word tokenizers | Basic tokenizers, Customized tokenizers | ---- | ---- | docs | |
| Sentence tokenizers | Basic tokenizers, Customized tokenizers, Sentence Cluster | ----- | ----- | docs | |
| Stemmer | StemmerOP | 85.5% | ---- | docs | |
| Sentiment Analysis | logisticRegression | 88.5% | 20,000+ | docs | |
| | LinearSVC | 82.3% | 20,000+ | docs | |
| | Multinomial_naive_bayes | 84.1% | 20,000+ | docs | |
| | Random Forest | 86.9% | 20,000+ | docs | |
| | BERT | 93.2% | 20,000+ | docs | |
| POS tagger | Static method | 55.5% | 1,40,973 words | docs | |
| | SK-LEARN classification | 81.2% | 6,000+ sentences | docs | |
| | BERT-Multilingual-Cased | 69.2% | 6,000+ | docs | |
| | BERT-Multilingual-Uncased | 78.7% | 6,000+ | docs | |
| NER tagger | Static method | 65.3% | 4,08,837 Entity | docs | |
| | SK-LEARN classification | 81.2% | 65,000+ | docs | |
| | BERT-Cased | 79.2% | 65,000+ | docs | |
| | BERT-Multilingual-Cased | 75.5% | 65,000+ | docs | |
| | BERT-Multilingual-Uncased | 90.5% | 65,000+ | docs | |
| Word Embedding | Gensim-word2vec-100D- 1,00,00,000+ tokens | - | 2,00,00,000+ sentences | docs | |
| | Glove-word2vec-100D- 2,30,000+ tokens | - | 5,00,000 sentences | docs | |
| | fasttext-word2vec-200D 3,00,000+ | - | 5,00,000 sentences | docs | |
| Sentence Embedding | Contextual sentence embedding | - | ----- | docs | |
| | Transformer embedding_hd | - | 3,00,000+ human data | docs | |
| | Transformer embedding_gd | - | 3,00,000+ google data | docs | |
| Extractive Summarization | Feature-based | 70.0% f1 score | ------ | docs | |
| | Transformer sentence sentiment Based | 67.0% | ------ | docs | |
| | Word2vec--sentences contextual Based | 60.0% | ----- | docs | |
| Bi-lingual projects | google translator with large data detector | ---- | ---- | docs | |
| Information Extraction | Static word features | - | | docs | |
| | Semantic and contextual | - | | docs | |
| Bangla Coreference Resolution | - | | | | |
Next releases after testing this demo
Task | Version |
---|---|
Coreference Resolution | v1.1 |
Language translation | v1.1 |
Masked Language model | v1.1 |
Information retrieval Projects | v1.1 |
Entity Segmentation | v1.3 |
Factoid Question Answering | v1.2 |
Question Classification | v1.2 |
sentiment Word embedding | v1.3 |
Many other features | --- |
Package Installation
If you get a module error, install these packages manually.
- simpletransformers
- fasttext
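To find out which packages are missing before hitting a module error, a small check like the one below may help. It is only a sketch: the helper `missing_packages` is hypothetical, and it assumes each package's import name matches the mapping shown (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# pip name -> import name for the packages listed in this README.
# Assumption: sbnltk, simpletransformers and fasttext are imported
# under the same names they are installed with.
REQUIRED = {
    "sbnltk": "sbnltk",
    "simpletransformers": "simpletransformers",
    "fasttext": "fasttext",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    for name in missing_packages():
        print("Missing, install with: pip3 install", name)
```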
Models
Everything is automated here. When you call a model for the first time, it is downloaded automatically.
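The download-on-first-use behaviour described above can be illustrated with a generic caching pattern. This is a sketch of the general idea only, not sbnltk's actual implementation; the function `ensure_model`, its parameters, and any URL or cache directory passed to it are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_model(filename, url, cache_dir, fetch=urlretrieve):
    """Return the local path of a model file, downloading it only if absent."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # First call: fetch the file. Later calls skip this branch
        # and reuse the copy already on disk.
        fetch(url, str(target))
    return target
```

The `fetch` parameter defaults to the standard library's `urlretrieve` and can be swapped for another downloader, which also makes the pattern easy to test offline.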
With GPU or Without GPU
- With a GPU, you can run any model without warnings.
- Without a GPU, you will get some warnings, but they will not affect the results.
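To see in advance which of the two cases applies, you can ask PyTorch, which simpletransformers is built on, whether a CUDA device is visible. This is a minimal sketch; the helper name `gpu_available` is made up, and it simply reports `False` when PyTorch is not installed.

```python
def gpu_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run the transformer models on.
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("GPU available:", gpu_available())
```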
Motivation
With approximately 228 million native speakers and another 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Yet it is still a low-resource language. Why?
Dataset
For all sbnltk datasets and other existing datasets, see this link: Bangla NLP Dataset
Trainer
For training, you can see this Colab Trainer. In the future I will make a Trainer module!
When will full version come?
Very soon. We are working on a paper and on improving our modules. It will be released sequentially.
About accuracy
Accuracy can vary across datasets. We measured our models on random but small-scale datasets, as the human resources for this project are limited.
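Because results depend on the dataset, you may want to measure accuracy on a labelled sample of your own. The sketch below shows the plain accuracy calculation (correct predictions divided by total); the function name and the labels in the usage example are made up for illustration and do not come from sbnltk.

```python
def accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty sample")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


if __name__ == "__main__":
    # Hypothetical labels: three of the four predictions match.
    print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.75
```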
Contribute Here
- If you find any issue, please create an issue or contact me.