Bahasa-NLP-Tensorflow
Gathers Tensorflow deep learning models for Bahasa Malaysia NLP problems
All code is simplified inside Jupyter Notebooks, with the datasets 100% included.
Table of contents
- Augmentation
- Sparse classification
- Normal-text classification
- Long-text classification
- Dependency Parsing
- English-Malay Translation
- Entity Tagging
- POS Tagging
- Abstractive Summarization
- Extractive Summarization
- Optical Character Recognition
- Question-Answer
- Semantic Similarity
- Speech to Text
- Text to Speech
- Stemming
- Topic Generator
- Topic Modeling
- Word Vector
Augmentation
- word2vec Malaya
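A common word2vec-based augmentation is to replace words with their nearest neighbours in the embedding space. A minimal pure-Python sketch; the vocabulary and vectors below are made-up toy values, not the actual pretrained word2vec Malaya embeddings:

```python
import math

# Toy stand-ins for pretrained word vectors (hypothetical values).
EMBEDDINGS = {
    "saya": [0.9, 0.1, 0.0],
    "aku": [0.85, 0.15, 0.05],
    "makan": [0.1, 0.9, 0.1],
    "minum": [0.15, 0.85, 0.2],
    "nasi": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_synonym(word):
    """Most similar other word in the embedding space."""
    if word not in EMBEDDINGS:
        return word
    candidates = [(cosine(EMBEDDINGS[word], vec), w)
                  for w, vec in EMBEDDINGS.items() if w != word]
    return max(candidates)[1]

def augment(sentence):
    """Replace every known word with its nearest neighbour."""
    return " ".join(nearest_synonym(w) for w in sentence.split())

print(augment("saya makan nasi"))
```

A real pipeline would replace only a random subset of words and keep a similarity threshold, so the augmented sentence stays close to the original meaning.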
Sparse classification
Trained on the Tatoeba dataset.
- Fast-text Ngrams, test accuracy 88%
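Fast-text gets its robustness to rare words from character n-grams: each word is padded with boundary markers and split into overlapping subword n-grams, whose vectors are summed. A minimal sketch of the extraction step (the 3-to-5 range here is illustrative; the library's own defaults may differ):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """fastText-style subword n-grams: pad the word with < and >,
    then take every character n-gram of length n_min..n_max."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("nasi"))
```

The boundary markers let the model distinguish a prefix like `<na` from the same characters mid-word.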
Normal-text classification
Trained on a Bahasa subjectivity dataset.
- RNN LSTM + Bahdanau Attention, test accuracy 84%
- RNN LSTM + Luong Attention, test accuracy 82%
- Transfer-learning Multilanguage BERT, test accuracy 94.88%
70+ more models are available here.
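The two attention variants above differ only in how alignment scores are computed: Luong scores are a dot product between the decoder query and each encoder key, while Bahdanau scores pass their combination through tanh and a learned vector. A toy sketch with the learned weight matrices folded away (the vectors and `v` below are illustrative values, not trained parameters):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def luong_dot_scores(query, keys):
    """Luong (multiplicative) attention: score = query . key."""
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])

def bahdanau_scores(query, keys, v=(0.5, -0.5)):
    """Bahdanau (additive) attention: score = v . tanh(query + key)."""
    scores = []
    for key in keys:
        hidden = [math.tanh(q + k) for q, k in zip(query, key)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))
    return softmax(scores)

keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]
print(luong_dot_scores(query, keys))
print(bahdanau_scores(query, keys))
```

Either way the output is a probability distribution over encoder positions, used to build the context vector for the next decoder step.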
Long-text classification
Trained on a Bahasa fake-news dataset.
- Dilated CNN, test accuracy 74%
- Wavenet, test accuracy 68%
- BERT Multilanguage, test accuracy 85%
- BERT-Bahasa Base, test accuracy 88%
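Dilated CNNs suit long text because each filter taps inputs spaced `dilation` steps apart, so stacked layers cover exponentially long context without deep recurrence. A minimal 1-D sketch of the operation itself:

```python
def dilated_conv1d(xs, kernel, dilation):
    """1-D dilated (atrous) convolution with 'valid' padding:
    each output mixes inputs `dilation` steps apart."""
    span = (len(kernel) - 1) * dilation
    out = []
    for i in range(len(xs) - span):
        out.append(sum(k * xs[i + j * dilation] for j, k in enumerate(kernel)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7, 8]
print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # -> [9, 12, 15, 18]
```

With dilation rates 1, 2, 4, 8 per layer, a stack of size-3 filters sees dozens of tokens while each layer stays cheap.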
Dependency Parsing
Trained on a Bahasa dependency-parsing dataset, split 80% train / 20% test.
Reported numbers are arc, type, and root accuracies after only 10 epochs.
- Bidirectional RNN + CRF + Biaffine, arc accuracy 60.64%, types accuracy 58.68%, root accuracy 89.03%
- Bidirectional RNN + Bahdanau + CRF + Biaffine, arc accuracy 60.51%, types accuracy 59.01%, root accuracy 88.99%
- Bidirectional RNN + Luong + CRF + Biaffine, arc accuracy 60.60%, types accuracy 59.06%, root accuracy 89.76%
- BERT Base + CRF + Biaffine, arc accuracy 58.55%, types accuracy 58.12%, root accuracy 88.87%
- Bidirectional RNN + Biaffine Attention + Cross Entropy, arc accuracy 69.53%, types accuracy 65.38%, root accuracy 90.71%
- BERT Base + Biaffine Attention + Cross Entropy, arc accuracy 77.03%, types accuracy 66.73%, root accuracy 88.38%
- XLNET Base + Biaffine Attention + Cross Entropy, arc accuracy 93.50%, types accuracy 92.48%, root accuracy 94.46%
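The biaffine-attention parsers above score every (dependent, head) pair with a bilinear term plus a head-bias term, then pick the best head per dependent. A toy sketch with 2-dimensional token representations (the matrices below are illustrative, not learned parameters):

```python
def biaffine_arc_scores(heads, deps, U, b):
    """Biaffine scoring: score[d][h] = dep_d^T U head_h + b . head_h."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    scores = []
    for d in deps:
        row = []
        for h in heads:
            Uh = matvec(U, h)
            row.append(sum(di * u for di, u in zip(d, Uh)) +
                       sum(bi * hi for bi, hi in zip(b, h)))
        scores.append(row)
    return scores

# Two-token toy "sentence" with 2-dim representations.
heads = [[1.0, 0.0], [0.0, 1.0]]
deps = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]   # identity, so aligned dep/head pairs score high
b = [0.1, 0.0]
scores = biaffine_arc_scores(heads, deps, U, b)
predicted_heads = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(predicted_heads)  # -> [0, 1]
```

In the cross-entropy variants, each row of scores is softmaxed over candidate heads and trained against the gold head index; the CRF variants instead decode a globally consistent tree.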
English-Malay Translation
Trained on a 100k English-Malay sentence-pair dataset.
- Attention Is All You Need, train accuracy 19.09%, test accuracy 20.38%
- BiRNN Seq2Seq + Luong Attention + Beam decoder, train accuracy 45.2%, test accuracy 37.26%
- Convolutional Encoder-Decoder, train accuracy 35.89%, test accuracy 30.65%
- Dilated Convolutional Encoder-Decoder, train accuracy 82.3%, test accuracy 56.72%
- Dilated Convolutional Encoder-Decoder + Self-Attention, train accuracy 60.76%, test accuracy 36.59%
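The beam decoder used above keeps the k best partial translations at each step instead of only the single greedy best. A sketch over a fixed table of per-step log-probabilities; a real decoder would condition each step's distribution on the prefix decoded so far:

```python
import math

def beam_search(step_log_probs, beam_width=2):
    """Beam decoding: expand every surviving hypothesis with every
    token, keep the `beam_width` highest-scoring, repeat."""
    beams = [([], 0.0)]  # (token sequence, cumulative log prob)
    for log_probs in step_log_probs:
        candidates = []
        for seq, score in beams:
            for token, lp in enumerate(log_probs):
                candidates.append((seq + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Three decoding steps over a 3-token toy vocabulary.
steps = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
best_seq, best_score = beam_search(steps)
print(best_seq)  # -> [0, 1, 2]
```

With prefix-independent distributions like this toy table, beam search collapses to greedy decoding; the benefit appears when an early low-probability token leads to a better overall sequence.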
Entity Tagging
Trained on a Bahasa entity-tagging dataset.
- Bidirectional LSTM + CRF, test accuracy 95.10%
- Bidirectional LSTM + CRF + Bahdanau, test accuracy 94.34%
- Bidirectional LSTM + CRF + Luong, test accuracy 94.84%
- BERT Multilanguage, test accuracy 96.43%
- BERT-Bahasa Base, test accuracy 98.11%
- BERT-Bahasa Small, test accuracy 98.47%
- XLNET-Bahasa Base, test accuracy 98.008%
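The CRF layer on top of the BiLSTM scores whole tag sequences, adding learned tag-to-tag transition scores to the per-token emission scores, and decodes the best sequence with Viterbi. A stdlib sketch of the decoding step (emission and transition values below are made up for illustration):

```python
def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF: the tag sequence
    maximising emission scores plus transition scores."""
    n_tags = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for em in emissions[1:]:
        new_scores, bp = [], []
        for tag in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: scores[p] + transitions[p][tag])
            bp.append(best_prev)
            new_scores.append(scores[best_prev]
                              + transitions[best_prev][tag] + em[tag])
        scores, backpointers = new_scores, backpointers + [bp]
    best_last = max(range(n_tags), key=scores.__getitem__)
    path = [best_last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

# Tags: 0 = O, 1 = PERSON. Transitions reward staying inside an entity.
emissions = [[2.0, 0.5], [0.5, 2.0], [0.5, 2.0]]
transitions = [[0.5, -0.5], [-0.5, 1.0]]
print(viterbi(emissions, transitions))  # -> [0, 1, 1]
```

This is why CRF outputs rarely contain illegal tag sequences that a per-token softmax would happily emit.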
POS Tagging
Trained on a Bahasa part-of-speech dataset.
- Bidirectional LSTM + CRF
- Bidirectional LSTM + CRF + Bahdanau
- Bidirectional LSTM + CRF + Luong
- Bert-Bahasa-Base + CRF, test accuracy 95.17%
- XLNET-Bahasa-Base + CRF, test accuracy 95.58%
Abstractive Summarization
Trained on a Malaysian news dataset.
Accuracy is ROUGE-2 after only 20 epochs.
- Dilated Seq2Seq, test accuracy 23.926%
- Pointer Generator + Bahdanau Attention, test accuracy 15.839%
- Pointer Generator + Luong Attention, test accuracy 26.23%
- Dilated Seq2Seq + Pointer Generator, test accuracy 20.382%
- BERT Multilanguage + Dilated CNN Seq2seq + Pointer Generator, test accuracy 23.7134%
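ROUGE-2 measures bigram overlap between the generated and reference summaries. A minimal recall-style sketch; the repo does not state the exact ROUGE variant used, so treat this as illustrative:

```python
def rouge_2(candidate, reference):
    """ROUGE-2 recall: fraction of reference bigrams that also
    appear in the candidate summary, with clipped counts."""
    def bigrams(tokens):
        counts = {}
        for pair in zip(tokens, tokens[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts
    cand = bigrams(candidate.split())
    ref = bigrams(reference.split())
    overlap = sum(min(n, cand.get(bg, 0)) for bg, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_2("kerajaan umum bajet baru hari ini",
              "kerajaan umum bajet baru semalam"))  # -> 0.75
```

Reported ROUGE implementations usually also compute precision and F1 over multiple references; recall alone is the simplest form.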
Extractive Summarization
Trained on a Malaysian news dataset.
- Skip-thought
- Residual Network + Bahdanau Attention
Optical Character Recognition
Trained on a Jawi-to-Malay OCR dataset.
- CNN + LSTM RNN, test accuracy 63.86%
- Im2Latex, test accuracy 89.38%
Question-Answer
Trained on a Bahasa question-answer dataset.
- End-to-End + GRU, test accuracy 89.17%
- Dynamic Memory + GRU, test accuracy 98.86%
Semantic Similarity
Trained on a translated Quora duplicate-questions dataset.
- LSTM Bahdanau + Contrastive loss, test accuracy 79%
- Dilated CNN + Contrastive loss, test accuracy 77%
- Self-Attention + Contrastive loss, test accuracy 77%
- BERT + Cross entropy, test accuracy 83%
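The contrastive-loss models above embed both questions with a shared encoder and train on the distance between the two embeddings: duplicate pairs are pulled together, non-duplicates pushed at least a margin apart. A sketch of the loss itself (the toy embeddings are illustrative):

```python
import math

def contrastive_loss(dist, is_similar, margin=1.0):
    """Contrastive loss on the distance between two embeddings:
    similar pairs -> dist^2, dissimilar -> max(0, margin - dist)^2."""
    if is_similar:
        return dist ** 2
    return max(0.0, margin - dist) ** 2

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy sentence embeddings for a duplicate question pair.
q1, q2 = [0.9, 0.1], [0.8, 0.2]
print(contrastive_loss(euclidean(q1, q2), is_similar=True))
```

At inference time a distance threshold on the embeddings decides duplicate vs. not, whereas the BERT cross-entropy model classifies the concatenated pair directly.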
Speech to Text
Trained on a Kamus speech dataset.
- BiRNN + LSTM + CTC Greedy, test accuracy 72.03%
- Wavenet, test accuracy 10.21%
- Deep speech 2, test accuracy 56.51%
- Dilated-CNN, test accuracy 59.31%
- Im2Latex, test accuracy 58.59%
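CTC greedy decoding, used by the first model above, takes the best label per audio frame, collapses consecutive repeats, and drops the blank symbol; that is what lets the network emit one label per frame without frame-level alignments. A stdlib sketch of the collapse step:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """CTC greedy decoding over per-frame argmax labels:
    collapse consecutive repeats, then remove blanks."""
    decoded = []
    prev = None
    for label in frame_ids:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-frame best labels for a toy utterance; 0 is the CTC blank.
frames = [1, 1, 0, 1, 2, 2, 0, 0, 3]
print(ctc_greedy_decode(frames))  # -> [1, 1, 2, 3]
```

Note how the blank between the two 1s preserves a genuine repeated label, while the consecutive 2s collapse to one.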
Text to Speech
- Tacotron
- Seq2Seq + Bahdanau Attention
- Deep CNN + Monotonic Attention + Dilated CNN vocoder
Stemming
Trained on a stemming dataset.
- Seq2seq + Beam decoder
- Seq2seq + Bahdanau Attention + Beam decoder
- Seq2seq + Luong Attention + Beam decoder
Topic Generator
Trained on a Malaysian news dataset.
- TAT-LSTM, test accuracy 32.89%
- TAV-LSTM, test accuracy 40.69%
- MTA-LSTM, test accuracy 32.96%
Topic Modeling
- Lda2Vec
Word Vector
- word2vec
- ELMO
- Fast-text
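All three word-vector models train on (center word, context word) pairs drawn from a sliding window. A sketch of the skip-gram pair generation that word2vec (and fastText, at the subword level) starts from:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram training pairs: each word predicts the words
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["saya", "suka", "makan", "nasi"], window=1))
```

Training then pushes each center word's vector toward its context words' vectors (and away from sampled negatives), which is what makes related words end up nearby in the space.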