ALBERT-Persian:
A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language

You can call it little_berty

ALBERT-Persian is the first attempt at an ALBERT model for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 on more than 3.9M documents, 73M sentences, and 1.3B words covering various writing styles and numerous subjects (e.g., scientific, novels, news), the same way we did for ParsBERT.

ALBERT-Persian Playground

Table of Contents:

Goals
- Base Config
Introduction
Results
- Sentiment Analysis (SA) Task
- Text Classification (TC) Task
- Named Entity Recognition (NER) Task
How to use
- PyTorch or TensorFlow 2.0
Models
- Base Config V2.0
  - Albert Model
- Base Config V1.0
  - Albert Model
  - Albert Sentiment Analysis
  - Albert Text Classification
  - Albert NER
NLP Tasks Tutorial :hugs:
Participants
Cite
Questions?
Releases
- Release v2.0 (Feb 17, 2021)
- Release v1.0 (Jul 30, 2020)

Goals

Base Config

The training objectives reached the following values after 140K steps.

***** Eval results *****
global_step = 140000
loss = 2.0080082
masked_lm_accuracy = 0.6141017
masked_lm_loss = 1.9963315
sentence_order_accuracy = 0.985
sentence_order_loss = 0.06908702
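
To make the masked-LM loss easier to interpret, it can be converted to a perplexity. The sketch below assumes that the reported masked_lm_loss is a mean cross-entropy measured in nats (natural logarithm), which this README does not state explicitly. The function name perplexity_from_loss is illustrative and not part of any library.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Convert a mean cross-entropy loss (assumed to be in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Value copied from the eval results above.
masked_lm_loss = 1.9963315

mlm_perplexity = perplexity_from_loss(masked_lm_loss)
print(f"masked-LM perplexity: {mlm_perplexity:.2f}")
```

Under that assumption, a lower perplexity means the model spreads its probability mass over fewer candidate tokens at each masked position.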

Introduction

ALBERT-Persian was trained on a large amount of public corpora (Persian Wikidumps, MirasText) and six other manually crawled text datasets from various types of websites (BigBang Page: scientific; Chetor: lifestyle; Eligasht: itinerary; Digikala: digital magazine; Ted Talks: general conversational; Books: novels, storybooks, and short stories from old to the contemporary era).

Results

The following tables summarize the F1 scores obtained by ALBERT-Persian as compared to other models and architectures.
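
As a reminder of what the numbers mean, here is a minimal sketch of how an F1 score is derived from prediction counts. It shows the plain single-class definition only. This README does not say which averaging scheme (macro, micro, or weighted) was used for the multi-class results, and the function name f1_from_counts is hypothetical.

```python
def f1_from_counts(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    Uses the equivalent closed form 2*TP / (2*TP + FP + FN) and
    returns 0.0 when there are no positive predictions or labels.
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0
    return 2 * true_pos / denominator


# Made-up counts for illustration only, not taken from any dataset below.
score = f1_from_counts(true_pos=8, false_pos=2, false_neg=2)
print(f"F1 = {100 * score:.2f}")
```

The tables report F1 on a 0 to 100 scale, which is why the example multiplies by 100 before printing.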

Sentiment Analysis (SA) Task

Dataset	ALBERT-fa-base-v2	ParsBERT-v1	mBERT	DeepSentiPers
Digikala User Comments	81.12	81.74	80.74	-
SnappFood User Comments	85.79	88.12	87.87	-
SentiPers (Multi Class)	66.12	71.11	-	69.33
SentiPers (Binary Class)	91.09	92.13	-	91.98

Text Classification (TC) Task

Dataset	ALBERT-fa-base-v2	ParsBERT-v1	mBERT
Digikala Magazine	92.33	93.59	90.72
Persian News	97.01	97.19	95.79

Named Entity Recognition (NER) Task

Dataset	ALBERT-fa-base-v2	ParsBERT-v1	mBERT	MorphoBERT	Beheshti-NER	LSTM-CRF	Rule-Based CRF	BiLSTM-CRF
PEYMA	88.99	93.10	86.64	-	90.59	-	84.00	-
ARMAN	97.43	98.79	95.89	89.9	84.03	86.55	-	77.45

If you have tested ALBERT-Persian on a public dataset and want to add your results to the tables above, open a pull request or contact us. Please also make your code available online so we can add it as a reference.

How to use

To use any ALBERT model, you need to install sentencepiece.
Run this in your notebook: !pip install -q sentencepiece
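
Outside a notebook, you may want to check that the package is available before loading the tokenizer. The following is a small sketch using only the Python standard library. The helper name is_installed is an assumption made for illustration and is not part of transformers or sentencepiece.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found without importing it."""
    return importlib.util.find_spec(package) is not None


if not is_installed("sentencepiece"):
    print("sentencepiece is missing; install it with: pip install sentencepiece")
```

Checking up front gives a clearer message than the error raised later when the tokenizer fails to load.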

PyTorch or TensorFlow 2.0

from transformers import AutoConfig, AutoTokenizer
from transformers import AutoModelForMaskedLM  # for pytorch
from transformers import TFAutoModelForMaskedLM  # for tensorflow

config = AutoConfig.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

# for pytorch
model = AutoModelForMaskedLM.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

# for tensorflow
# model = TFAutoModelForMaskedLM.from_pretrained("HooshvareLab/albert-fa-zwnj-base-v2")

text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد می‌توانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
tokenizer.tokenize(text)

>>> Tokenized:
▁ما
▁در
▁هوش
واره
▁معتقدیم
▁با
▁انتقال
▁صحیح
▁دانش
▁و
▁
ا
گاهی
،
▁همه
▁افراد
▁می
[ZWNJ]
توانند
▁از
▁ابزارهای
▁هوشمند
▁استفاده
▁کنند
.
▁شعار
▁ما
▁هوش
▁مصنوعی
▁برای
▁همه
▁است
.
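
The [ZWNJ] token in the tokenized output above stands for the zero-width non-joiner (U+200C), an invisible character that Persian writing uses inside words such as می‌توانند. The sketch below is only an illustration of how to make that character visible in a string. It is not the tokenizer's internal implementation, and the helper name show_zwnj is hypothetical.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def show_zwnj(text: str, marker: str = "[ZWNJ]") -> str:
    """Replace every zero-width non-joiner with a visible marker."""
    return text.replace(ZWNJ, marker)


# The word from the example sentence, built explicitly so the
# invisible character cannot be lost when copying this snippet.
word = "می" + ZWNJ + "توانند"

print(show_zwnj(word))
print("ZWNJ count:", word.count(ZWNJ))
```

This can help when debugging input text: if the count is zero for a word that should contain a ZWNJ, the character was probably stripped or replaced by a space somewhere upstream.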

Models

Base Config V2.0

Albert Model

HooshvareLab/albert-fa-zwnj-base-v2

Base Config V1.0

Albert Model

m3hrdadfi/albert-fa-base-v2

Albert Sentiment Analysis

Albert Text Classification

Albert NER

NLP Tasks Tutorial :hugs:

Notebook	Description
Text Classification	...	soon
Sentiment Analysis	...	soon
Named Entity Recognition	...	soon
Text Generation	...	soon

Participants

See also the list of contributors who participated in this project.

Cite

I have not published a paper about this work yet. Please cite it in your publications as follows:

@misc{ALBERTPersian,
  author = {Hooshvare Team},
  title = {ALBERT-Persian: A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/m3hrdadfi/albert-persian}},
}

Questions?

Post a GitHub issue on the ALBERT-Persian repo.

Releases

Release v2.0 (Feb 17, 2021)

This version is able to handle the zero-width non-joiner (ZWNJ) character used in Persian writing.

Release v1.0 (Jul 30, 2020)

This is the first version of ALBERT-Persian Base!
