Arabic-News-Article-Classification
Automatic document categorization consists of assigning a category to a text based on the information it contains. We compare several supervised machine learning approaches to this task.
Arabic News Article Classification
Based on: Building TALAA, a Free General and Categorized Arabic Corpus
University of Science and Technology Houari Boumediene, Algiers, Algeria
Corpus
"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]
Description of the TALAA corpus [1]:
Features | Corpora |
---|---|
Nb. of articles | 57,827 |
Nb. of categories | 8 |
Nb. of words | 14,068,407 |
Nb. of types | 582,531 |
Nb. of tokens | 15,891,729 |
The corpus is distributed across 8 categories [1]:
Category | Nb. of articles |
---|---|
Culture | 5322 |
Economic | 8768 |
Politics | 9620 |
Religion | 4526 |
Society | 9744 |
Sports | 9103 |
World | 6344 |
Other | 4400 |
Pre-processing
The following data pre-processing steps have been performed:
0. Example:
أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد برفع و إزالة السلع الواردة من السعودية و البحرين و الإمارات و مصر في الذكرى الأولى لإعلان هذه الدول الحصار عليها.
1. Tokenization
Each collected article was segmented into tokens, using NLTK.
[ أمرت, السلطات, القطرية, الأسواق, و, المراكز, التجارية, في, البلاد, ب, رفع, و, إزالة, السلع, الواردة, من, السعودية, و, البحرين, و, الإمارات, و, مصر, في, الذكرى, الأولى, ل, إعلان, هذه, الدول, الحصار, عليها, . ]
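A minimal sketch of this step, assuming NLTK's regex-based `wordpunct_tokenize` (the README only says "NLTK", so the exact tokenizer is an assumption). If NLTK is not installed, the sketch falls back to the same regular expression. Note that this tokenizer does not split attached clitics such as ب or ل, which the example output above shows as separate tokens; that segmentation would need a dedicated Arabic segmenter.

```python
import re

try:
    # Regex-based NLTK tokenizer; it needs no downloaded model data.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (assumption): the pattern that wordpunct_tokenize is built on.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

# Shortened version of the example sentence, for illustration only.
sentence = "أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد."

# Words become individual tokens; the final period becomes its own token.
tokens = wordpunct_tokenize(sentence)
print(tokens)
```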
2. Removing stop words
Stop words were then removed from the tokenized text, using a complete, reviewed list of 750 stop words.
[ أمرت, السلطات, القطرية, الأسواق, المراكز, التجارية, البلاد, رفع, إزالة, السلع, الواردة, السعودية, البحرين, الإمارات, مصر, الذكرى, الأولى, إعلان, الدول, الحصار ]
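The filtering step can be sketched as follows. The `STOPWORDS` set here is a tiny hypothetical sample, not the reviewed 750-word list, and dropping punctuation-only tokens is an assumption inferred from the example output, where the final period disappears.

```python
import string

# Hypothetical mini stop-word list; the project uses a reviewed list of 750 entries.
STOPWORDS = {"و", "في", "من", "ب", "ل", "هذه", "عليها"}

# ASCII punctuation plus a few common Arabic punctuation marks.
PUNCTUATION = set(string.punctuation) | {"،", "؛", "؟"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop words and punctuation-only tokens, preserving token order."""
    return [
        tok for tok in tokens
        if tok not in stopwords and not all(ch in PUNCTUATION for ch in tok)
    ]

tokens = ["أمرت", "السلطات", "و", "المراكز", "في", "البلاد", "ب", "رفع", "."]
filtered = remove_stopwords(tokens)
print(filtered)
```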
3. Stemming
Each word was stemmed using the Farasa Arabic text processing toolkit.
[ أمر, سلطة, قطر, سوق, مركز, تجاري, بلد, رفع, إزالة, سلعة, وارد, سعودية, بحرين, إمارات, مصر, ذكرى, أول, إعلان, دولة, حصار ]
Dataset
Categories = {الجزائر : Algeria, الثقافة : entertainment, الدين : religion, المجتمع : society, الرياضة : sport, العالم : world}

Machine Learning Models
Several machine learning algorithms were evaluated:
Algorithm | Precision | Recall | F-measure |
---|---|---|---|
---|---|---|---|
Decision Tree | 0.82 | 0.84 | 0.83 |
SVM (SGD) | 0.94 | 0.94 | 0.94 |
Naive Bayes | 0.89 | 0.87 | 0.88 |
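For reference, here is a small pure-Python sketch of how precision, recall, and F-measure can be computed for a multi-class classifier. It uses macro averaging (an unweighted mean over categories), which is an assumption: the table does not say which averaging scheme was used. The labels and predictions are made-up toy data, not results from this project.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F-measure)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative

    precisions, recalls, f_measures = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Toy data for illustration only.
y_true = ["sport", "sport", "world", "world", "religion", "religion"]
y_pred = ["sport", "sport", "world", "sport", "religion", "religion"]
precision, recall, f_measure = macro_scores(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f_measure, 2))
```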
Evaluation (Confusion matrix)
Confusion matrix for the best model, an SVM trained with stochastic gradient descent:
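As a reminder of how such a matrix is built, here is a minimal pure-Python sketch. Rows are true categories and columns are predicted categories (a common convention, assumed here). The data is a made-up toy example, not the project's actual predictions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows = true label, columns = predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

# Toy data for illustration only.
labels = ["religion", "sport", "world"]
y_true = ["sport", "sport", "world", "world", "religion", "religion"]
y_pred = ["sport", "sport", "world", "sport", "religion", "religion"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Diagonal entries count correct predictions; off-diagonal entries show which categories are confused with each other.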

TODO
Contributing
Credits
- Teammate: Fawzi TOUATI
- Initial idea and mentor: Pr. Ahmed GUESSOUM
- Mentor: Dr. Riadh BELKEBIR