malay-fake-news-classification
Malay Fake News Classification using CNN, BiLSTM, C-LSTM, RCNN, FT-BERT and BERTCNN.
Malay Fake News Classification using:
1. CNN [3]
2. BiLSTM [4]
3. C-LSTM [5]
4. RCNN [6]
5. FT-BERT [7]
6. BERTCNN (a method unique to this project that feeds the sequence output of the last BERT layer into CNN layers)
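As a rough illustration of the BERTCNN idea, here is a minimal Keras sketch. It uses the Hugging Face transformers library and a multilingual BERT as a stand-in encoder (this project actually uses the Malay BERT from [2]/[7]); the kernel and filter sizes are illustrative assumptions, not the exact configuration used here.

import tensorflow as tf
from tensorflow.keras import layers
from transformers import TFBertModel

MAX_LEN = 128  # standard sequence length in this project

# Stand-in encoder; the project itself uses a pre-trained Malay BERT [2][7]
bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")

input_ids = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Sequence output of the last BERT layer: (batch, MAX_LEN, hidden_size)
sequence_output = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# Convolve over the token dimension with several kernel sizes, as in Kim's CNN [3]
branches = []
for k in (3, 4, 5):  # illustrative kernel sizes
    x = layers.Conv1D(128, k, activation="relu")(sequence_output)
    branches.append(layers.GlobalMaxPooling1D()(x))

merged = layers.Concatenate()(branches)
outputs = layers.Dense(2, activation="softmax")(merged)  # real vs. fake

model = tf.keras.Model([input_ids, attention_mask], outputs)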
The preprocessed Word2Vec files that models 1-4 depend on can be obtained via: https://www.dropbox.com/s/pm9rrynspp16det/malay_word2vec.zip?dl=0 See my "malay-word2vec-tsne" repo for how they were preprocessed.
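For models 1-4, vectors like these are typically turned into an embedding matrix before training. A minimal gensim sketch follows; the model file name and the word_index vocabulary are assumptions for illustration, not the actual contents of the zip.

import numpy as np
from gensim.models import Word2Vec

# Hypothetical file name; the real files come from malay_word2vec.zip
w2v = Word2Vec.load("malay_word2vec.model")

# Toy word->id mapping; in practice this comes from the tokenizer
word_index = {"berita": 1, "palsu": 2}

# Row i of the matrix holds the vector for the word with id i
embedding_matrix = np.zeros((len(word_index) + 1, w2v.vector_size))
for word, idx in word_index.items():
    if word in w2v.wv:  # skip out-of-vocabulary words
        embedding_matrix[idx] = w2v.wv[word]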
This project also produced a filtered Malay fake news dataset, which can be downloaded as malaya_fake_news_preprocessed_dataframe.pkl (available via the link in malay-fake-news-dataset.txt or at https://www.dropbox.com/s/i5yx6e426m8frgs/malaya_fake_news_preprocessed_dataframe.pkl?dl=0). News articles from the original dataset [1] that none of the models classify correctly are treated as outliers and filtered out.
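The filtering logic amounts to dropping every row that all models misclassify. A hedged pandas sketch of that step follows; the prediction columns and the toy frame are hypothetical (the released pickle is the already-filtered result and carries no prediction columns).

import pandas as pd

# Toy frame standing in for the unfiltered dataset; in practice there is
# one hypothetical prediction column per model (CNN, BiLSTM, ..., BERTCNN)
df = pd.DataFrame({
    "label":       [1, 0, 1],
    "pred_cnn":    [1, 1, 1],
    "pred_bilstm": [1, 1, 0],
})
pred_cols = ["pred_cnn", "pred_bilstm"]

# Keep a row if at least one model classified it correctly;
# here the middle row is dropped because both models got it wrong
correct_any = pd.concat([df[c].eq(df["label"]) for c in pred_cols], axis=1).any(axis=1)
df_filtered = df[correct_any].reset_index(drop=True)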
The following Python snippet loads and displays the dataset:
import pandas as pd
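# Load the filtered dataset released by this project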
df_allnews_unpickled = pd.read_pickle("./malaya_fake_news_preprocessed_dataframe.pkl")
df_allnews_unpickled
Column descriptions:
- news: Original news articles that have been cleaned minimally: lowercased, with spaces added around specific symbols and before "hb"/"th", e.g. 4th/13hb -> 4 th/ 13 hb.
- tokens: Tokenized words from the news column, with numbers changed from digits to ordinal spellings.
- rejoined: Sentences rejoined from the tokens column, used mostly for the BERT models since they have their own tokenizer.
- length: Length of sentences based on tokens.
- label: Class label: 1 for real news, 0 for fake news.
- real: One-hot encoding column for real news.
- fake: One-hot encoding column for fake news.
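A quick way to sanity-check these columns after loading the pickle (purely illustrative):

# Peek at the columns and class balance described above
print(df_allnews_unpickled[["news", "tokens", "rejoined", "length"]].head())
print(df_allnews_unpickled["label"].value_counts())  # 1 = real, 0 = fake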
The following experiments/modifications were done before filtering the outliers to achieve the best result/dataset:
- Normal: All fake news articles originally from [1] are considered.
- <1000: Only news articles with fewer than 1000 words are considered, since longer articles are very few in number (a pandas sketch of this and the Trunc128 variant follows this list).
- Trunc128: All news articles are truncated to a maximum sequence length of 128 (the standard for the BERT models in this project).
- Summarized: News articles with more than 200 words are first summarized using TF-IDF scores and a Hopfield Network, then truncated to a sequence length of 128. The summarization method can be found in my "article-summarization" project.
- Filtered: All news articles that none of the models classify correctly are considered outliers and removed from the original dataset.
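As a rough illustration, the "<1000" and "Trunc128" variants reduce to simple pandas operations over the length and tokens columns described above. This is a sketch assuming tokens holds a list of strings; the actual preprocessing code may differ.

# "<1000": keep only articles shorter than 1000 words
df_short = df_allnews_unpickled[df_allnews_unpickled["length"] < 1000]

# "Trunc128": cap token sequences at 128, the BERT sequence length used here
df_trunc = df_short.copy()
df_trunc["tokens"] = df_trunc["tokens"].apply(lambda toks: toks[:128])
df_trunc["rejoined"] = df_trunc["tokens"].apply(" ".join)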
Disclaimer: The "how-to" files may display some outdated results, though the process and methodology shown are accurate.
The work done in this project is part of the following publication:
"A Benchmark Evaluation Study for Malay Fake News Classification Using Neural Network Architectures"
Published in Kazan Digital Week 2020. Methodical and Informational Science Journal, Vestnik NTsBZhD(4), pp. 5-13, 2020.
https://ncbgd.tatarstan.ru/rus/file/pub/pub_2610566.pdf
http://www.vestnikncbgd.ru/index.php?id=1&lang=en
https://kazandigitalweek.com/
The original dataset, toolkit and pre-trained BERT model are provided by:
[1] Zolkepli, Husein. "Malay-Dataset." GitHub - huseinzol05/Malay-Dataset: Text corpus for Bahasa Malaysia. https://github.com/huseinzol05/Malay-Dataset
[2] Zolkepli, Husein. "Malaya." GitHub - huseinzol05/Malaya: Natural-Language-Toolkit for Bahasa Malaysia. https://github.com/huseinzol05/Malaya
The chosen model architectures for this project are applications of the following papers:
[3] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).
[4] Nowak, Jakub, Ahmet Taspinar, and Rafał Scherer. "LSTM recurrent neural networks for short text and sentiment classification." In International Conference on Artificial Intelligence and Soft Computing, pp. 553-562. Springer, Cham, 2017.
[5] Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis Lau. "A C-LSTM neural network for text classification." arXiv preprint arXiv:1511.08630 (2015).
[6] Lai, Siwei, Liheng Xu, Kang Liu, and Jun Zhao. "Recurrent convolutional neural networks for text classification." In Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
[7] Devlin, Jacob. "GitHub - google-research/bert: TensorFlow code and pre-trained models for BERT." GitHub.com. https://github.com/google-research/bert (accessed March 09, 2020).