Pipeline for detecting fake news, covering data ingestion, doc embedding, classifier hypertuning & model ensembling. Quick walkthrough available in README. Execution logs on my FloydHub page.
migrated from this repo of mine
Fake News Detection Pipeline
Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang
Group project materials for fake news detection at Hollis Lab, GEC Academy
Project Plan
Table of Contents
- Fake News Detection Pipeline
  - Collaborators Shuheng Liu, Qiaoyi Yin, Yuyuan Fang
  - Project Plan
- Notice for Collaborators
  - Doing Train-Test Split
  - Directory to Push Models
- Downloadables
  - URL for Different Embeddings Precomputed on Cloud
  - Hypertuning Logs, Codes, and Stats
- Quick Walkthrough (Presentation)
  - Infrastructure for Embeddings
- Embedding Computation
  - URLs
- Embedding Visualization
  - 2D T-SNE
  - 3D T-SNE
  - Visualizing Bigram Statistics
- Binary Classification
  - Train-Val-Test Split
  - Hypertuned Classifiers
  - Histogram of CV/Test Scores
  - TF-IDF
  - Feature Ranking with Logistic Coefficients
  - Ensemble Learning
Notice for Collaborators
Doing Train-Test Split
Specifying random_state in sklearn.model_selection.train_test_split() ensures the same split across different datasets (of the same length) and across different machines. (See this link.)
For the purposes of this project, we use random_state=58 for each split.
While grid/random searching for the best set of hyperparameters, a 75%-25% train-test split is used; 5-fold cross-validation is then applied in the training phase on the 75% training samples, as sketched below.
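A minimal sketch of this convention (with dummy data; the real split on the news embeddings appears later in the walkthrough):

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 58  # fixed random_state used for every split in this project

# dummy stand-ins for the real embeddings and labels, for illustration only
X = np.random.rand(100, 300)
y = np.random.randint(0, 2, size=100)

# 75%-25% split; the 75% portion later feeds 5-fold cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y)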
Directory to Push Models
There is a model/ directory nested under the project. Please name your model as model_name.py and place it under the model/ directory (e.g. model/KNN.py) before pushing to this repo.
Downloadables
Before trying to reproduce our results, please note that pre-computed embeddings can be downloaded from the URLs below. Consider downloading them and storing them in the pretrained/ folder under this repository, which will save a lot of time (a loading sketch follows the list below).
URL for Different Embeddings Precomputed on Cloud
- all computed embeddings and labels, see list below
- onehot title & text (sparse matrix), scorer: raw-count
- onehot title & text (sparse matrix), scorer: raw-count, L2-normalized
- onehot title & text (sparse matrix), scorer: tfidf
- onehot title & text (sparse matrix), scorer: tfidf, L2-normalized
- naive doc2vec title, normalizer: {L2, mean, None}
- naive doc2vec text, normalizer: {L2, mean, None}
- doc2vec title, window_size: 13, min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- doc2vec text, window_size: 13, min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- doc2vec title, window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried
- doc2vec text, window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried
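Once the files are in the pretrained/ folder, they can be loaded with the EmbeddingLoader used later in the walkthrough; a minimal sketch (the calls mirror their usage further down):

from embedding_utils import EmbeddingLoader

loader = EmbeddingLoader("pretrained/")
labels = loader.get_label()
# doc2vec embedding of concatenated title & text, DBOW, window 23, 500 epochs
d2v = loader.get_d2v("concat", vec_size=300, win_size=23, min_count=5, dm=0, epochs=500)
# one-hot (sparse) embedding scored with tf-idf
onehot_tfidf = loader.get_onehot(corpus="concat", scorer="tfidf")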
Hypertuning Logs, Codes, and Stats
The logs, codes, and stats of hypertuning all simple models (that is, excluding the ensemble model) can be found here.
Quick Walkthrough (Presentation)
Below is the final presentation, originally implemented as a Jupyter notebook. To see the original presentation file, run one of the following commands in your terminal:
git log -- "UCB Final Project.ipynb"
or,
git checkout f7e1c41
Alternatively, visit this link which takes you back in history.
Infrastructure for Embeddings
The following classes, DocumentSequence and DocumentEmbedder, can be found in the sub-package doc_utils/. Different ways of computing embeddings (doc2vec, naive doc2vec, one-hot) and their choices of hyperparameters are encapsulated in these files. Below is a snapshot of these classes and their methods.
class DocumentSequence:
    def __init__(self, raw_docs, clean=False, sw=None, punct=None): ...

    # setters (only to be called internally)
    def _set_tokenized(self, clean=False, sw=None, punct=None): ...
    def _set_tagged(self): ...
    def _set_dictionary(self): ...
    def _set_bow(self): ...

    # getters (exposed)
    def get_dictionary(self): ...
    dictionary = property(get_dictionary)  # property field of get_dictionary()

    def get_tokenized(self): ...
    tokenized = property(get_tokenized)  # property field of get_tokenized()

    def get_tagged(self): ...
    tagged = property(get_tagged)  # property field of get_tagged()

    def get_bow(self): ...
    bow = property(get_bow)  # property field of get_bow()


class DocumentEmbedder:
    def __init__(self, docs: DocumentSequence, pretrained_word2vec=None): ...

    # setters (only to be called internally)
    def _set_word2vec(self): ...
    def _set_doc2vec(self, vector_size=300, window=5, min_count=5, dm=1, epochs=20): ...
    def _set_naive_doc2vec(self, normalizer='l2'): ...
    def _set_tfidf(self): ...
    def _set_onehot(self, scorer='tfidf'): ...

    # getters (exposed)
    def get_onehot(self, scorer='tfidf'): ...
    onehot = property(get_onehot)  # property field of get_onehot()

    def get_doc2vec(self, vectors_size=300, window=5, min_count=5, dm=1, epochs=20): ...
    doc2vec = property(get_doc2vec)  # property field of get_doc2vec()

    def get_naive_doc2vec(self, normalizer='l2'): ...
    naive_doc2vec = property(get_naive_doc2vec)  # property field of get_naive_doc2vec()

    def get_tfidf_score(self): ...
    tfidf = property(get_tfidf_score)  # property field of get_tfidf_score()
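Because the getters are wrapped in property fields, they can also be accessed as attributes; for example, once an instance such as texts (built in the next cell) exists:

bow_a = texts.get_bow()  # explicit getter call
bow_b = texts.bow        # equivalent property access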
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords
df = pd.read_csv("./fake_or_real_news.csv")
# obtain the raw news texts and titles
raw_text = df['text'].values
raw_title = df['title'].values
df['label'] = df['label'].apply(lambda label: 1 if label == "FAKE" else 0)
# build two instances for preprocessing raw data
from doc_utils import DocumentSequence
texts = DocumentSequence(raw_text, clean=True, sw=stopwords.words('english'), punct=punctuation)
titles = DocumentSequence(raw_title, clean=True, sw=stopwords.words('english'), punct=punctuation)
df.head()
 | Unnamed: 0 | title | text | label | title_vectors
---|---|---|---|---|---
0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 1 | [ 1.1533764e-02 4.2144405e-03 1.9692603e-02 ... |
1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 1 | [ 0.11267698 0.02518966 -0.00212591 0.021095... |
2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 0 | [ 0.04253004 0.04300297 0.01848392 0.048672... |
3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 1 | [ 0.10801624 0.11583211 0.02874823 0.061732... |
4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 0 | [ 1.69016439e-02 7.13498285e-03 -7.81233795e-... |
Embedding Computation
URLs
- all computed embeddings and labels, see list below
- onehot title & text (sparse matrix), scorer: raw-count
- onehot title & text (sparse matrix), scorer: raw-count, L2-normalized
- onehot title & text (sparse matrix), scorer: tfidf
- onehot title & text (sparse matrix), scorer: tfidf, L2-normalized
- naive doc2vec title, normalizer: {L2, mean, None}
- naive doc2vec text, normalizer: {L2, mean, None}
- doc2vec title, window_size: 13, min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- doc2vec text, window_size: 13, min_count:{5, 25, 50}, strategy: {DM, DBOW}, epochs: 100; all six combinations tried
- doc2vec title, window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried
- doc2vec text, window_size: {13, 23}, min_count: 5, strategy: DBOW, epochs: {200, 500}; all four combinations tried
import numpy as np  # needed for np.concatenate below

from doc_utils import DocumentEmbedder

try:
    from embedding_utils import EmbeddingLoader

    loader = EmbeddingLoader("pretrained/")
    news_embeddings = loader.get_d2v("concat", vec_size=300, win_size=23, min_count=5, dm=0, epochs=500)
    labels = loader.get_label()
except FileNotFoundError as e:
    print(e)
    print("Cannot find existing embeddings, computing new ones now")

    pretrained = "./pretrained/GoogleNews-vectors-negative300.bin"
    text_embedder = DocumentEmbedder(texts, pretrained_word2vec=pretrained)
    title_embedder = DocumentEmbedder(titles, pretrained_word2vec=pretrained)
    text_embeddings = text_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    title_embeddings = title_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)

    # concatenate title vectors and text vectors
    news_embeddings = np.concatenate((title_embeddings, text_embeddings), axis=1)
    labels = df['label'].values
Embedding Visualization
from embedding_utils import visualize_embeddings

# visualize the news embeddings in TensorBoard
# MUST run "tensorboard --logdir visual/" in the command line and visit localhost:6006 to see the visualization
visualize_embeddings(embedding_values=news_embeddings, label_values=labels, texts=raw_title)
print("visit https://localhost:6006 to see the result")
# ATTENTION: This cell must be manually stopped
visit https://localhost:6006 to see the result
Some screenshots of the TensorBoard are shown below. We visualize the document embeddings with T-SNE projections onto 3D and 2D spaces. Each red data point indicates a piece of FAKE news, and each blue one indicates a piece of real news. As the visualization shows, the two categories are well separated. (A hypothetical static T-SNE sketch, for viewing outside TensorBoard, follows the screenshots.)
2D T-SNE
red for fake ones, blue for real ones
3D T-SNE
red for fake ones, blue for real ones
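If you prefer a static plot outside TensorBoard, a minimal sketch with scikit-learn's TSNE could look like the following (hypothetical; not part of the original pipeline):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# project the concatenated news embeddings onto 2D (illustration only)
points = TSNE(n_components=2, random_state=58).fit_transform(news_embeddings)
colors = ["red" if y == 1 else "blue" for y in labels]  # 1 = FAKE, 0 = REAL
plt.scatter(points[:, 0], points[:, 1], c=colors, s=2)
plt.title("2D T-SNE of news embeddings (red = fake, blue = real)")
plt.show()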
Visualizing Bigram Statistics
import itertools
import nltk
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

## Get tokenized words of fake news and real news independently
real_text = df[df['label'] == 0]['text'].values
fake_text = df[df['label'] == 1]['text'].values

sw = [word for word in stopwords.words("english")] + ["``", "“"]
other_puncts = u'.,;《》?!“”‘’@#¥%…&×()——+【】{};;●,。&~、|\s::````'
punct = punctuation + other_puncts

fake_words = DocumentSequence(fake_text, clean=True, sw=sw, punct=punct)
real_words = DocumentSequence(real_text, clean=True, sw=sw, punct=punct)

## Get cleaned text using chain
real_words_all = list(itertools.chain(*real_words.get_tokenized()))
fake_words_all = list(itertools.chain(*fake_words.get_tokenized()))

## Drawing histogram
def plot_most_common_words(num_to_show, words_list, title=""):
    bigrams = nltk.bigrams(words_list)
    counter = Counter(bigrams)
    labels = [" ".join(e[0]) for e in counter.most_common(num_to_show)]
    values = [e[1] for e in counter.most_common(num_to_show)]
    indexes = np.arange(len(labels))
    width = 1
    plt.title(title)
    plt.barh(indexes, values, width)
    plt.yticks(indexes + width * 0.2, labels)
    plt.show()

plot_most_common_words(20, fake_words_all, "Fake News Most Frequent words")
plot_most_common_words(20, real_words_all, "Real News Most Frequent words")
Binary Classification
Train-Val-Test Split
(with 75% of data for 5-fold Random CV, 25% for testing)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.model_selection._search import BaseSearchCV
import pickle as pkl
seed = 58
# perform the split which gets us the train data and the test data
news_train, news_test, labels_train, labels_test = train_test_split(news_embeddings, labels,
                                                                    test_size=0.25,
                                                                    random_state=seed,
                                                                    stratify=labels)
Hypertuned Classifiers
We used randomized search (RandomizedSearchCV) on the different embedding datasets to find good hyper-parameters.
The following exhibits each classifier with the near-optimal parameters found in our experiments.
The random search process itself is omitted here; a sketch of what it could look like is given below.
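For reference, a minimal sketch of one such search (the parameter ranges here are illustrative assumptions, not the ones we actually searched):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# illustrative search space only
param_distributions = {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e0)}

search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5,
                            random_state=seed, n_jobs=-1)
search.fit(news_train, labels_train)  # 75% training split from above, 5-fold CV inside
print(search.best_params_, search.best_score_)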
from model.hypyertuned_models import mlp, knn, qda, gdb, svc, gnb, rf, lg
from model.hypyertuned_models import classifiers as classifiers_list
We list the best-performing hyperparameters in the following chart.
from sklearn.metrics import classification_report

# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)

    # Report the metrics
    target_names = ['Real', 'Fake']
    print(model.__class__.__name__)
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
MLPClassifier
precision recall f1-score support
Real 0.956 0.950 0.953 793
Fake 0.950 0.956 0.953 791
avg / total 0.953 0.953 0.953 1584
KNeighborsClassifier
precision recall f1-score support
Real 0.849 0.905 0.876 793
Fake 0.898 0.838 0.867 791
avg / total 0.874 0.872 0.872 1584
QuadraticDiscriminantAnalysis
precision recall f1-score support
Real 0.784 0.995 0.877 793
Fake 0.993 0.726 0.839 791
avg / total 0.889 0.860 0.858 1584
GradientBoostingClassifier
precision recall f1-score support
Real 0.921 0.868 0.894 793
Fake 0.875 0.925 0.899 791
avg / total 0.898 0.896 0.896 1584
SVC
precision recall f1-score support
Real 0.944 0.939 0.942 793
Fake 0.940 0.944 0.942 791
avg / total 0.942 0.942 0.942 1584
GaussianNB
precision recall f1-score support
Real 0.848 0.793 0.820 793
Fake 0.805 0.857 0.830 791
avg / total 0.826 0.825 0.825 1584
RandomForestClassifier
precision recall f1-score support
Real 0.868 0.805 0.835 793
Fake 0.817 0.877 0.846 791
avg / total 0.843 0.841 0.841 1584
LogisticRegression
precision recall f1-score support
Real 0.921 0.929 0.925 793
Fake 0.929 0.920 0.924 791
avg / total 0.925 0.925 0.925 1584
Histogram of CV/Test Scores
TF-IDF
Getting sparse matrix
def bow2sparse(tfidf, corpus):
    rows = [index for index, line in enumerate(corpus) for _ in tfidf[line]]
    cols = [elem[0] for line in corpus for elem in tfidf[line]]
    data = [elem[1] for line in corpus for elem in tfidf[line]]
    return csr_matrix((data, (rows, cols)))
from gensim import corpora, models
from scipy.sparse import csr_matrix
tfidf = models.TfidfModel(texts.get_bow())
tfidf_matrix = bow2sparse(tfidf, texts.get_bow())
## split the data
news_train, news_test, labels_train, labels_test = train_test_split(tfidf_matrix,
                                                                    labels,
                                                                    test_size=0.25,
                                                                    random_state=seed)
dictionary is not set for <tools.DocumentSequence object at 0x11766bac8>, setting dictionary automatically
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# LogisticRegression
lg = LogisticRegression(C=104.31438384172546, penalty='l2')
# Naive Bayes
nb = MultinomialNB(alpha=0.01977091215797838)
classifiers_list = [lg, nb]

from sklearn.metrics import classification_report

# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)

    # Report the metrics
    target_names = ['Real', 'Fake']
    print(str(model))
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
LogisticRegression(C=104.31438384172546, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
precision recall f1-score support
Real 0.964 0.913 0.938 820
Fake 0.912 0.963 0.937 764
avg / total 0.939 0.938 0.938 1584
MultinomialNB(alpha=0.01977091215797838, class_prior=None, fit_prior=True)
precision recall f1-score support
Real 0.899 0.930 0.914 820
Fake 0.922 0.887 0.905 764
avg / total 0.910 0.910 0.910 1584
Feature Ranking with Logistic Coefficients
# LogisticRegression
lg = LogisticRegression(C=104.31438384172546, penalty='l2')

# Using the whole data set
lg.fit(tfidf_matrix, labels)

# map the coefficients to words and sort by coefficient
abs_features = []
num_features = tfidf_matrix.shape[1]  # one coefficient per word (column), not per document
for i in range(num_features):
    coef = lg.coef_[0, i]
    abs_features.append((coef, texts.get_dictionary()[i]))

sorted_result = sorted(abs_features, reverse=True)
fake_importance = [x for x in sorted_result if x[0] > 3]
real_importance = [x for x in sorted_result if x[0] < -4]
from wordcloud import WordCloud, STOPWORDS

def print_wordcloud(df, title=''):
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=1200, height=1000).generate(
        " ".join(df['word'].values))
    plt.imshow(wordcloud)
    plt.title(title)
    plt.axis('off')
    plt.show()
Words with inclination to predict 'FAKE' news
df2 = pd.DataFrame(fake_importance, columns=['importance', 'word'])
df2.head(30)
 | importance | word
---|---|---
0 | 13.781102 | 0 |
1 | 13.562957 | 2016 |
2 | 13.490582 | october |
3 | 13.062496 | hillary |
4 | 11.192181 | ‘ |
5 | 9.829864 | article |
6 | 9.411360 | election |
7 | 8.903777 | november |
8 | 8.181044 | share |
9 | 7.564924 | |
10 | 7.507189 | source |
11 | 7.418819 | via |
12 | 7.150410 | fbi |
13 | 6.939386 | establishment |
14 | 6.752492 | us |
15 | 6.549759 | please |
16 | 6.421927 | 28 |
17 | 6.111584 | wikileaks |
18 | 5.914297 | russia |
19 | 5.777677 | 4 |
20 | 5.701762 | › |
21 | 5.701082 | |
22 | 5.633363 | war |
23 | 5.461951 | corporate |
24 | 5.432547 | 26 |
25 | 5.248264 | photo |
26 | 5.205658 | 1 |
27 | 5.178585 | healthcare |
28 | 5.066447 | |
29 | 5.055815 | free |
print_wordcloud(df2,'FAKE NEWS')
Words with inclination to predict 'REAL' news
df3 = pd.DataFrame(real_importance, columns=['importance', 'word'])
df3.tail(30)
 | importance | word
---|---|---
48 | -5.819761 | march |
49 | -5.820939 | state |
50 | -5.911077 | attacks |
51 | -5.911102 | deal |
52 | -5.918800 | monday |
53 | -5.937717 | saturday |
54 | -6.068661 | president |
55 | -6.108548 | conservatives |
56 | -6.197634 | sanders |
57 | -6.316225 | continue |
58 | -6.577535 | `` |
59 | -6.595120 | polarization |
60 | -6.629481 | fox |
61 | -6.644741 | gop |
62 | -6.681231 | ohio |
63 | -6.899471 | convention |
64 | -7.051062 | jobs |
65 | -7.260832 | debate |
66 | -7.274652 | friday |
67 | -7.580725 | tuesday |
68 | -7.847131 | cruz |
69 | -8.058610 | candidates |
70 | -8.348688 | conservative |
71 | -8.440797 | says |
72 | -8.828907 | islamic |
73 | -10.438137 | — |
74 | -10.851531 | -- |
75 | -14.864650 | '' |
76 | -14.912260 | said |
77 | -16.351588 | 's |
print_wordcloud(df3,'REAL NEWS')
Ensemble Learning
In addition, we used an ensemble vote classifier on the training data to try to obtain better predictions through ensemble learning; a rough sketch of the voting idea follows.
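EnsembleVoter fits each classifier on its own feature representation and combines their test-time predictions by voting. A simplified sketch of the hard-voting idea (our assumption about its internals, not the actual implementation):

import numpy as np

def hard_vote(classifiers, Xs_test):
    # each classifier predicts on its own test features; the majority 0/1 label wins
    preds = np.stack([clf.predict(X) for clf, X in zip(classifiers, Xs_test)])
    return (preds.mean(axis=0) >= 0.5).astype(int)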
from model.ensemble_learning import EnsembleVoter
d2v_500 = loader.get_d2v(corpus="concat", win_size=23, epochs=500)
d2v_100 = loader.get_d2v(corpus="concat", win_size=13, epochs=100)
onehot = loader.get_onehot(corpus="concat", scorer="tfidf")
labels = loader.get_label()
d2v_500_train, d2v_500_test, d2v_100_train, d2v_100_test, onehot_train, onehot_test, labels_train, labels_test = \
train_test_split(d2v_500, d2v_100, onehot, labels, test_size=0.25, stratify=labels, random_state=seed)
classifiers = [mlp, svc, qda, lg]
Xs_train = [d2v_500_train, d2v_100_train, d2v_100_train, onehot_train]
Xs_test = [d2v_500_test, d2v_100_test, d2v_100_test, onehot_test]
ens_voter = EnsembleVoter(classifiers, Xs_train, Xs_test, labels_train, labels_test)
ens_voter.fit()
print("Test score of EnsembleVoter: ", ens_voter.score())
Test score of MLPClassifier: 0.9526515151515151
Test score of SVC: 0.9425505050505051
Test score of QuadraticDiscriminantAnalysis: 0.9463383838383839
Test score of LogisticRegression: 0.9513888888888888
Fittng aborted because all voters are fitted and not using refit=True
Test score of EnsembleVoter: 0.963901203293