Fake News Detection

Information Retrieval and Text Mining project

News Insight
News Classification
Text Regression

Presentation Video

https://www.youtube.com/watch?v=9PFZ0_C2Sxo&feature=share

Topic: Fake News Analysis and Insight

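Collect fake news datasets from various sources
What information can be extracted from fake or real news?
- Which methods should be used to analyze or compare them?
What features distinguish fake news from real news?
- How can features or keywords be extracted?
- Sentiment words / sentiment analysis that may be useful
- Use part of speech to find the words commonly used in fake and real news, e.g. word clouds
- Semantic analysis
Fake news classification and rating
- Characteristics; alerting users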

Problem Description

https://docs.google.com/document/d/10-7H9bPJYQRMdOUdugDlWeifdpvoN9twXZGT-m1fhdc/edit?usp=sharing

Implementation Report

https://docs.google.com/document/d/1I9SWihDkgXx1NCYCsY-0e_XDicAK346PqQu5wMaesd0/edit

Presentation slides

https://docs.google.com/presentation/d/1lRDR40UfcLpdRUSnfMbi6eOsR_jjxFdOcKwa8HvxHh8/edit#slide=id.p

Outline:

Motivation
What we do
solution insight
solution regression / classification

Motivation & Goal

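Motivation: why do this? Because fake news is rampant, influences audiences, and steers public opinion during elections.

Degree of fakeness
Differences between real and fake news
Types of (fake) news

Compare the performance of different methods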

Solution

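TF-IDF, used for tagging
POS (part-of-speech tagging), e.g. openNLP, NLTK => a. which words appear most often for each part of speech in each dataset; b. which words to include, by part of speech, in a dictionary built from the overall dataset
Sentiment analysis, e.g. TextBlob
Feature selection: keywords, class discriminative power
Usefulness of author and source; differences between categories
Regression (ML and DL methods) / classification (IR methods)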
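
As a rough illustration of the TF-IDF step listed above, the sketch below computes TF-IDF weights using only the Python standard library. The function name `tf_idf`, the toy documents, and the particular weighting formula (term count divided by document length, times `log(N / df) + 1`) are assumptions made for this example; the project may well have relied on a library implementation with a different formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: tf = count / document length,
    idf = log(N / df) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already lowercased and tokenized.
docs = [
    "the election was rigged".split(),
    "the election results were certified".split(),
]
scores = tf_idf(docs)
```

Terms that occur in only one document (such as "rigged") receive a higher weight than terms shared by all documents (such as "the"), which is the property that makes TF-IDF useful for picking dictionary terms.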

GOAL

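With the same dictionary size: what score does a dictionary built without POS filtering reach, and what score does a POS-specific dictionary reach? E.g. the score of a noun-only dictionary, or the accuracy of a verb-only dictionary.
The insights from the earlier analysis can be related to the dictionary produced at the end.
Degree of fakeness and classification: each task's dataset serves as the testing dataset for the other.
Split time into three or five periods: before the election, within one week before or after the election, and after the election; track changes in topic, word usage, and sentiment.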
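
The three-period time split mentioned above could be sketched as follows. The function name `election_period`, the period labels, and the example election date are hypothetical and chosen only for illustration; the notes do not specify how dates were actually bucketed.

```python
from datetime import date, timedelta


def election_period(published, election_day, window_days=7):
    """Label a publication date relative to an election date."""
    window = timedelta(days=window_days)
    if published < election_day - window:
        return "pre-election"
    if published > election_day + window:
        return "post-election"
    # Within window_days before or after the election.
    return "election-week"


# Hypothetical election date, used only for illustration.
election_day = date(2016, 11, 8)
periods = [
    election_period(date(2016, 10, 1), election_day),
    election_period(date(2016, 11, 10), election_day),
    election_period(date(2016, 12, 1), election_day),
]
```

Grouping articles by the returned label would then allow topic, word-usage, and sentiment statistics to be compared across periods. A five-period split would need additional boundaries, which are not defined in these notes.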

Dataset

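Ten categories in total (eight from the second dataset, two from the first dataset): the true and mostly-true labels of the third dataset are merged into the true class of the first dataset; the barely-true, false, and pants-fire labels of the third dataset are merged into the false class of the first dataset.
Remove punctuation and digits, convert uppercase to lowercase, and keep only content (the longest attribute) and label (degree of fakeness or category).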
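
A minimal sketch of the label merging and text cleaning described above, using only the standard library. The names `LABEL_MERGE`, `clean_text`, and `merge_label` are hypothetical. The sketch assumes English text, so any character outside a-z is dropped, and it leaves labels that the notes do not mention unmapped.

```python
import re

# Mapping stated in the notes; labels not listed there stay unmapped.
LABEL_MERGE = {
    "true": "true",
    "mostly-true": "true",
    "barely-true": "false",
    "false": "false",
    "pants-fire": "false",
}

# Assumption: English text, so everything except a-z and whitespace is removed.
_NON_LETTERS = re.compile(r"[^a-z\s]")


def clean_text(text):
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    lowered = text.lower()
    letters_only = _NON_LETTERS.sub(" ", lowered)
    return " ".join(letters_only.split())


def merge_label(label):
    """Return the merged binary label, or None when no mapping is given."""
    return LABEL_MERGE.get(label.lower())


cleaned = clean_text("Breaking: 3 Senators RESIGN!!")
merged = merge_label("Mostly-True")
```

Returning `None` for unlisted labels makes it explicit which rows still need a decision before the datasets are combined.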

Merged text/label dataset built from the three datasets: https://drive.google.com/drive/u/2/folders/19CER5SrMU29n3UPAkQc2hPu3HA8vyqbc

Method

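Currently only the news content is considered.

POS for each of the ten categories and for the overall dataset: https://drive.google.com/drive/folders/1C-6U9TcyUwgxzdArvAXPsnjx9yrPhxsh?usp=sharing
Bar charts of sentiment analysis for the ten categories. Literature review: papers on the packages used for POS, sentiment, feature selection, classification, regression, etc.
Word clouds and frequency charts for the ten categories => build an overall one and filter out the terms common to every category.
3 kinds of feature selection, plus TF-IDF, for building the overall dictionary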
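
One way to read the idea of filtering out terms that are common to every category before drawing word clouds or frequency charts is sketched below. The function name `distinctive_terms` and the toy documents are hypothetical, and this simple set-intersection filter is only an assumption; it is not one of the three feature selection methods, which the notes do not name.

```python
from collections import Counter


def distinctive_terms(docs_by_label, top_n=10):
    """Return the top_n most frequent terms per label, excluding terms
    that occur in every label."""
    if not docs_by_label:
        return {}
    counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in docs_by_label.items()
    }
    # Terms present in all categories carry little class information.
    common = set.intersection(*(set(c) for c in counts.values()))
    return {
        label: [t for t, _ in counter.most_common() if t not in common][:top_n]
        for label, counter in counts.items()
    }


# Hypothetical tokenized documents grouped by label.
docs_by_label = {
    "true": [["senate", "passes", "the", "bill"], ["the", "bill", "is", "signed"]],
    "false": [["shocking", "the", "secret", "plot"], ["the", "shocking", "hoax"]],
}
top_terms = distinctive_terms(docs_by_label, top_n=2)
```

In this toy example "the" appears in both categories and is dropped, so the remaining lists contain only category-specific terms.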

The bs category carries little meaning.

testing Kaggle: https://www.kaggle.com/c/fake-news/submit

(test how well the classifier and the regressor perform)

Possible Dataset:

https://www.kaggle.com/c/fake-news/data (title, author, text, true/false; news articles collected by crawling) =>
https://github.com/KaiDMML/FakeNewsNet/tree/master/Data (news source, headline, image, body_text, publish_data, etc.; contains both real and fake news; crawled news)
https://www.kaggle.com/mrisdal/fake-news (uuid: unique identifier; ord_in_thread; author: author of story; published: date published; title: title of the story; text: text of story; language: data from webhose.io; crawled: date the story was archived; site_url: site URL from BS detector; country: data from webhose.io; domain_rank: data from webhose.io; thread_title; spam_score: data from webhose.io; main_img_url: image from story; replies_count: number of replies; participants_count: number of participants; likes: number of Facebook likes; comments: number of Facebook comments; shares: number of Facebook shares; type: type of website (label from BS detector)) https://github.com/bs-detector/bs-detector
https://github.com/GeorgeMcIntire/fake_real_news_dataset (csv file and contains 1000s of articles tagged as either real or fake)
https://www.cs.ucsb.edu/~william/data/liar_dataset.zip (graded degrees of fakeness; UCSB) (statement, speaker, context, label, src)
https://www.kaggle.com/jruvika/fake-news-detection (URLs, Headline, Body, Label (T/F))
https://www.kaggle.com/c/fake-news-pair-classification-challenge/data (fake news classification)
https://github.com/JasonKessler/fakeout (a complete project)
https://github.com/FakeNewsChallenge/fnc-1 (a past competition)
tweets: https://www.nbcnews.com/tech/social-media/now-available-more-200-000-deleted-russian-troll-tweets-n844731
datasets: https://data.world/datasets/fake-news , https://github.com/sumeetkr/AwesomeFakeNews
preprocess ref: https://www.kaggle.com/rchitic17/fake-news , https://www.kaggle.com/michaleczuszek/fake-news-analysis

Motivation Reference

https://www.ithome.com.tw/news/127214?fbclid=IwAR0oKz7wm0Ub0Kb5FDh9HAvjKX5tgidTtZrFRSY_kVsgQrue5_-K-5iSC-o
https://www.ithome.com.tw/news/127201?fbclid=IwAR3_vIk3Pdvsem1d_uAWyaiZHUj8C51JLzene9jYOtc50KL31xgEHiHYfLQ

Possible Goal:

Help users judge whether news is real or fake
Understand fake news patterns, word-usage characteristics, and article features
News classification
Words commonly used in real and fake news
Insights from crawled posts ( https://shift.newco.co/2016/11/09/What-I-Discovered-About-Trump-and-Clinton-From-Analyzing-4-Million-Facebook-Posts/ )
Analysis ( https://towardsdatascience.com/i-trained-fake-news-detection-ai-with-95-accuracy-and-almost-went-crazy-d10589aa57c , http://nbviewer.jupyter.org/github/JasonKessler/fakeout/blob/master/Fake%20News%20Analysis.ipynb )

REF

Topic reference 1: http://www.im.ntu.edu.tw/~paton/courses.htm
Topic reference 2: https://mega.nz/#!xwdEgAjb!FAVoAznYD7bE5rsoXc7isRJUlAbF0m8mamYe2RiCwMM
Topic reference 3: https://mega.nz/#!UlNmXQIS!7dZhNx0Cy9-VyjlEI5GUO5zjIgYNJoe9dUAPaCNcowA
Word cloud: https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis
TextBlob sentiment analysis: https://nlp.stanford.edu/courses/cs224n/2009/fp/24.pdf (uses the NLTK movie_reviews corpus as training data) (https://stackoverflow.com/questions/34518570/how-are-sentiment-analysis-computed-in-blob/34519114#34519114)
NLTK POS tagging (pos_tag): https://explosion.ai/blog/part-of-speech-pos-tagger-in-python (Greedy Averaged Perceptron tagger?) (training data: Sections 00-18 of the Wall Street Journal sections of OntoNotes 5) (https://stackoverflow.com/questions/32016545/how-does-nltk-pos-tag-work)

Datasets for sentiment analysis are available online.[1][2]

The following is a list of a few open source sentiment analysis tools.

GATE plugins
SEAS(gsi-upm/SEAS)
SAGA(gsi-upm/SAGA)
Stanford Sentiment Analysis Module (Deeply Moving: Deep Learning for Sentiment Analysis)
LingPipe (Sentiment Analysis Tutorial)
TextBlob (Tutorial: Quickstart)[3]
Opinion Finder (OpinionFinder | MPQA)
Clips pattern.en (pattern.en | CLiPS)

Open Source Dictionary or resources:

SentiWordNet
Bing Liu Datasets (Opinion Mining, Sentiment Analysis, Opinion Extraction)
General Inquirer Dataset (General Inquirer Categories)
MPQA Opinion Corpus (MPQA Resources)
WordNet-Affect (WordNet Domains)
SenticNet
Emoji Sentiment Ranking

Literature review: how others approached the problem

Direction: text classification or degree-of-fakeness regression

Text classification

A novel text mining approach based on TF-IDF and Support Vector Machine for news classification https://ieeexplore.ieee.org/abstract/document/7569223
TEXT CLASSIFICATION USING NAÏVE BAYES, VSM AND POS TAGGER https://pdfs.semanticscholar.org/43d0/0d394ff76c0a5426c37fe072038ac7ec7627.pdf
Text categorization with Support Vector Machines: Learning with many relevant features https://link.springer.com/content/pdf/10.1007%2FBFb0026683.pdf
Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles: http://snap.stanford.edu/mis2/files/MIS2_paper_2.pdf
