nlp_tasks
Natural Language Processing Tasks and Selected References
I've been working on several natural language processing tasks for a long time. One day, I felt like drawing a map of the NLP field where I earn a living. I'm sure I'm not the only person who wants to see at a glance which tasks are in NLP.
I did my best to cover as many NLP tasks as possible, but admittedly this is far from exhaustive, purely due to my lack of knowledge. The selected references are biased towards recent deep learning accomplishments. I expect them to serve as a starting point when you're about to dig into a task. I'll keep updating this repo myself, but what I really hope is that you will collaborate on this work. Don't hesitate to send me a pull request!
Oct. 13, 2017.
by Kyubyong
Reviewed and updated by YJ Choe on Oct. 18, 2017.
Anaphora Resolution
- See Coreference Resolution
Automated Essay Scoring
- PAPER: Automatic Text Scoring Using Neural Networks
- PAPER: A Neural Approach to Automated Essay Scoring
- CHALLENGE: Kaggle: The Hewlett Foundation: Automated Essay Scoring
- PROJECT: EASE (Enhanced AI Scoring Engine)
Automatic Speech Recognition
- WIKI: Speech recognition
- PAPER: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- PAPER: WaveNet: A Generative Model for Raw Audio
- PROJECT: A TensorFlow implementation of Baidu's DeepSpeech architecture
- PROJECT: Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition using DeepMind's WaveNet
- CHALLENGE: The 5th CHiME Speech Separation and Recognition Challenge
- DATA: The 5th CHiME Speech Separation and Recognition Challenge
- DATA: CSTR VCTK Corpus
- DATA: LibriSpeech ASR corpus
- DATA: Switchboard-1 Telephone Speech Corpus
- DATA: TED-LIUM Corpus
- DATA: Open Speech and Language Resources
- DATA: Common Voice
Automatic Summarisation
- WIKI: Automatic summarization
- BOOK: Automatic Text Summarization
- PAPER: Text Summarization Using Neural Networks
- PAPER: Ranking with Recursive Neural Networks and Its Application to Multi-Document Summarization
- DATA: Text Analytics Conferences (TAC)
- DATA: Document Understanding Conferences (DUC)
Coreference Resolution
- INFO: Coreference Resolution
- PAPER: Deep Reinforcement Learning for Mention-Ranking Coreference Models
- PAPER: Improving Coreference Resolution by Learning Entity-Level Distributed Representations
- CHALLENGE: CoNLL 2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
- CHALLENGE: CoNLL 2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes
- CHALLENGE: SemEval 2018 Task 4: Character Identification on Multiparty Dialogues
Entity Linking
- See Named Entity Disambiguation
Grammatical Error Correction
- PAPER: A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction
- PAPER: Neural Network Translation Models for Grammatical Error Correction
- PAPER: Adapting Sequence Models for Sentence Correction
- CHALLENGE: CoNLL-2013 Shared Task: Grammatical Error Correction
- CHALLENGE: CoNLL-2014 Shared Task: Grammatical Error Correction
- DATA: NUS Non-commercial research/trial corpus license
- DATA: Lang-8 Learner Corpora
- DATA: Cornell Movie--Dialogs Corpus
- PROJECT: Deep Text Corrector
- PRODUCT: deep grammar
Grapheme To Phoneme Conversion
- PAPER: Grapheme-to-Phoneme Models for (Almost) Any Language
- PAPER: Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning
- PAPER: Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion
- PROJECT: Sequence-to-Sequence G2P toolkit
- PROJECT: g2p_en: A Simple Python Module for English Grapheme To Phoneme Conversion
- DATA: Multilingual Pronunciation Data
Humor and Sarcasm Detection
- PAPER: Automatic Sarcasm Detection: A Survey
- PAPER: Magnets for Sarcasm: Making Sarcasm Detection Timely, Contextual and Very Personal
- PAPER: Sarcasm Detection on Twitter: A Behavioral Modeling Approach
- CHALLENGE: SemEval-2017 Task 6: #HashtagWars: Learning a Sense of Humor
- CHALLENGE: SemEval-2017 Task 7: Detection and Interpretation of English Puns
- DATA: Sarcastic comments from Reddit
- DATA: Sarcasm Corpus V2
- DATA: Sarcasm Amazon Reviews Corpus
Language Grounding
- WIKI: Symbol grounding problem
- PAPER: The Symbol Grounding Problem
- PAPER: From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
- PAPER: Encoding of phonology in a recurrent neural model of grounded speech
- PAPER: Gated-Attention Architectures for Task-Oriented Language Grounding
- PAPER: Sound-Word2Vec: Learning Word Representations Grounded in Sounds
- COURSE: Language Grounding to Vision and Control
- WORKSHOP: Language Grounding for Robotics
Language Guessing
- See Language Identification
Language Identification
- WIKI: Language identification
- PAPER: AUTOMATIC LANGUAGE IDENTIFICATION USING DEEP NEURAL NETWORKS
- PAPER: Natural Language Processing with Small Feed-Forward Networks
- CHALLENGE: 2015 Language Recognition Evaluation
Language Modeling
- WIKI: Language model
- TOOLKIT: KenLM Language Model Toolkit
- PAPER: Distributed Representations of Words and Phrases and their Compositionality
- PAPER: Generating Sequences with Recurrent Neural Networks
- PAPER: Character-Aware Neural Language Models
- THESIS: Statistical Language Models Based on Neural Networks
- DATA: Penn Treebank
- TUTORIAL: TensorFlow Tutorial on Language Modeling with Recurrent Neural Networks
Language Recognition
- See Language Identification
Lemmatisation
- WIKI: Lemmatisation
- PAPER: Joint Lemmatization and Morphological Tagging with LEMMING
- TOOLKIT: WordNet Lemmatizer
- DATA: Treebank-3
Lip-reading
- WIKI: Lip reading
- PAPER: LipNet: End-to-End Sentence-level Lipreading
- PAPER: Lip Reading Sentences in the Wild
- PAPER: Large-Scale Visual Speech Recognition
- PROJECT: Lip Reading - Cross Audio-Visual Recognition using 3D Convolutional Neural Networks
- PRODUCT: Liopa
- DATA: The GRID audiovisual sentence corpus
- DATA: The BBC-Oxford 'Multi-View Lip Reading Sentences' (MV-LRS) Dataset
Machine Translation
- PAPER: Neural Machine Translation by Jointly Learning to Align and Translate
- PAPER: Neural Machine Translation in Linear Time
- PAPER: Attention Is All You Need
- PAPER: Six Challenges for Neural Machine Translation
- PAPER: Phrase-Based & Neural Unsupervised Machine Translation
- CHALLENGE: ACL 2014 NINTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION
- CHALLENGE: EMNLP 2017 SECOND CONFERENCE ON MACHINE TRANSLATION (WMT17)
- DATA: OpenSubtitles2016
- DATA: WIT3: Web Inventory of Transcribed and Translated Talks
- DATA: The QCRI Educational Domain (QED) Corpus
- PAPER: Multi-task Sequence to Sequence Learning
- PAPER: Unsupervised Pretraining for Sequence to Sequence Learning
- PAPER: Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
- TOOLKIT: Subword Neural Machine Translation with Byte Pair Encoding (BPE)
- TOOLKIT: Multi-Way Neural Machine Translation
- TOOLKIT: OpenNMT: Open-Source Toolkit for Neural Machine Translation
Morphological Inflection Generation
- WIKI: Inflection
- PAPER: Morphological Inflection Generation Using Character Sequence to Sequence Learning
- CHALLENGE: SIGMORPHON 2016 Shared Task: Morphological Reinflection
- DATA: sigmorphon2016
Named Entity Disambiguation
Named Entity Recognition
- WIKI: Named-entity recognition
- PAPER: Neural Architectures for Named Entity Recognition
- PROJECT: OSU Twitter NLP Tools
- CHALLENGE: Named Entity Recognition in Twitter
- CHALLENGE: CoNLL 2002 Language-Independent Named Entity Recognition
- CHALLENGE: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
- DATA: CoNLL-2002 NER corpus
- DATA: CoNLL-2003 NER corpus
- DATA: NUT Named Entity Recognition in Twitter Shared task
- TOOLKIT: Stanford Named Entity Recognizer
Paraphrase Detection
- PAPER: Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
- PROJECT: Paralex: Paraphrase-Driven Learning for Open Question Answering
- CHALLENGE: SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter
- DATA: Microsoft Research Paraphrase Corpus
- DATA: Microsoft Research Video Description Corpus
- DATA: Pascal Dataset
- DATA: Flickr Dataset
- DATA: The SICK data set
- DATA: PPDB: The Paraphrase Database
- DATA: WikiAnswers Paraphrase Corpus
Paraphrase Generation
- PAPER: Neural Paraphrase Generation with Stacked Residual LSTM Networks
- DATA: Neural Paraphrase Generation with Stacked Residual LSTM Networks
- CODE: Neural Paraphrase Generation with Stacked Residual LSTM Networks
- PAPER: A Deep Generative Framework for Paraphrase Generation
- PAPER: Paraphrasing Revisited with Neural Machine Translation
Parsing
- WIKI: Parsing
- TOOLKIT: The Stanford Parser: A statistical parser
- TOOLKIT: spaCy parser
- PAPER: Grammar as a Foreign Language
- PAPER: A fast and accurate dependency parser using neural networks
- PAPER: Universal Semantic Parsing
- CHALLENGE: CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
- CHALLENGE: CoNLL 2016 Shared Task: Multilingual Shallow Discourse Parsing
- CHALLENGE: CoNLL 2015 Shared Task: Shallow Discourse Parsing
- CHALLENGE: SemEval-2016 Task 8: The meaning representations may be abstract, but this task is concrete!
Part-of-speech Tagging
- WIKI: Part-of-speech tagging
- PAPER: Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
- PAPER: Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models
- DATA: Treebank-3
- TOOLKIT: nltk.tag package
Pinyin-To-Chinese Conversion
- WIKI: Pinyin input method
- PAPER: Neural Network Language Model for Chinese Pinyin Input Method Engine
- PROJECT: Neural Chinese Transliterator
Question Answering
- WIKI: Question answering
- PAPER: Ask Me Anything: Dynamic Memory Networks for Natural Language Processing
- PAPER: Dynamic Memory Networks for Visual and Textual Question Answering
- CHALLENGE: TREC Question Answering Task
- CHALLENGE: NTCIR-8: Advanced Cross-lingual Information Access (ACLIA)
- CHALLENGE: CLEF Question Answering Track
- CHALLENGE: SemEval-2017 Task 3: Community Question Answering
- CHALLENGE: SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge
- DATA: MS MARCO: Microsoft MAchine Reading COmprehension Dataset
- DATA: Maluuba NewsQA
- DATA: SQuAD: 100,000+ Questions for Machine Comprehension of Text
- DATA: GraphQuestions: A Characteristic-rich Question Answering Dataset
- DATA: Story Cloze Test and ROCStories Corpora
- DATA: Microsoft Research WikiQA Corpus
- DATA: DeepMind Q&A Dataset
- DATA: QASent
- DATA: Textbook Question Answering
Relationship Extraction
- WIKI: Relationship extraction
- PAPER: A deep learning approach for relationship extraction from interaction context in social manufacturing paradigm
- CHALLENGE: SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers
Semantic Role Labeling
- WIKI: Semantic role labeling
- BOOK: Semantic Role Labeling
- PAPER: End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks
- PAPER: Neural Semantic Role Labeling with Dependency Path Embeddings
- PAPER: Deep Semantic Role Labeling: What Works and What's Next
- CHALLENGE: CoNLL-2005 Shared Task: Semantic Role Labeling
- CHALLENGE: CoNLL-2004 Shared Task: Semantic Role Labeling
- TOOLKIT: Illinois Semantic Role Labeler (SRL)
- DATA: CoNLL-2005 Shared Task: Semantic Role Labeling
Sentence Boundary Disambiguation
- WIKI: Sentence boundary disambiguation
- PAPER: A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain
- TOOLKIT: NLTK Tokenizers
- DATA: The British National Corpus
- DATA: Switchboard-1 Telephone Speech Corpus
Sentiment Analysis
- WIKI: Sentiment analysis
- INFO: Awesome Sentiment Analysis
- CHALLENGE: Kaggle: UMICH SI650 - Sentiment Classification
- CHALLENGE: SemEval-2017 Task 4: Sentiment Analysis in Twitter
- CHALLENGE: SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs and News
- PROJECT: SenticNet
- PROJECT: Stanford NLP Group Sentiment Analysis
- DATA: Multi-Domain Sentiment Dataset (version 2.0)
- DATA: Stanford Sentiment Treebank
- DATA: Twitter Sentiment Corpus
- DATA: Twitter Sentiment Analysis Training Corpus
- DATA: AFINN: List of English words rated for valence
Sign Language Recognition/Translation
- PAPER: Video-based Sign Language Recognition without Temporal Segmentation
- PAPER: SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition
- DATA: RWTH-PHOENIX-Weather
- DATA: ASLLRP
- PROJECT: SignAll
Singing Voice Synthesis
- PAPER: Singing voice synthesis based on deep neural networks
- PAPER: A Neural Parametric Singing Synthesizer Modeling Timbre and Expression from Natural Songs
- PRODUCT: VOCALOID: voice synthesis technology and software developed by Yamaha
- CHALLENGE: Special Session Interspeech 2016 Singing synthesis challenge "Fill-in the Gap"
Social Science Applications
- WORKSHOP: NLP+CSS: Workshops on Natural Language Processing and Computational Social Science
- TOOLKIT: Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints
- TOOLKIT: Online Variational Bayes for Latent Dirichlet Allocation (LDA)
- GROUP: The University of Chicago Knowledge Lab
Source Separation
- WIKI: Source separation
- PAPER: From Blind to Guided Audio Source Separation
- PAPER: Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
- CHALLENGE: Signal Separation Evaluation Campaign (SiSEC)
- CHALLENGE: CHiME Speech Separation and Recognition Challenge
Speaker Authentication
- See Speaker Verification
Speaker Diarisation
- WIKI: Speaker diarisation
- PAPER: DNN-based speaker clustering for speaker diarisation
- PAPER: Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach
- PAPER: Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion
- CHALLENGE: Rich Transcription Evaluation
Speaker Recognition
- WIKI: Speaker recognition
- PAPER: A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK
- PAPER: DEEP NEURAL NETWORKS FOR SMALL FOOTPRINT TEXT-DEPENDENT SPEAKER VERIFICATION
- PAPER: Deep Speaker: an End-to-End Neural Speaker Embedding System
- PROJECT: Voice Vector: which of the Hollywood stars is most similar to my voice?
- CHALLENGE: NIST Speaker Recognition Evaluation (SRE)
- INFO: Are there any suggestions for free databases for speaker recognition?
- DATA: VoxCeleb2: Deep Speaker Recognition
Speech Reading
- See Lip-reading
Speech Recognition
- See Automatic Speech Recognition
Speech Segmentation
- WIKI: Speech segmentation
- PAPER: Word Segmentation by 8-Month-Olds: When Speech Cues Count More Than Statistics
- PAPER: Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
- PAPER: Unsupervised Lexicon Discovery from Acoustic Input
- PAPER: Weakly supervised spoken term discovery using cross-lingual side information
- DATA: CALLHOME Spanish Speech
Speech Synthesis
- WIKI: Speech synthesis
- PAPER: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- PAPER: WaveNet: A Generative Model for Raw Audio
- PAPER: Tacotron: Towards End-to-End Speech Synthesis
- PAPER: Deep Voice 3: 2000-Speaker Neural Text-to-Speech
- PAPER: Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
- DATA: The World English Bible
- DATA: LJ Speech Dataset
- DATA: Lessac Data
- CHALLENGE: Blizzard Challenge 2017
- PRODUCT: Lyrebird
- PROJECT: The Festvox project
- TOOLKIT: Merlin: The Neural Network (NN) based Speech Synthesis System
Speech Enhancement
- WIKI: Speech enhancement
- BOOK: Speech enhancement: theory and practice
- PAPER: An Experimental Study on Speech Enhancement Based on Deep Neural Network
- PAPER: A Regression Approach to Speech Enhancement Based on Deep Neural Networks
- PAPER: Speech Enhancement Based on Deep Denoising Autoencoder
Speech-To-Text
- See Automatic Speech Recognition
Spoken Term Detection
- See Speech Segmentation
Stemming
- WIKI: Stemming
- PAPER: A BACKPROPAGATION NEURAL NETWORK TO IMPROVE ARABIC STEMMING
- TOOLKIT: NLTK Stemmers
Term Extraction
- WIKI: Terminology extraction
- PAPER: Neural Attention Models for Sequence Classification: Analysis and Application to Key Term Extraction and Dialogue Act Detection
Text Similarity
- WIKI: Semantic similarity
- PAPER: A Survey of Text Similarity Approaches
- PAPER: Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks
- PAPER: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
- CHALLENGE: SemEval-2014 Task 3: Cross-Level Semantic Similarity
- CHALLENGE: SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
- CHALLENGE: SemEval-2017 Task 1: Semantic Textual Similarity
- WIKI: Semantic Textual Similarity Wiki
Text Simplification
- WIKI: Text simplification
- PAPER: Aligning Sentences from Standard Wikipedia to Simple Wikipedia
- PAPER: Problems in Current Text Simplification Research: New Data Can Help
- DATA: Newsela Data
Text-To-Speech
- See Speech Synthesis
Textual Entailment
- WIKI: Textual entailment
- PROJECT: Textual Entailment with TensorFlow
- PAPER: Textual Entailment with Structured Attentions and Composition
- CHALLENGE: SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment
- CHALLENGE: SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge
Transliteration
- WIKI: Transliteration
- INFO: Transliteration of Non-Latin scripts
- PAPER: A Deep Learning Approach to Machine Transliteration
- CHALLENGE: NEWS 2016 Shared Task on Transliteration of Named Entities
- PROJECT: Neural Japanese Transliteration—can you do better than SwiftKey™ Keyboard?
Voice Conversion
- PAPER: PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION WITHOUT PARALLEL DATA TRAINING
- PROJECT: Deep neural networks for voice conversion (voice style transfer) in Tensorflow
- PROJECT: An implementation of voice conversion system utilizing phonetic posteriorgrams
- CHALLENGE: Voice Conversion Challenge 2016
- CHALLENGE: Voice Conversion Challenge 2018
- DATA: CMU_ARCTIC speech synthesis databases
- DATA: TIMIT Acoustic-Phonetic Continuous Speech Corpus
Voice Recognition
- See Speaker Recognition
Word Embeddings
- WIKI: Word embedding
- TOOLKIT: Gensim: word2vec
- TOOLKIT: fastText
- TOOLKIT: GloVe: Global Vectors for Word Representation
- INFO: Where to get a pretrained model
- PROJECT: Pre-trained word vectors
- PROJECT: Pre-trained word vectors of 30+ languages
- PROJECT: Polyglot: Distributed word representations for multilingual NLP
- PROJECT: BPEmb: a collection of pre-trained subword embeddings in 275 languages
- CHALLENGE: SemEval 2018 Task 10: Capturing Discriminative Attributes
- PAPER: Bilingual Word Embeddings for Phrase-Based Machine Translation
- PAPER: A Survey of Cross-Lingual Embedding Models
Word Prediction
- INFO: What is Word Prediction?
- PAPER: The prediction of character based on recurrent neural network language model
- PAPER: An Embedded Deep Learning based Word Prediction
- PAPER: Evaluating Word Prediction: Framing Keystroke Savings
- DATA: An Embedded Deep Learning based Word Prediction
- PROJECT: Word Prediction using Convolutional Neural Networks—can you do better than iPhone™ Keyboard?
- CHALLENGE: SemEval-2018 Task 2: Multilingual Emoji Prediction
Word Segmentation
- WIKI: Word segmentation
- PAPER: Neural Word Segmentation Learning for Chinese
- PROJECT: Convolutional neural network for Chinese word segmentation
- TOOLKIT: Stanford Word Segmenter
- TOOLKIT: NLTK Tokenizers