An (incomplete) overview of information extraction

Make-Information-Extraction-Great-Again

Contributions are welcome!

Contributors: Runxin Xu, Shuang Zeng

Thanks to Yuxuan Fan and Yifan Song for their suggestions!

Content

  • Named Entity Recognition
  • Coreference Resolution
  • Relation Extraction
  • Event Extraction
  • Joint Information Extraction

Named Entity Recognition

  • Sentence-level Named Entity Recognition
  • Chinese Named Entity Recognition
  • Few-shot Named Entity Recognition
  • Document-level Named Entity Recognition

Sentence-level Named Entity Recognition

What is it?

image

Given a sentence, the task aims at recognizing the entities in the sentence together with their entity types. For example, Alpha is a Protein-type entity in the figure.
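
One common way to cast this task is token-level sequence labeling with BIO tags. The overview itself does not prescribe a tagging scheme, so the scheme, the `decode_bio` helper, and the example sentence below are assumptions made for illustration only; this is a minimal sketch, not a reference implementation.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn per-token BIO tags into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical sentence; only "Alpha is a Protein" comes from the text above.
tokens = ["Alpha", "binds", "to", "the", "beta", "receptor"]
tags = ["B-Protein", "O", "O", "O", "B-Protein", "I-Protein"]
print(decode_bio(tokens, tags))
```

Because each token carries exactly one tag, this flat scheme cannot represent nested or discontinuous entities, which is why they are listed as challenges.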

What are the challenges?

  • How to handle the nested entities?
  • How to handle the discontinuous entities?

Mainstream methods?

Datasets?

Chinese Named Entity Recognition

What is it?

img

Unlike English sentences, Chinese sentences have a different structure of characters and words and no explicit word boundaries, which makes the task more challenging.

What are the challenges?

  • How to better utilize the character-word lattice structure to enhance Chinese NER?
  • How to accelerate the model so that we can fully utilize the parallel computation of GPUs with a high inference speed?
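
To make the character-word lattice concrete, the sketch below enumerates every lexicon word that matches a character span of a sentence. The `match_lexicon` helper, the toy lexicon, and the example sentence are assumptions for illustration; real systems use much larger lexicons and feed the matches into a model.

```python
from typing import List, Set, Tuple


def match_lexicon(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return every (start, end, word) span that is a lexicon word.

    `end` is exclusive; only words of at least two characters are matched.
    """
    matches = []
    for start in range(len(sentence)):
        longest_end = min(start + max_len, len(sentence))
        for end in range(start + 2, longest_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# Toy example: "Nanjing Yangtze River Bridge".
sentence = "南京市长江大桥"
# Nanjing, Nanjing City, mayor, Yangtze River, bridge, Yangtze River Bridge
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
for start, end, word in match_lexicon(sentence, lexicon):
    print(start, end, word)
```

The overlapping matches (for example "南京市" and "市长" share a character) show why the word segmentation is ambiguous and why the lattice keeps all candidates.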

Mainstream methods?

Datasets?

Few-shot Named Entity Recognition

What is it?

img

The few-shot NER problem is usually modeled as an N-way-K-shot task following the traditional meta-learning paradigm.

What are the challenges?

  • How to precisely capture the characteristics of different entity types given little training data?
  • How to handle the imbalance of entity types (the dominance of the O class)?

Mainstream methods?

Datasets?

Document-level Named Entity Recognition

What is it?

img

Given a document that consists of multiple sentences, document-level NER aims at recognizing entities in the whole document.

What are the challenges?

  • How to better utilize the document-level context?
  • How to maintain the consistency of entities occurring multiple times across the document?
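
As a small illustration of the consistency challenge, the sketch below relabels every mention with the majority type predicted for its surface form across the document. Majority voting is only one simple heuristic, not a method named in this overview, and the `enforce_consistency` helper and the example predictions are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def enforce_consistency(predictions: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Relabel each mention with the majority type of its surface form."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for mention, entity_type in predictions:
        votes[mention][entity_type] += 1
    # most_common(1) returns the type with the highest count for each mention.
    majority = {m: c.most_common(1)[0][0] for m, c in votes.items()}
    return [(mention, majority[mention]) for mention, _ in predictions]


# Hypothetical document-level predictions, in reading order.
predictions = [("Apple", "ORG"), ("Apple", "LOC"), ("Apple", "ORG"), ("Tim Cook", "PER")]
print(enforce_consistency(predictions))
```

Matching on the surface form alone assumes that identical strings refer to the same entity, which does not always hold.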

Mainstream methods?

Datasets?

Coreference Resolution

What is it?

Coreference resolution is the task of clustering the mention spans in a text that refer to the same underlying real-world entity.

Example:

image

"I", "my", and "she" belong to the same cluster and "Obama" and "he" belong to the same cluster.

In this task, a mention span may be a named entity mention, a pronoun, a verb, etc.
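
The clustering view can be sketched as follows: given pairwise coreference links, a union-find structure merges mentions into clusters. The `cluster_mentions` helper and the particular links are assumptions for illustration; only the two resulting clusters come from the example above.

```python
from typing import Dict, List, Tuple


def cluster_mentions(mentions: List[str], links: List[Tuple[str, str]]) -> List[List[str]]:
    """Group mentions into clusters from pairwise coreference links."""
    parent: Dict[str, str] = {m: m for m in mentions}

    def find(m: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        parent[find(b)] = find(a)

    clusters: Dict[str, List[str]] = {}
    for m in mentions:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


mentions = ["I", "my", "she", "Obama", "he"]
# Assumed pairwise decisions, e.g. produced by a mention-pair scorer.
links = [("I", "my"), ("my", "she"), ("Obama", "he")]
print(cluster_mentions(mentions, links))
```

Mentions are identified by their strings here for brevity; a real system would key them by span offsets, since the same word can occur several times.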

What are the challenges?

  • How to better represent mention spans?
  • How to reduce the enormous number of invalid mention spans?
  • How to model contextual representation?
  • How to distinguish pronoun coreference and entity coreference?

Mainstream methods?

image

  • Entity Mention Resolution
  • Pronoun Resolution

Entity Mention Resolution

Pronoun Resolution

Datasets?

Relation Extraction

  • Sentence-level Relation Extraction
  • Distant Supervised Relation Extraction
  • Few-shot Relation Extraction
  • Document-level Relation Extraction

Sentence-level Relation Extraction

What is it?

image

Given a sentence, the task aims at extracting the (head entity, relation, tail entity) triples out of the sentence.

As illustrated in the figure, we extract (United States, Country_president, Trump) and (Apple Inc, Company_CEO, Tim Cook) out of the sentence.

What are the challenges?

  • How to better understand the semantics of the sentence?
  • How to better take advantage of the interactions between entity recognition and relation extraction?
  • How to handle the Single-Entity-Overlap (SEO) problem, in which two different relation triples share one entity?
  • How to handle the Entity-Pair-Overlap (EPO) problem, in which two different relation triples have the same entity pair?
  • How to handle the dependency and interactions between different relations?
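
The two overlap patterns can be told apart with a few lines of code. In this sketch the `overlap_type` helper is hypothetical, entity pairs are compared without regard to order (an assumption), and the third and fourth triples are made up for illustration; only the first two triples come from the example above.

```python
from typing import Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def overlap_type(a: Triple, b: Triple) -> str:
    """Classify how two relation triples overlap: EPO, SEO, or Normal."""
    pair_a, pair_b = {a[0], a[2]}, {b[0], b[2]}
    if pair_a == pair_b:
        return "EPO"  # same entity pair, different relation
    if pair_a & pair_b:
        return "SEO"  # exactly one shared entity
    return "Normal"  # no shared entity


t1 = ("United States", "Country_president", "Trump")
t2 = ("Apple Inc", "Company_CEO", "Tim Cook")
t3 = ("Trump", "Nationality", "United States")  # hypothetical triple
t4 = ("Apple Inc", "Located_in", "United States")  # hypothetical triple

print(overlap_type(t1, t2))
print(overlap_type(t1, t3))
print(overlap_type(t1, t4))
```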

Mainstream methods?

Datasets?

Distant Supervised Relation Extraction

What is it?

image

Distant supervised RE aims at annotating unlabeled sentences with the help of a large-scale knowledge base. If a sentence contains two entities that also occur in the knowledge base and hold a specific relation there, we assume that the sentence expresses that relation. This assumption inevitably introduces noise (false positive examples).

image

People usually formulate the problem as a bag-level RE task. A bag is a set of sentences that contain the same entity pair. Given a bag, the model has to correctly predict the relations of that entity pair, even though some sentences in the bag may be false positive examples.
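
The labeling assumption and the bag construction can be sketched in a few lines. The `build_bags` helper, the plain substring matching of entities, and the example sentences are assumptions for illustration; the knowledge-base facts reuse the triples from the sentence-level example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_bags(
    sentences: List[str], kb: Dict[Tuple[str, str], str]
) -> Dict[Tuple[str, str, str], List[str]]:
    """Group sentences into bags keyed by (head, tail, relation).

    A sentence joins a bag whenever it mentions both entities of a
    knowledge-base fact, whether or not it expresses the relation.
    """
    bags: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for sentence in sentences:
        for (head, tail), relation in kb.items():
            # Naive entity matching by substring (an assumption).
            if head in sentence and tail in sentence:
                bags[(head, tail, relation)].append(sentence)
    return dict(bags)


kb = {
    ("Apple Inc", "Tim Cook"): "Company_CEO",
    ("United States", "Trump"): "Country_president",
}
# Hypothetical sentences.
sentences = [
    "Tim Cook is the CEO of Apple Inc.",
    "Tim Cook spoke at an event near the Apple Inc campus.",
    "Trump gave a speech.",
]
print(build_bags(sentences, kb))
```

The second sentence lands in the Company_CEO bag although it does not state that relation, which is the kind of false positive the bag-level formulation has to tolerate.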

What are the challenges?

  • How to better filter out the false positive examples (noise) and reduce their impact?
  • How to make the most of the information in different sentences that share the same entity pair (i.e., that are in the same bag)?

Mainstream methods?

Datasets?

Few-shot Relation Extraction

What is it?

image

The few-shot RE problem is usually modeled as an N-way-K-shot task following the traditional meta-learning paradigm. Given N relation types with K instances each, the task aims at predicting which relation type the query (test) instance belongs to.
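
A minimal sketch of how one N-way-K-shot episode could be drawn from a labeled pool is shown below. The `sample_episode` helper, its parameters, and the toy data are assumptions for illustration and are not tied to any particular benchmark.

```python
import random
from typing import Dict, List, Tuple


def sample_episode(
    data: Dict[str, List[str]], n_way: int, k_shot: int, n_query: int, seed: int = 0
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str]]]:
    """Sample a support set (N types x K instances) and labeled queries."""
    rng = random.Random(seed)
    relation_types = rng.sample(sorted(data), n_way)
    support: Dict[str, List[str]] = {}
    query: List[Tuple[str, str]] = []
    for label in relation_types:
        # Draw K support and n_query query instances without overlap.
        instances = rng.sample(data[label], k_shot + n_query)
        support[label] = instances[:k_shot]
        query.extend((instance, label) for instance in instances[k_shot:])
    return support, query


# Hypothetical pool: relation type -> instance identifiers.
data = {
    "Company_CEO": [f"ceo_{i}" for i in range(5)],
    "Country_president": [f"president_{i}" for i in range(5)],
    "Born_in": [f"born_{i}" for i in range(5)],
}
support, query = sample_episode(data, n_way=2, k_shot=2, n_query=1)
print(support)
print(query)
```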

What are the challenges?

  • How to precisely capture the characteristics of different relation types given little training data?
  • How to better consider the interactions among instances within the support set?
  • How to better consider the interactions between the instances in the support set and the query?
  • How to make the model more robust to noise?

Mainstream methods?

Datasets?

Document-level Relation Extraction

What is it?

image

Given a document that consists of multiple sentences, the task aims at extracting relation triples out of the document.

What are the challenges?

  • How to better capture the semantics of the whole long document?
  • How to handle the fact that an entity usually has many mentions across the document?
  • How to handle inter-sentence relations, in which the head entity and the tail entity are not located in the same sentence and may be far away from each other?
  • How to handle the reasoning among relations?
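
Whether an entity pair needs inter-sentence reasoning can be read off the sentence indices of its mentions. The sketch below assumes each entity is given as the list of sentence indices in which it is mentioned; the helper names and the example document are hypothetical.

```python
from typing import Dict, List


def sentence_gap(head_sents: List[int], tail_sents: List[int]) -> int:
    """Smallest sentence distance between any head and any tail mention."""
    return min(abs(h - t) for h in head_sents for t in tail_sents)


def is_inter_sentence(head_sents: List[int], tail_sents: List[int]) -> bool:
    """True if no single sentence mentions both entities."""
    return sentence_gap(head_sents, tail_sents) > 0


# Hypothetical document: entity -> indices of the sentences that mention it.
mention_sentences: Dict[str, List[int]] = {
    "Apple Inc": [0, 3],
    "Tim Cook": [3],
    "United States": [5],
}
print(is_inter_sentence(mention_sentences["Apple Inc"], mention_sentences["Tim Cook"]))
print(sentence_gap(mention_sentences["Tim Cook"], mention_sentences["United States"]))
```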

Mainstream methods?

Datasets?

Event Extraction

  • Sentence-level Event Extraction
  • Distant-supervised Event Extraction
  • Few-shot Event Extraction
  • Document-level Event Extraction
  • Relations Among Events

Sentence-level Event Extraction

What is it?

image

Given a sentence, the task aims at handling four sub-tasks:

  • Event Detection
    • Trigger Identification: Identify the trigger that triggers a specific event (e.g., held and come in the figure).
    • Event Classification: Predict what event type the trigger has triggered (e.g., Meet and Transport in the figure).
  • Event Argument Extraction
    • Argument Identification: Identify potential arguments for a specific event (e.g., Thailand, 2024, etc. in the figure).
    • Argument Role Prediction: Predict what role the argument plays in the event (e.g., place, time, etc. in the figure).
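
The output of the four sub-tasks can be held in a small record type, sketched below. The `Event` and `Argument` classes are hypothetical, and attaching Thailand and 2024 to the Meet event is an assumption: the text above names the triggers, event types, arguments, and roles, but not which argument belongs to which event.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    text: str  # Argument Identification
    role: str  # Argument Role Prediction


@dataclass
class Event:
    trigger: str  # Trigger Identification
    event_type: str  # Event Classification
    arguments: List[Argument] = field(default_factory=list)


def argument_roles(event: Event) -> Dict[str, str]:
    """Map each role of an event to the text of the argument that fills it."""
    return {argument.role: argument.text for argument in event.arguments}


# Assumed assignment of arguments to events (see the note above).
meet = Event("held", "Meet", [Argument("Thailand", "place"), Argument("2024", "time")])
transport = Event("come", "Transport")
print(argument_roles(meet))
print(argument_roles(transport))
```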

What are the challenges?

  • How to better understand the semantics of the sentence?
  • How to deal with the error propagation problems?
  • How to capture the dependency and interaction between different events?
  • How to capture the dependency and interaction between different arguments of the same event?

Mainstream methods?

Datasets?

Distant Supervised Event Extraction

What is it?

image

Distant-supervised EE annotates unlabeled sentences with the help of a large-scale knowledge base. As illustrated in the figure, the sentence from Wikipedia is annotated with the knowledge in Freebase.

What are the challenges?

  • How to deal with the noise brought by distant supervision?

Mainstream methods?

Datasets?

Few-shot Event Extraction

What is it?

image

Few-shot Event Extraction is usually formulated as an N-way-K-shot meta-learning problem. Given N event types with K instances each, the task aims at predicting which event type the query instance belongs to.
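
One common way to solve such an episode is nearest-prototype classification in the style of prototypical networks: average the support embeddings of each type and assign the query to the closest average. The overview does not name a method, so this choice, the helper names, and the toy two-dimensional embeddings are all assumptions.

```python
from typing import Dict, List, Sequence


def prototype(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the support embeddings of one event type."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query: Sequence[float], support: Dict[str, List[List[float]]]) -> str:
    """Return the event type whose prototype is closest to the query."""
    prototypes = {label: prototype(vectors) for label, vectors in support.items()}
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Hypothetical 2-way-2-shot episode with made-up embeddings.
support = {
    "Meet": [[1.0, 0.0], [0.8, 0.2]],
    "Transport": [[0.0, 1.0], [0.2, 0.8]],
}
print(classify([0.9, 0.1], support))
```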

What are the challenges?

  • How to precisely capture the characteristics of different event types given little training data?
  • How to better consider the interactions among instances within the support set?
  • How to better consider the interactions between the instances in the support set and the query?
  • How to address the trigger bias and avoid overfitting?

Mainstream methods?

Datasets?

Document-level Event Extraction

What is it?

image

Given a document that consists of multiple sentences, the task aims at extracting events out of the whole document.

What are the challenges?

  • How to better capture the semantics of the whole long document?
  • How to handle cross-sentence events, in which the event arguments are scattered across different sentences?
  • How to capture the interdependency among different events and different argument roles?

Mainstream methods?

Datasets?

Relations Among Events

What is it?

image

Given a text, the task aims at predicting the relations among the different events expressed in the text. The relation types usually include coreference, causality, temporal, sub-event, etc.

What are the challenges?

  • How to better understand the relations between different events?
  • How to better consider the global constraints between different event relations?
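
Transitivity of temporal order is one example of such a global constraint: if A is before B and B is before C, then C cannot be before A. The sketch below computes the transitive closure of predicted "before" relations and reports contradictory pairs. The `find_temporal_conflicts` helper and the event identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def find_temporal_conflicts(before: Set[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return event pairs that end up ordered both ways after closure."""
    closure = set(before)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a before b and b before d implies a before d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    conflicts = {
        tuple(sorted((a, b))) for a, b in closure if a != b and (b, a) in closure
    }
    return sorted(conflicts)


# Hypothetical pairwise predictions: (x, y) means "x happens before y".
consistent = {("e1", "e2"), ("e2", "e3")}
cyclic = {("e1", "e2"), ("e2", "e3"), ("e3", "e1")}
print(find_temporal_conflicts(consistent))
print(find_temporal_conflicts(cyclic))
```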

Mainstream methods?

Datasets?

Joint Information Extraction

What is it?

image

Joint Information Extraction aims at handling multiple extraction tasks simultaneously, including named entity recognition, relation extraction, event extraction, etc. As illustrated in the figure, the entities are recognized, together with the relation triples (man, ART, taxicab), (man, PHYS, checkpoint), (soldiers, PHYS, checkpoint) and the Transport events with Artifact and Destination arguments.

What are the challenges?

  • How to achieve satisfactory performance on each sub-task?
  • How to make the most of the interactions among different sub-tasks?
  • How to derive globally optimal extraction results that respect global constraints?
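
A very simple global constraint is that every relation must connect two recognized entities. The sketch below flags relation triples that violate it. The `dangling_relations` helper is hypothetical; the entities and triples come from the example above, and the second entity set deliberately drops one entity to trigger a violation.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def dangling_relations(entities: Set[str], relations: List[Triple]) -> List[Triple]:
    """Return the triples whose head or tail is not a recognized entity."""
    return [r for r in relations if r[0] not in entities or r[2] not in entities]


relations = [
    ("man", "ART", "taxicab"),
    ("man", "PHYS", "checkpoint"),
    ("soldiers", "PHYS", "checkpoint"),
]
all_entities = {"man", "taxicab", "checkpoint", "soldiers"}
missed_one = {"man", "taxicab", "checkpoint"}  # simulates a missed entity

print(dangling_relations(all_entities, relations))
print(dangling_relations(missed_one, relations))
```

A joint model can use checks like this either as hard filters at decoding time or as features for scoring a whole extraction result.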

Mainstream methods?

Datasets?

Others

  • Open-domain Information Extraction
  • Type/Schema Induction
  • ...