CIKM-20
CauseNet: Towards a Causality Graph Extracted from the Web
CauseNet Source Code for Analysis & Extraction
This source code forms the basis for our CIKM 2020 paper CauseNet: Towards a Causality Graph Extracted from the Web. The code is divided into two components: one for analyzing the graph and one for extracting the graph from the web. The final graph can be downloaded from causenet.org. When using the code, please cite the paper as follows:
@inproceedings{heindorf2020causenet,
author = {Stefan Heindorf and
Yan Scholten and
Henning Wachsmuth and
Axel-Cyrille Ngonga Ngomo and
Martin Potthast},
title = {CauseNet: Towards a Causality Graph Extracted from the Web},
booktitle = {{CIKM}},
publisher = {{ACM}},
year = {2020}
}
Overview
Project structure
We assume the following project structure:
CIKM-20/
├── java
│   ├── bootstrapping
│   └── extraction
├── notebooks
│   ├── 01-concept-spotting
│   │   ├── 01-texts-training.ipynb
│   │   ├── 02-texts-spotting-wikipedia.ipynb
│   │   ├── 03-texts-spotting-clueweb.ipynb
│   │   ├── 04-infoboxes-training.ipynb
│   │   ├── 05-infoboxes-spotting.ipynb
│   │   ├── 06-lists-training.ipynb
│   │   └── 07-lists-spotting.ipynb
│   ├── 02-graph-construction
│   │   └── 01-graph-construction.ipynb
│   ├── 03-graph-analysis
│   │   ├── 01-knowledge-bases-overview.ipynb
│   │   └── 02-graph-statistics.ipynb
│   └── 04-graph-evaluation
│       ├── 01-graph-evaluation-precision.ipynb
│       ├── 02-qa-corpus-construction.ipynb
│       └── 03-graph-evaluation-recall.ipynb
└── data/
    ├── bootstrapping
    │   ├── 0-instances
    │   ├── 0-patterns
    │   ├── 1-instances
    │   ├── 1-patterns
    │   ├── 2-instances
    │   ├── 2-patterns
    │   └── seeds.csv
    ├── question-answering/
    ├── causality-graphs/
    │   ├── extraction
    │   │   ├── clueweb
    │   │   └── wikipedia
    │   ├── spotting
    │   │   ├── clueweb
    │   │   └── wikipedia
    │   ├── integration
    │   ├── causenet-full.jsonl.bz2
    │   ├── causenet-precision.jsonl.bz2
    │   └── causenet-sample.json
    ├── categorization
    ├── random
    ├── concept-spotting
    │   ├── infoboxes
    │   ├── lists
    │   └── texts
    ├── flair-models
    │   ├── infoboxes
    │   ├── lists/
    │   └── texts/
    ├── lucene-index/
    └── external
        ├── extraction-sources
        │   ├── clueweb12
        │   └── wikipedia
        ├── knowledge-bases
        │   ├── conceptnet-assertions-5.6.0.csv
        │   ├── freebase-rdf-latest.gz
        │   └── wikidata-20181001-all.json.bz2
        ├── msmarco
        ├── nltk
        ├── stop-word-lists
        ├── spacy
        └── stanfordnlp
Prerequisites
We recommend Miniconda for easy installation on many platforms.
- Create new environment:
conda env create -f environment.yml
- Activate environment:
conda activate cikm20-causenet
- Install Kernel:
python -m ipykernel install --user --name cikm20-causenet --display-name cikm20-causenet
- Start Jupyter:
jupyter notebook
CauseNet: Analysis
The code was tested with Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.
Overview of causal relations in knowledge bases
This step computes the overview of causal relations in knowledge bases reported in Table 1 of the paper.
Required Input Data
- CauseNet-Full (output of the extraction component)
data/causality-graphs/causenet-full.jsonl.bz2
- Freebase
data/external/knowledge-bases/freebase-rdf-latest.gz
- ConceptNet (version 5.6.0)
data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv
- Wikidata
data/external/knowledge-bases/wikidata-20181001-all.json.bz2
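Because CauseNet-Full is stored as bzip2-compressed JSON Lines (the `.jsonl.bz2` extension), it can be streamed record by record instead of being loaded into memory at once. The following is a minimal sketch, not code taken from the notebooks; the helper names `iter_jsonl_bz2` and `top_level_keys` are hypothetical, and no assumption is made about the fields inside each record.

```python
import bz2
import json
from collections import Counter
from itertools import islice


def iter_jsonl_bz2(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl.bz2 file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def top_level_keys(path, limit=1000):
    """Count the top-level keys seen in the first `limit` records."""
    counts = Counter()
    for record in islice(iter_jsonl_bz2(path), limit):
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

Calling `top_level_keys("data/causality-graphs/causenet-full.jsonl.bz2")` is a quick way to inspect the record layout before running the notebook.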
Execution
Execute the following notebook:
notebooks/03-graph-analysis/
└── 01-knowledge-bases-overview.ipynb
CauseNet: Graph Analysis
Required Input Data
- CauseNet-Full (output of the extraction component)
data/causality-graphs/integration/causenet-full.jsonl.bz2
- Manual categorization
/data/categorization/manual_categorization.csv
- Wikipedia extraction (output of the Wikipedia extraction component)
data/causality-graphs/extraction/wikipedia/wikipedia-extraction.tsv
Execution
Execute the following notebook:
notebooks/03-graph-analysis/
└── 02-graph-statistics.ipynb
CauseNet: Graph Evaluation
Required Software
- DBpedia Spotlight
  - Installation Instructions: https://github.com/dbpedia-spotlight/dbpedia-spotlight-model
  - Required files:
    - https://sourceforge.net/projects/dbpedia-spotlight/files/spotlight/dbpedia-spotlight-1.0.0.jar
    - https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/en/model/en.tar.gz
Required Input Data
- CauseNet-Full (output of the extraction component)
data/causality-graphs/integration/causenet-full.jsonl.bz2
- Random numbers for reproducibility:
data/random/generated_random_numbers.bz2
- MSMARCO (version: 2.1):
data/external/msmarco/train_v2.1.json
data/external/msmarco/dev_v2.1.json
- ConceptNet (version 5.6.0)
data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv
- Wikidata
data/external/knowledge-bases/wikidata-20181001-all.json.bz2
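Before running the evaluation notebooks, it can help to confirm that every input file listed above is in place. This is a minimal sketch, assuming it is run from the project root; `REQUIRED_INPUTS` and `missing_inputs` are hypothetical names and not part of the repository.

```python
from pathlib import Path

# Paths copied from the "Required Input Data" list above,
# interpreted relative to the project root.
REQUIRED_INPUTS = [
    "data/causality-graphs/integration/causenet-full.jsonl.bz2",
    "data/random/generated_random_numbers.bz2",
    "data/external/msmarco/train_v2.1.json",
    "data/external/msmarco/dev_v2.1.json",
    "data/external/knowledge-bases/conceptnet-assertions-5.6.0.csv",
    "data/external/knowledge-bases/wikidata-20181001-all.json.bz2",
]


def missing_inputs(project_root, relative_paths=REQUIRED_INPUTS):
    """Return the relative paths that do not exist as files under project_root."""
    root = Path(project_root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs(".")
    for path in missing:
        print("missing:", path)
    if not missing:
        print("all required input files found")
```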
Execution
Execute the following notebooks:
notebooks/04-graph-evaluation/
├── 01-graph-evaluation-precision.ipynb
├── 02-qa-corpus-construction.ipynb
└── 03-graph-evaluation-recall.ipynb
Computed Output Data
02-qa-corpus-construction.ipynb will extract simple causal questions from MSMARCO:
question-answering/
├── causality-qa-training.json
└── causality-qa-validation.json
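To give a rough idea of what extracting simple causal questions can look like, the sketch below filters question strings with a single regular expression. This is an illustrative assumption, not the notebook's actual logic: the pattern `CAUSAL_QUESTION`, the function `extract_causal_questions`, and the restriction to "what causes ..." questions are all hypothetical, and loading the MSMARCO files is left out.

```python
import re

# Hypothetical cue pattern for questions such as "what causes X" or
# "what can cause X". The notebook's real rules may be different.
CAUSAL_QUESTION = re.compile(
    r"^what\s+(?:can\s+|could\s+|may\s+|might\s+)?causes?\s+(?P<effect>.+?)\??$",
    re.IGNORECASE,
)


def extract_causal_questions(questions):
    """Return (question, effect) pairs for questions matching the cue pattern."""
    matches = []
    for question in questions:
        match = CAUSAL_QUESTION.match(question.strip())
        if match:
            matches.append((question, match.group("effect").strip()))
    return matches


if __name__ == "__main__":
    sample = [
        "what causes high blood pressure",
        "What can cause a fever?",
        "how tall is the eiffel tower",
    ]
    for question, effect in extract_causal_questions(sample):
        print(effect, "<-", question)
```

The captured effect phrase is what a recall evaluation could then look up in the graph; how the notebook performs that lookup is not shown here.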
CauseNet: Graph Extraction
The graph extraction is structured as follows:
- Bootstrapping Component (Java):
  - generates linguistic patterns from Wikipedia sentences using a bootstrapping approach
- Extraction Component (Java):
  - uses linguistic patterns to extract causal relations from the following sources:
    - Extracting from Wikipedia
    - Extracting from ClueWeb12
- Causal Concept Spotting (Python):
  - trains sequence taggers for sentences, infoboxes and lists
  - spots causal concepts in the extractions of the previous step
- Graph construction (Python):
  - performs the final construction and reconciliation steps
The code was tested with Java 8 and Python 3.7.3, under Linux 4.9.0-8-amd64 with 16 cores and 256 GB RAM.
Bootstrapping Component
Required Input Data
- Bootstrapping seeds:
data/bootstrapping/seeds.csv
- Lucene index with preprocessed Wikipedia sentences:
data/lucene-index/
Execution
- Compile:
mvn package -f ./java/bootstrapping/pom.xml
- Execute:
./scripts/bootstrapping.sh
Computed Output Data
The bootstrapping component will compute the following files:
data/bootstrapping/
├── 0-instances
├── 0-patterns
├── 1-instances
├── 1-patterns
├── 2-instances
└── 2-patterns
The following components will use the patterns after the second iteration: data/bootstrapping/2-patterns.
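The idea of alternating between patterns and instances can be illustrated with a toy example. The sketch below is a simplified stand-in, not the Java implementation: it treats the literal text between a cause and an effect as a "pattern" and only extracts single-word concepts. All function names and example sentences are hypothetical.

```python
import re


def learn_patterns(sentences, instances):
    """Collect the text found between a known cause and its effect."""
    patterns = set()
    for cause, effect in instances:
        regex = re.compile(
            re.escape(cause) + r"\s+(.+?)\s+" + re.escape(effect),
            re.IGNORECASE,
        )
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1).lower())
    return patterns


def apply_patterns(sentences, patterns):
    """Find (cause, effect) pairs: single words on both sides of a pattern."""
    instances = set()
    for pattern in patterns:
        regex = re.compile(
            r"(\w+)\s+" + re.escape(pattern) + r"\s+(\w+)", re.IGNORECASE
        )
        for sentence in sentences:
            for cause, effect in regex.findall(sentence):
                instances.add((cause.lower(), effect.lower()))
    return instances


def bootstrap(sentences, seeds, iterations=2):
    """Alternate between pattern learning and instance extraction."""
    instances = {(cause.lower(), effect.lower()) for cause, effect in seeds}
    patterns = set()
    for _ in range(iterations):
        patterns |= learn_patterns(sentences, instances)
        instances |= apply_patterns(sentences, patterns)
    return patterns, instances


if __name__ == "__main__":
    sentences = [
        "Smoking causes cancer.",
        "Rain causes flooding.",
        "Rain leads to flooding.",
        "Flooding leads to damage.",
    ]
    patterns, instances = bootstrap(sentences, [("smoking", "cancer")])
    print(sorted(patterns))
    print(sorted(instances))
```

Starting from one seed pair, the first round learns the pattern from the seed sentence and finds a new pair, and the second round uses that new pair to learn a further pattern. Each round can therefore widen both sets, which is also why noisy patterns need to be filtered in a real system.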
Extraction Component: Wikipedia
Input Data
- Wikipedia XML dump:
data/external/extraction-sources/wikipedia/enwiki-20181001-pages-articles.xml
- Patterns of the second bootstrapping iteration:
data/bootstrapping/2-patterns
Execution
- Compile:
mvn package -f ./java/extraction/pom.xml
- Execute:
./scripts/extraction-wikipedia.sh
Computed Output Data
- Causal relations extracted from texts, infoboxes and lists:
data/causality-graphs/extraction/
└── wikipedia
    └── wikipedia-extraction.tsv
Extraction Component: ClueWeb12
We provide code to parse one ClueWeb12 file. To parse the entire ClueWeb12 corpus, you can integrate this code into your cluster software.
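One simple way to prepare a cluster run is to split the corpus files into disjoint work lists, one per worker. The sketch below only enumerates and partitions `*.warc.gz` files; the function name `shard_warc_files` is hypothetical, and how each worker invokes the extraction code depends on your cluster software and is not shown.

```python
from pathlib import Path


def shard_warc_files(corpus_dir, num_shards):
    """Distribute all *.warc.gz files under corpus_dir over num_shards lists."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Sorting makes the assignment of files to shards reproducible.
    files = sorted(Path(corpus_dir).rglob("*.warc.gz"))
    shards = [[] for _ in range(num_shards)]
    for index, path in enumerate(files):
        shards[index % num_shards].append(path)
    return shards
```

Round-robin assignment keeps the shard sizes within one file of each other, which is usually sufficient when the individual WARC files are of similar size.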
Input Data
- ClueWeb12 file in WARC format:
data/external/extraction-sources/clueweb12/0013wb-88.warc.gz
- Patterns of the second bootstrapping iteration:
data/bootstrapping/2-patterns
- Stop word list for parsing webpages:
data/external/stop-word-lists/enStopWordList.txt
Execution
- Compile:
mvn package -f ./java/extraction/pom.xml
- Execute:
./scripts/extraction-clueweb12.sh
Computed Output Data
- Causal relations extracted from webpage texts:
data/causality-graphs/extraction/
└── clueweb12
    └── clueweb12-extraction.tsv
Causal Concept Spotting
Models were trained on an NVIDIA GeForce GTX 1080 Ti (11 GB). To reproduce the results, we recommend using a similar GPU architecture. If you do not want to retrain the models, you can use our models: /data/flair-models/
Required Software
No manual steps are required. The correct versions will be installed automatically if you use the provided environment.yml.
For completeness:
- Flair (version: 0.4.2)
- Stanford Parser (version: 0.2.0) (The following bug should be fixed: https://github.com/stanfordnlp/stanza/issues/135)
- Spacy (version: 2.1.8)
  - Model version: 2.1.0
    pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz
Required Input Data
- Concept Spotting datasets:
/data/concept-spotting/
This folder contains the manually annotated training and evaluation data for the concept spotting.
- Output data of the extraction components:
data/causality-graphs/extraction/
├── clueweb12
│   └── clueweb12-extraction.tsv
└── wikipedia
    └── wikipedia-extraction.tsv
Execution
Execute the following notebooks:
notebooks/01-concept-spotting/
├── 01-texts-training.ipynb
├── 02-texts-spotting-wikipedia.ipynb
├── 03-texts-spotting-clueweb.ipynb
├── 04-infoboxes-training.ipynb
├── 05-infoboxes-spotting.ipynb
├── 06-lists-training.ipynb
└── 07-lists-spotting.ipynb
Computed Output Data
- Flair models for sequence labeling:
/data/flair-models/
- Separate causality graphs:
data/causality-graphs/spotting/
├── clueweb12
│   └── clueweb-graph.json
└── wikipedia
    ├── infobox-graph.json
    ├── list-graph.json
    └── text-graph.json
Graph Construction
Required Input Data
data/causality-graphs/spotting/
├── clueweb12
│   └── clueweb-graph.json
└── wikipedia
    ├── infobox-graph.json
    ├── list-graph.json
    └── text-graph.json
Execution
Execute the following notebook:
notebooks/02-graph-construction/
└── 01-graph-construction.ipynb
Computed Output Data
data/causality-graphs/integration/
└── causenet-full.jsonl.bz2
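Conceptually, this step combines the four source-specific graphs into one file. The sketch below is a rough illustration under strong assumptions, not the notebook's logic: relations are modeled as plain `(cause, effect)` tuples, reconciliation is reduced to merging duplicates, and the output field names `cause`, `effect`, and `sources` are hypothetical.

```python
import bz2
import json
from collections import defaultdict


def merge_graphs(graphs_by_source):
    """Merge {source: [(cause, effect), ...]} into {(cause, effect): sorted sources}."""
    merged = defaultdict(set)
    for source, relations in graphs_by_source.items():
        for cause, effect in relations:
            merged[(cause, effect)].add(source)
    return {pair: sorted(sources) for pair, sources in merged.items()}


def write_jsonl_bz2(merged, path):
    """Write one JSON object per relation; the field names are illustrative only."""
    with bz2.open(path, mode="wt", encoding="utf-8") as handle:
        for (cause, effect), sources in sorted(merged.items()):
            record = {"cause": cause, "effect": effect, "sources": sources}
            handle.write(json.dumps(record) + "\n")
```

Keeping the list of contributing sources per relation makes it possible to see later which extraction (text, infobox, list, or ClueWeb12) supports each edge.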
Contact
For questions and feedback please contact:
Stefan Heindorf, Paderborn University
Yan Scholten, Technical University of Munich
Henning Wachsmuth, Paderborn University
Axel-Cyrille Ngonga Ngomo, Paderborn University
Martin Potthast, Leipzig University
License
The code by Stefan Heindorf, Yan Scholten, Henning Wachsmuth, Axel-Cyrille Ngonga Ngomo, and Martin Potthast is licensed under an MIT license.