KGist
KGist copied to clipboard
Knowledge Graph summarization for anomaly/error detection & completion (WebConf '20)
KGist: Knowledge Graph Summarization for Anomaly Detection & Completion
Caleb Belth, Xinyi Zheng, Jilles Vreeken, and Danai Koutra. What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization. ACM The Web Conference (WWW), April 2020. [Link to the paper]
If used, please cite:
@inproceedings{belth2020normal,
title={What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization},
author={Belth, Caleb and Zheng, Xinyi and Vreeken, Jilles and Koutra, Danai},
booktitle={Proceedings of The Web Conference 2020},
pages={1115--1126},
year={2020}
}
Presentation: https://youtu.be/Ql7VEfliPXo
Setup
git clone [email protected]:GemsLab/KGist.gitcd data/unzip nell.zipunzip dbpedia.zipcd ../src/cd test/python tester.py
Requirements
Python 3numpyscipynetworkx
Data
Nell and DBpedia are zipped in the data/ directory. Yago is too big to distribute via Github.
{KG_name}.txt format: space separated, one triple per line.
s1 p1 o1
s2 p2 o2
...
{KG_name}_labels.txt format: space separated, one entity per line followed by a variable number of labels, also space separated.
e1 l1 l2 ...
e2 l1 l2 l3 ...
...
Example usage (from src/ dir)
Command Line
python main.py --graph nell
Interface
from graph import Graph
from searcher import Searcher
from model import Model
# load graph
graph = Graph('nell', idify=True)
# create a Searcher object to search for a model (set of rules)
searcher = Searcher(graph)
# build initial model
model = searcher.build_model()
model.print_stats()
# perform rule merging refinement
model = model.merge_rules()
model.print_stats()
# perform rule nesting refinement
model = model.nest_rules()
model.print_stats()
To compute anomaly scores for triples as in Section 4.3:
from anomaly_detector import AnomalyDetector
# construct an anomaly detector with the KGist model
anomaly_detector = AnomalyDetector(model)
# an edge/triple to score
edge = ('concept:company:limited_brands', 'concept:companyceo', 'concept:ceo:leslie_wexner')
anomaly_detector.score_edge(edge)
>>> 26.5164
Larger numbers mean more anomalous. Note that in our experiments in Section 5.2, we used KGist+m, which would be the model without running model.nest_rules().
Arguments
--graph {KG_name} Expects {KG_name}.txt and {KG_name}_labels.txt to be in data/ directory in format as described above for NELL and DBpedia.
--rule_merging / -Rm True/False (Optional; Default = False) Use rule merging refinement (Section 4.2.2)
--rule_nesting / -Rn True/False (Optional; Default = False) Use rule nesting refinement (Section 4.2.2)
--idify / -i True/False (Optional; Default = True) Convert entities and predicates to integer ids internally for faster processing
--verbosity / -v [0, infinity) (Optional; Default = 1,000,000) How frequently to log progress (use integers)
--output_path / -o (Optional; Default = 'output/') What directory to write the output to (log will still be printed to stdout)
Output
output/{KG_name}_model.picklesaves a Model object.output/{KG_name}_model.rulessaves the rules, which are recursively defined, in parenthetical form.
Frequently Asked Questions (FAQ)
I want to run KGist on my own dataset. How did you construct the labels file?
We constructed the labels file by moving the rdf:type triples to the labels file. Thus, if, for example, there are triples (LaRose, rdf:type, book) and (LaRose, rdf:type, novel) in the KG, then LaRose book novel would be a row in the labels file.
Comments or Questions
Contact Caleb Belth with comments or questions: [email protected]