wikipedia-to-elastic
Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual support)
Wikipedia to ElasticSearch
This project generates an ElasticSearch index or a file index from Wikipedia XML dumps. The process analyzes, extracts and stores Wikipedia article text along with several distinct Wikipedia attributes and relations (detailed below).
Project Features:
- Export Wikipedia in different languages {English, French, Spanish, German, Chinese}
- Export other Wikimedia resources: {Wikipedia, Wikinews, Wikidata}
- Support storing to either an Elastic index or file system (json files)
- Support the extraction of Wikipedia article text clean of markdown and html tags
- Integrated with Intel NLP Architect
- Used in research publication: WEC: Wikipedia Event Coreference
*Relations integrity tested only for English. Other languages might require some adjustments.
Introduction
Special Wikipedia Resources and Attributes
Three types of Wikipedia pages are used {Redirect/Disambiguation/Title} to extract six semantic features for tasks such as Identifying Semantic Relations, Entity Linking, Cross Document Co-Reference, Knowledge Graphs, Summarization and others.
- Redirect Links - See details at Wikipedia Redirect
- Disambiguation Links - See details at Wikipedia Disambiguation
- Category Links - See details at Wikipedia Category
- Link Title Parenthesis - See details in the paper "Extracting Lexical Reference Rules from Wikipedia"
- Infobox - See details at Wikipedia Infobox
- Term Frequency (TBD/WIP) - Holds a map of term frequencies for computing TF-IDF on Wikipedia
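To illustrate the Link Title Parenthesis feature, here is a minimal sketch of splitting a page title such as "Mercury (planet)" into its base title and its parenthesized qualifier. The regular expression and the function name are assumptions made for this example; they are not taken from the project's Java code, which may handle more cases.

```python
import re
from typing import Optional, Tuple

# Illustrative pattern (an assumption, not the project's own): a title that
# ends with exactly one parenthesized qualifier, e.g. "Mercury (planet)".
_TITLE_PAREN = re.compile(r"^(?P<base>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_title_parenthesis(title: str) -> Tuple[str, Optional[str]]:
    """Return (base title, qualifier), or (title, None) if no qualifier is found."""
    cleaned = title.strip()
    match = _TITLE_PAREN.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("base"), match.group("qualifier").strip()
```

A title without a trailing parenthesis is returned unchanged with `None` as the qualifier, so callers can treat both cases uniformly.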
Supported Relations Types
Listed below are the Wikidata properties that can extend the attributes above when running the Wikidata post-process described below.
Click relation for further details:
Prerequisites
- Java 11
- Wikipedia xml.bz2 dump file in required language (For example latest en XML dump)
- Optional: ElasticSearch 7.17.4 (needed when exporting to an Elastic index)
  - Recommended: Set up Elastic using docker (docker/README.md)
  - Alternative:
    - Install Elastic from the official Elasticsearch site
    - Install plugins: analysis-icu, analysis-smartcn (guide)
- Optional: Wikidata json.bz2 dump file (latest JSON dump)
Configuration
Main Configuration File
conf.json is the main process configuration file:
- exportMethod - Whether to export to an Elastic index (set to elastic) or to json files (set to json_files)
- extractRelationFields - When set to true, extracts the relation fields (listed in relationTypes) while processing the data (supported only for English Wikipedia)
- wikipediaDump - Location of the downloaded Wikipedia .bz2 dump file
- lang - Supported values: {en (English), fr (French), es (Spanish), de (German), zh (Chinese)}
- includeRawText - When set to true, includes the original Wikipedia page text (including html and markdown), parsed and cleaned as much as possible
- includeParsedParagraphs - When set to true, includes a list of parsed Wikipedia article paragraphs, clean of any markdown or html tags
- relationTypes - ["Category", "Infobox", "Parenthesis", "PartName"]. To export these relations, extractRelationFields needs to be set to true (the full list of available relations is in /src/main/java/wiki/data/relations/RelationType.java)
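As a rough illustration of how these keys fit together, the sketch below builds a conf.json-style dictionary in Python and validates the two enumerated values. The key names come from the list above; the helper name, the dump path and the choice of JSON booleans for the true/false settings are assumptions, so compare the output against the conf.json shipped with the project before using it.

```python
import json

SUPPORTED_LANGS = {"en", "fr", "es", "de", "zh"}
EXPORT_METHODS = {"elastic", "json_files"}


def build_main_conf(dump_path: str, lang: str = "en",
                    export_method: str = "elastic") -> dict:
    """Build a conf.json-style dict (hypothetical helper, not part of the project)."""
    if export_method not in EXPORT_METHODS:
        raise ValueError(f"exportMethod must be one of {sorted(EXPORT_METHODS)}")
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"lang must be one of {sorted(SUPPORTED_LANGS)}")
    return {
        "exportMethod": export_method,
        # Relation extraction is documented as supported for English only.
        "extractRelationFields": lang == "en",
        "wikipediaDump": dump_path,
        "lang": lang,
        "includeRawText": True,
        "includeParsedParagraphs": False,
        "relationTypes": ["Category", "Infobox", "Parenthesis", "PartName"],
    }


if __name__ == "__main__":
    # The dump path is a placeholder; point it at your own downloaded file.
    conf = build_main_conf("dumps/enwiki-latest-pages-articles.xml.bz2")
    print(json.dumps(conf, indent=2))
```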
Json Export Configuration File
config/json_file_conf.json is needed only when exportMethod is set to json_files:
- outIndexDirectory - The folder where the exported files are saved
- pagesPerFile - How many pages to save per file (100,000 pages ~ 0.5 GB)
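The "100,000 pages ~ 0.5 GB" figure can be turned into a quick back-of-the-envelope estimate when choosing pagesPerFile. The sketch below assumes that size scales linearly with page count, which is only an approximation since page lengths vary; the function name and the example page count are hypothetical.

```python
import math

# From the figure above: roughly 0.5 GB per 100,000 pages.
REFERENCE_PAGES = 100_000
REFERENCE_SIZE_GB = 0.5


def estimate_json_export(total_pages: int, pages_per_file: int) -> tuple:
    """Return (file count, approx. GB per full file, approx. total GB)."""
    if total_pages < 0 or pages_per_file <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_file > 0")
    file_count = math.ceil(total_pages / pages_per_file)
    gb_per_full_file = pages_per_file * REFERENCE_SIZE_GB / REFERENCE_PAGES
    total_gb = total_pages * REFERENCE_SIZE_GB / REFERENCE_PAGES
    return file_count, gb_per_full_file, total_gb
```

For example, a hypothetical dump of 250,000 pages with pagesPerFile set to 100,000 yields three files, the last one only half full.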
Elastic Configuration Files
Main Elastic Configuration File
config/elastic_conf.json is needed only when exportMethod is set to elastic:
- indexName - Set your desired Elasticsearch index name
- docType - Set your desired Elasticsearch document type
- insertBulkSize - Number of pages to bulk insert into Elasticsearch in each iteration (1000 was found to give the best performance)
- mapping - Elastic mapping file, should point to src/main/resources/mapping.json
- setting - Elastic settings file, currently supports {en, fr, es, de, zh}
- host - Elastic host
- port - Elastic port
- scheme - Elastic host scheme (default: http)
- shards - Number of Elastic shards
- replicas - Number of Elastic replicas
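The following sketch assembles an elastic_conf.json-style dictionary and derives the base URL that the scheme, host and port settings describe. The key names are taken from the list above. The values are assumptions: enwiki_v3 and wikipage are borrowed from the examples elsewhere in this README, the shard and replica counts are placeholders, and the exact way the mapping and setting files are referenced (bare file name versus path) should be checked against the shipped configuration.

```python
def build_elastic_conf(index_name: str = "enwiki_v3", lang: str = "en",
                       host: str = "localhost", port: int = 9200) -> dict:
    """Build an elastic_conf.json-style dict (hypothetical helper)."""
    return {
        "indexName": index_name,
        "docType": "wikipage",
        "insertBulkSize": 1000,  # value reported above to perform best
        "mapping": "mapping.json",  # assumed to resolve under src/main/resources
        "setting": f"{lang}_map_settings.json",  # one settings file per language
        "host": host,
        "port": port,
        "scheme": "http",
        "shards": 1,    # placeholder value
        "replicas": 0,  # placeholder value
    }


def elastic_base_url(conf: dict) -> str:
    """Combine scheme, host and port into the URL the exporter would connect to."""
    return f'{conf["scheme"]}://{conf["host"]}:{conf["port"]}'
```

With the defaults, `elastic_base_url(build_elastic_conf())` gives the local address used in the query examples of this README.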
Elastic Mapping File
src/main/resources/mapping.json - Elastic wiki index mapping (should probably stay unchanged)
Elastic Index Files
- src/main/resources/{en,es,fr,de,zh}_map_settings.json - Elastic index settings (should probably stay unchanged)
- src/main/resources/lang/{en,es,fr,de,zh}.json - Language-specific configuration for relation keyword translations
- src/main/resources/stop_words/{en,es,fr,de,zh}.txt - Language-specific stop-words list
Build Run and Test
- Make sure the Elastic process is running and active on your host (if running Elastic locally, the address is http://localhost:9200/)
- Checkout/Clone the repository
- From the command line, navigate to the project root directory and run:
  ./gradlew clean build -x test
  You should get a message such as: BUILD SUCCESSFUL in 7s
- Extract the build zip file created at build/distributions/WikipediaToElastic-1.0.zip
- Put the wiki xml.bz2 dump file (no need to extract the bz2 file!) in the dumps folder
  Recommendation: Start with a small wiki dump and make sure you like what you get (or modify the configuration to meet your needs) before moving to a full-blown 15GB dump export.
- Make sure the conf.json configuration is set as expected
- Make sure the config folder configurations are set as expected
- Run the process from the command line:
  java -Xmx6000m -DentityExpansionLimit=2147480000 -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 -jar build/distributions/WikipediaToElastic-1.0/WikipediaToElastic-1.0.jar
- To test/query, you can run from the terminal:
  curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'
- This should return the Wikipedia page on Alan Turing
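The same test query can be issued from a script. The sketch below mirrors the curl command using only the Python standard library; the function names are hypothetical, and it sends the body with POST rather than GET, which the Elasticsearch _search endpoint also accepts. It assumes the index is named enwiki_v3 and that Elastic is reachable without authentication.

```python
import json
import urllib.request


def build_title_search(title: str, size: int = 5) -> dict:
    """Query body equivalent to the curl example: phrase match on title.near_match."""
    return {"size": size, "query": {"match_phrase": {"title.near_match": title}}}


def build_search_request(base_url: str, index: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request for <base_url>/<index>/_search."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, index: str, body: dict, timeout: float = 10.0) -> list:
    """Send the query and return the list of hits from the response."""
    request = build_search_request(base_url, index, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["hits"]["hits"]
```

Calling `search("http://localhost:9200", "enwiki_v3", build_title_search("Alan Turing"))` against a populated index should return hits whose `_source.title` can then be inspected.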
Integrating Wikidata Attributes
Running this process requires a Wikipedia index (generated by the process above).
Wikidata Main Configuration File (config/wikidata_conf.json)
Main configuration file for the Wikidata export process. It is currently supported only when Wikipedia was exported to an Elasticsearch index.
- indexName - Elasticsearch index to enhance with Wikidata attributes
- docType - Set your desired document type
- insertBulkSize - Number of pages to bulk insert into Elasticsearch in each iteration
- host - Elastic host
- port - Elastic port
- wikidataDump - Location of the downloaded Wikidata .bz2 dump file
- scheme - Elastic host scheme
- lang - Should match the language of the Wikipedia index
Wikidata Running and Testing
- Make sure the Elastic process is running and active on your host (if running Elastic locally, the address is http://localhost:9200/)
- Make sure the wikidata_conf.json configuration is set as expected
- Run the process from the command line:
  java -cp WikipediaToElastic-1.0.jar wiki.wikidata.WikiDataFeatToFile
  The process reads the full Wikidata dump, parses it, extracts the relations and merges them into the corresponding Wikipedia pages in the search index. It might take a while to finish.
- To test/query, you can run from the terminal:
  curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'
  This should return the Wikipedia page on Alan Turing, including the new Wikidata relations.
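One way to confirm that the Wikidata step added something is to look at the relation fields of a returned hit. The sketch below collects the non-empty Wikidata relation lists from a single search hit; the field names are those listed in the Fields & Attributes table of this README, while the helper name and the trimmed sample hit are made up for the example.

```python
from typing import Dict, List

# Relation fields marked as Wikidata relations in the Fields & Attributes table.
WIKIDATA_RELATION_FIELDS = (
    "aliases", "partOf", "hasPart", "hasEffect", "hasCause", "hasImmediateCause",
)


def wikidata_relations(hit: dict) -> Dict[str, List[str]]:
    """Return only the Wikidata relation fields of a hit that hold at least one value."""
    relations = hit.get("_source", {}).get("relations", {})
    return {
        name: list(relations[name])
        for name in WIKIDATA_RELATION_FIELDS
        if relations.get(name)
    }


# Trimmed-down hit, shaped like the page examples in this README.
sample_hit = {
    "_id": "40573",
    "_source": {
        "title": "NLP",
        "relations": {"aliases": ["LmxM36.1060"], "partOf": [], "hasPart": []},
    },
}
found = wikidata_relations(sample_hit)
```

An empty result for a page that has a Wikidata entry would suggest the Wikidata step did not reach that page.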
Usage
Elastic Page Query
Once the process is complete, two main query options are available (for more details and further title query options, see mapping.json):
- title.plain - fuzzy search (sorted)
- title.keyword - exact match
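The two options map naturally onto two kinds of Elasticsearch query bodies. The sketch below is an assumption about how they might be queried, not something stated by the project: it uses a `term` query for the exact title.keyword sub-field and a `match` query, optionally with the standard `fuzziness` parameter, for title.plain. Which clause actually works best depends on the analyzers defined in mapping.json.

```python
from typing import Optional


def exact_title_query(title: str, size: int = 5) -> dict:
    """Exact match: the title must equal the stored keyword value."""
    return {"size": size, "query": {"term": {"title.keyword": title}}}


def fuzzy_title_query(title: str, size: int = 5,
                      fuzziness: Optional[str] = None) -> dict:
    """Analyzed match on title.plain; results come back sorted by relevance score."""
    clause = {"query": title}
    if fuzziness is not None:
        # e.g. "AUTO"; tolerates small spelling differences in the query text
        clause["fuzziness"] = fuzziness
    return {"size": size, "query": {"match": {"title.plain": clause}}}
```

Either body can be sent to the `_search` endpoint of the index in place of the `match_phrase` body used in the curl examples.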
Generated Elastic Page Example
Pages are created with the following structure (see "Fields & Attributes" below for more details):
Page Example (Extracted from Wikipedia disambiguation page):
{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "40573",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "NLP",
    "text": "{{wiktionary|NLP}}\n\n'''NLP''' may refer to:\n\n; .....",
    "relations": {
      "isPartName": false,
      "isDisambiguation": true,
      "disambiguationLinks": [
        "Natural language programming",
        "New Labour",
        "National Library of the Philippines",
        "Neuro linguistic programming",
        "Natural language processing",
        "National Liberal Party",
        "Natural Law Party",
        "National Labour Party",
        "Normal link pulses",
        "New Labour Party"
      ],
      "categories": [
        "disambiguation"
      ],
      "infobox": "",
      "titleParenthesis": [],
      "partOf": [],
      "aliases": [
        "LmxM36.1060"
      ],
      "hasPart": [],
      "hasEffect": [],
      "hasCause": [],
      "hasImmediateCause": []
    }
  }
}
Page Example (Extracted from Wikipedia redirect page):
{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "2577248",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "Nlp",
    "text": "#REDIRECT",
    "redirectTitle": "NLP",
    "relations": {
      "isPartName": false,
      "isDisambiguation": false
    }
  }
}
Fields & Attributes
JSON field | Type | Description |
---|---|---|
_id | Text | Wikipedia page id |
_source.title | Text | Wikipedia page title |
_source.text | Text | Wikipedia page text |
_source.parsedParagraphs | List (optional) | Wikipedia article text, clean of html/markdown, split into paragraphs |
_source.redirectTitle | Text (optional) | Wikipedia page redirect title |
_source.relations.infobox | Text (optional) | The article infobox element |
_source.relations.categories | List (optional) | Categories relation list |
_source.relations.isDisambiguation | Bool (optional) | Whether the page is a Wikipedia disambiguation page |
_source.relations.isPartName | Bool (optional) | Whether the page is a Wikipedia name description page |
_source.relations.titleParenthesis | List (optional) | List of disambiguation secondary links |
_source.relations.aliases | List (optional) | Wikidata Rel |
_source.relations.partOf | List (optional) | Wikidata Rel |
_source.relations.hasPart | List (optional) | Wikidata Rel |
_source.relations.hasEffect | List (optional) | Wikidata Rel |
_source.relations.hasCause | List (optional) | Wikidata Rel |
_source.relations.hasImmediateCause | List (optional) | Wikidata Rel |
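
To show how the fields in the table might be consumed, the sketch below parses a search hit into a small Python dataclass and classifies it as one of the three page types mentioned earlier (redirect, disambiguation or title). The class and function names are hypothetical, only a subset of the fields is mapped, and optional fields fall back to defaults when they are missing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WikiPage:
    page_id: str
    title: str
    text: str
    redirect_title: Optional[str] = None
    is_disambiguation: bool = False
    categories: List[str] = field(default_factory=list)

    @property
    def page_type(self) -> str:
        """Classify the page as 'redirect', 'disambiguation' or 'title'."""
        if self.redirect_title is not None:
            return "redirect"
        if self.is_disambiguation:
            return "disambiguation"
        return "title"


def page_from_hit(hit: dict) -> WikiPage:
    """Map an Elasticsearch hit (as in the page examples) onto a WikiPage."""
    source = hit["_source"]
    relations = source.get("relations", {})
    return WikiPage(
        page_id=hit["_id"],
        title=source["title"],
        text=source.get("text", ""),
        redirect_title=source.get("redirectTitle"),
        is_disambiguation=bool(relations.get("isDisambiguation", False)),
        categories=list(relations.get("categories", [])),
    )
```

For a redirect page, `redirect_title` holds the title to look up next, so a caller can follow it with a second exact-title query.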