GSoC
Run Extraction Framework
Effort
1-2 days
Skills
basic Maven, following a README
Description
The DBpedia extraction framework can download a set of Wikipedia XML dumps and extract facts. There is a configuration file where you specify the language(s) you want; then you just run it. Set up your download & extract configuration files and run a simple dump-based extraction.
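A minimal sketch of what that workflow looks like, assuming a Unix shell, Maven, Java 8, and that the commands are run from the dump module as described in the README (the config file names are the ones mentioned in the comments below; the keys inside them may differ between checkouts):

    # build the framework, then run download and extraction from the dump module
    git clone https://github.com/dbpedia/extraction-framework.git
    cd extraction-framework
    mvn clean install
    cd dump
    ../run download download.minimal.properties        # fetch the configured Wikipedia dumps
    ../run extraction extraction.default.properties    # extract facts from the downloaded dumps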
Impact
Get to know the way the extraction framework works.
Hi. This is in reference to issue #24 . I downloaded the project and ran a dump-based extraction. Everything went well; I just faced a Java issue before the extraction (I had to make sure to use Java 1.8, and used jenv for this). However, I had to stop the ../run download download.10000.properties
command at
date page 'https://dumps.wikimedia.org/wikidatawiki/20190101/' has all files [pages-articles-multistream.xml.bz2]
downloading 'https://dumps.wikimedia.org/wikidatawiki/20190101/wikidatawiki-20190101-pages-articles-multistream.xml.bz2' to '/Users/anubhavujjawal/Desktop/data/extraction-data/2018-10/wikidatawiki/20190101/wikidatawiki-20190101-pages-articles-multistream.xml.bz2' read 28.0153 MB of 58.74201 GB in 01:52 min
since I didn't have the bandwidth and disk space (I use a MacBook Air 128 GB model) to complete it. After that, I ran ../run extraction extraction.default.properties
and it ran well. Have I messed anything up?
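For reference, the Java pinning step was roughly the following (a sketch, assuming jenv is installed and a JDK 8 is registered with it; the JDK path and the version alias are examples and will differ on other machines):

    # example macOS JDK path; adjust to wherever your JDK 8 lives
    jenv add /Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home
    jenv local 1.8        # pin Java 1.8 for this directory
    java -version         # should report 1.8.x before building and running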
Hi everyone and @mgns . I tried the dump-based extraction instructions here with both the download.10000.properties and download.minimal.properties download config files, and got the error "Caused by: java.lang.IllegalArgumentException: Base directory does not exist yet: \data\extraction-data\2018-10" for both.
I tried creating the directories from the root as /data/extraction-data/2018-10, but I still got the error.
Is there any solution to this? Thank you very much.
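One thing worth checking, hedged as a guess from the backslashes in the error message: the path appears to be resolved as a relative Windows path, so the base-dir property in the download config (key name assumed here from the error context) may need to be an absolute path to a directory that already exists before the run, e.g.:

    # create the directory first (Unix-like example; on Windows use an equivalent
    # absolute path such as C:\data\extraction-data\2018-10)
    mkdir -p /data/extraction-data/2018-10
    # then point the download config at it (assumed key name):
    # base-dir=/data/extraction-data/2018-10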