extraction-framework
The software used to extract structured data from Wikipedia
Consider this query: ``` select * { ?x dbp:shortName ?y } limit 10 ``` It returns, e.g., ``` dbr:Aluminium_Al-Mahdi_Hormozgan_VC "Aluminium"^^rdf:langString ``` That is not OK: a string becomes langString by virtue...
I updated the minidump with `./createMinidump.sh` and noticed the following: 1. irregular paths: URLs from URI list are downloaded into individual files into subfolders: ``` LANGUAGE wikidata.org/wiki/Q75135502 TARGET: ../resources/minidumps/wikidata.org/wiki/Q75135502/wiki.xml ```...
When trying to activate ur for DIEFweb the following error occurs, since only mapped languages are supported. `Exception in thread "main" java.util.NoSuchElementException: no mapping namespace for language ur`
Some loose notes on what to integrate there: LinkExtractor has a TODO buggy, because of: ``` /ˈʃoʊpənhaʊ.ər/ ``` produces `(;` in short/long abstracts #2 Johannes said that live was not...
It seems the current result will still contain some errors. We have to investigate later: 1. where the mistakes come from 2. or why Jena can't find them... ```...
Hi, not sure if intended, but looks like some properties in the [infobox-properties dataset](https://databus.dbpedia.org/marvin/generic/infobox-properties/2019.10.01) are quite long. And with long I mean very long ... Dataset: http://dbpedia-generic.tib.eu/release/generic/infobox-properties/2019.10.01/infobox-properties_lang=en.ttl.bz2 ``` bzcat infobox-properties_lang=en.ttl.bz2...
The following dbpedia files (and probably more) contain invalid literals https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.30/infobox-properties_lang%3den.ttl.bz2 https://downloads.dbpedia.org/repo/lts/generic/persondata/2019.08.30/persondata_lang%3den.ttl.bz2 with rdf:langString but without language tag. See: https://www.w3.org/TR/rdf11-concepts/#dfn-language-tagged-string All such files cannot be loaded using RDF4J as it...
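A minimal sketch of how such literals could be spotted before loading, assuming the dump is line-based (one triple per line) and that the datatype is written either as the full IRI or with the `rdf:` prefix. The function names `has_untagged_langstring` and `find_invalid_lines` are hypothetical and not part of the extraction framework. Since Turtle and N-Triples write language-tagged strings as `"..."@lang`, any literal with an explicit `^^rdf:langString` datatype has no language tag.

```python
import re

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# A quoted literal explicitly typed as rdf:langString (full IRI or rdf: prefix).
# Such a literal carries no language tag, which RDF 1.1 does not allow.
_UNTAGGED = re.compile(
    r'"(?:[^"\\]|\\.)*"\^\^(?:<' + re.escape(RDF_LANGSTRING) + r">|rdf:langString)"
)


def has_untagged_langstring(line: str) -> bool:
    """Return True if the line contains a literal typed rdf:langString."""
    return _UNTAGGED.search(line) is not None


def find_invalid_lines(lines):
    """Return the 1-based numbers of all lines with an untagged langString."""
    return [n for n, line in enumerate(lines, start=1) if has_untagged_langstring(line)]


if __name__ == "__main__":
    sample = [
        'dbr:Aluminium_Al-Mahdi_Hormozgan_VC dbp:shortName "Aluminium"^^rdf:langString .',
        'dbr:Aluminium_Al-Mahdi_Hormozgan_VC dbp:shortName "Aluminium"@en .',
    ]
    print(find_invalid_lines(sample))
```

For the compressed dumps, the lines could be streamed with `bz2.open(path, "rt", encoding="utf-8")` from the standard library instead of an in-memory list.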
http://live.dbpedia.org/page/Rowland_Flat,_South_Australia has invalid geocoordinates. This happens because: - Template:Infobox_Australian_place sets constant latDir (S) and longDir (E) hence those values are not specified in the wiki article - the mapping for...
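The effect of a template-level constant direction can be illustrated with a small sketch. It assumes coordinates arrive as degrees, minutes, and seconds, and that a default hemisphere can be supplied when the article itself gives none. The function `to_signed_decimal` and its parameters are hypothetical and do not reflect the framework's actual mapping code; the sample values are made up for illustration.

```python
def to_signed_decimal(degrees, minutes=0.0, seconds=0.0,
                      direction=None, default_direction=None):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    `direction` is the hemisphere given in the article (N, S, E or W).
    `default_direction` is a constant set by the template, used only
    when the article does not specify a direction.
    """
    hemisphere = (direction or default_direction or "").strip().upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError("no usable hemisphere for coordinate")
    value = abs(degrees) + minutes / 60.0 + seconds / 3600.0
    # South and west are negative in signed decimal notation.
    return -value if hemisphere in {"S", "W"} else value


if __name__ == "__main__":
    # Hypothetical values: the article gives no direction, the template implies S and E.
    print(to_signed_decimal(34, 30, 0, default_direction="S"))
    print(to_signed_decimal(138, 57, 0, default_direction="E"))
```

Without the default, a southern latitude would be read as northern, which is one way invalid coordinates of this kind can arise.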
There are cases in which the MappingExtractor cannot successfully extract a homepage for a resource, even if the infobox property is mapped to foaf:homepage. This happens because some Infoboxes require...
To avoid cases such as #119, I suggest we display better messages to the user. When we do not download a dump because of a download-complete file, say something like:...