extraction-framework
abstract generation
Three extractors: .nifExtractor .abstractExtractor .abstractExtractorWikipedia
- all three produce mostly the same output, with some exceptions, e.g. Joe Biden
- unclear whether they use the wikidump or mediawiki api
- overall many quality issues
Berlin (; German: [bɛʁˈliːn] ())
Added Joe Biden to Minidump.
- The only difference is that data extracted with AbstractExtractor contains some additional data such as `<normalized><n from=\"Joe_Biden\" to=\"Joe Biden\" /></normalized>`.
- Besides that, all three extractors produce:
Joseph Robinette Biden Jr. ( BY-dən; born November 20, 1942) is an American politician who is the 46th president of the United States.
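The `<normalized>` element above resembles the title-normalization block that a MediaWiki API query can return in XML format, which might be a hint for the open question of dump versus API. This is only an assumption, not something confirmed from the extractor code. A minimal Python sketch of reading such a fragment (the function name `title_normalizations` is hypothetical, and the quotes are shown unescaped here):

```python
import xml.etree.ElementTree as ET

# The fragment from the AbstractExtractor output, with the \" escapes removed.
FRAGMENT = '<normalized><n from="Joe_Biden" to="Joe Biden" /></normalized>'


def title_normalizations(xml_fragment):
    """Map each requested title to its normalized form (hypothetical helper)."""
    root = ET.fromstring(xml_fragment)
    return {n.get("from"): n.get("to") for n in root.iter("n")}


print(title_normalizations(FRAGMENT))
```

If this element ends up inside the abstract literal, it would probably need to be stripped before the text is published.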
The missing text depends on the wiki page revision (the one used to produce the 2020.07 text):
- Current Joe Biden wikitext page: https://en.wikipedia.org/w/index.php?title=Joe_Biden&action=edit
- 31st of July Joe Biden wikitext page: https://en.wikipedia.org/w/index.php?title=Joe_Biden&action=edit&oldid=970531146
Discovered that the wikitext uses two templates (the `IPAc-en` template and the `respell` template).
- The `IPAc-en` template is ignored.
So, I think I found a way to include data from the `IPAc-en` template. But first, let me explain a few points about `NifExtractor`:
As I understand it, `NifExtractor` transforms the XML wikitext into HTML, so at that point we have markup with tags and CSS. Several transformation methods then clean the HTML document. One of them is `getJsoupDoc` in `HtmlNifExtractor`, which removes certain items from the HTML. The items to remove are listed in the `nifextractionconfig.json` file, in the `nif-remove-elements` list. This list names the HTML elements (classes or similar selectors) that must be removed from the HTML wiki page document, and one of those classes is `noexcerpt`. The `noexcerpt` class is used for IPA templates in Wikipedia's HTML pages. Here is an example of how it looks:
<p><b>Joseph Robinette Biden Jr.</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="'b' in 'buy'">b</span><span title="/aɪ/: 'i' in 'tide'">aɪ</span><span title="'d' in 'dye'">d</span><span title="/ən/: 'on' in 'button'">ən</span></span>/</a></span></span> <a href="/wiki/Help:Pronunciation_respelling_key" title="Help:Pronunciation respelling key"><i title="English pronunciation respelling"><span style="font-size:90%">BY</span>-dən</i></a>; born November 20, 1942)
As far as I understand, this is the line of code where it is removed: https://github.com/dbpedia/extraction-framework/blob/3a96d901d873bd144456db51e1170685cd16afff/core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala#L347
The link to the list of elements that are removed: https://github.com/dbpedia/extraction-framework/blob/3a96d901d873bd144456db51e1170685cd16afff/core/src/main/resources/nifextractionconfig.json#L14
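To illustrate the effect of dropping `noexcerpt` elements, here is a minimal sketch in Python using only the standard library. It is not the framework's Scala/Jsoup code; the class `ClassStrippingTextExtractor`, the function `extract_text`, and the shortened `SAMPLE` markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not affect the nesting counter.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassStrippingTextExtractor(HTMLParser):
    """Collects text, skipping every element that carries a removed class."""

    def __init__(self, remove_classes):
        super().__init__(convert_charrefs=True)
        self.remove_classes = set(remove_classes)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a removed one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.remove_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html, remove_classes):
    parser = ClassStrippingTextExtractor(remove_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened version of the Joe Biden markup shown above.
SAMPLE = ('<p><b>Joseph Robinette Biden Jr.</b> ('
          '<span class="IPA nopopups noexcerpt">'
          '<a href="/wiki/Help:IPA/English">/ˈbaɪdən/</a></span> '
          '<i>BY-dən</i>; born November 20, 1942)</p>')

print(extract_text(SAMPLE, {"noexcerpt"}))  # IPA span dropped
print(extract_text(SAMPLE, set()))          # IPA span kept
```

With `noexcerpt` in the removal set, the text keeps the opening parenthesis and the space that followed the span, which would explain the `( BY-dən` gap in the extracted abstract.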
So, I tried extracting the data without removing `noexcerpt` elements, and the extracted triples now include the transcriptions from the IPA templates (`/bɜːrˈlɪn/`, `/ˈbaɪdən/`):
<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/abstract> "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] ()) is the capital and largest
<http://dbpedia.org/resource/Joe_Biden> <http://dbpedia.org/ontology/abstract> "Joseph Robinette Biden Jr. (/ˈbaɪdən/ BY-dən; born November 20, 1942)
But at the moment, I don't know whether the extracted data now includes unwanted content, because removing `noexcerpt` from the `nif-remove-elements` list may have other consequences as well.
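One way to look for such side effects might be to compare the abstracts from a run with `noexcerpt` removal against a run without it, and review only the subjects whose abstract changed. The sketch below is a simplified Python illustration, not part of the framework: the regular expression handles only plain quoted literals, the helper names are hypothetical, and the sample lines are shortened, made-up stand-ins for real dump lines.

```python
import re

ABSTRACT = "http://dbpedia.org/ontology/abstract"
# Simplified N-Triples pattern: <subject> <predicate> "literal" ...
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"')


def abstracts(lines):
    """Return {subject: abstract text} for all abstract triples."""
    result = {}
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == ABSTRACT:
            result[match.group(1)] = match.group(3)
    return result


def changed_abstracts(before_lines, after_lines):
    """Return {subject: (before, after)} where the abstract differs."""
    before, after = abstracts(before_lines), abstracts(after_lines)
    return {s: (before[s], after[s])
            for s in before.keys() & after.keys()
            if before[s] != after[s]}


# Illustrative, shortened lines; the Example resource is hypothetical.
BEFORE = [
    '<http://dbpedia.org/resource/Joe_Biden> <http://dbpedia.org/ontology/abstract> '
    '"Joseph Robinette Biden Jr. ( BY-dən; born November 20, 1942)"@en .',
    '<http://dbpedia.org/resource/Example> <http://dbpedia.org/ontology/abstract> '
    '"Unchanged text."@en .',
]
AFTER = [
    '<http://dbpedia.org/resource/Joe_Biden> <http://dbpedia.org/ontology/abstract> '
    '"Joseph Robinette Biden Jr. (/ˈbaɪdən/ BY-dən; born November 20, 1942)"@en .',
    '<http://dbpedia.org/resource/Example> <http://dbpedia.org/ontology/abstract> '
    '"Unchanged text."@en .',
]

changed = changed_abstracts(BEFORE, AFTER)
for subject, (old, new) in sorted(changed.items()):
    print(subject)
    print("  before:", old)
    print("  after: ", new)
```

Running this over the minidump output of both configurations would give a short list of pages to inspect by hand for content other than IPA transcriptions.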
Open todos:
- [ ] changes need to be updated in marvin-config: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config in particular:
- [ ] -> the actual config
- [ ] -> docu update on the artifacts, see md files here: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/tree/master/databus-poms/dbpedia/text
- [x] tests need to link to this issue