extraction-framework

abstract generation

Open • kurzum opened this issue 3 years ago • 3 comments

Three extractors: .nifExtractor .abstractExtractor .abstractExtractorWikipedia

  • all three produce mostly the same output, except for some pages, e.g. Joe Biden
  • unclear whether they use the Wikipedia dump or the MediaWiki API
  • overall many quality issues, e.g. Berlin (; German: [bɛʁˈliːn] ())
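
Quality issues like the Berlin example above could be flagged automatically. The following is only a rough sketch in Python using the standard library; the pattern names and the function `find_artifacts` are hypothetical and not part of the extraction framework. It looks for typical leftovers of stripped templates: empty parentheses, a parenthesis that opens directly onto a separator, and a stray space after an opening parenthesis.

```python
import re

# Hypothetical heuristics for leftovers of stripped templates in an abstract.
ARTIFACT_PATTERNS = {
    "empty_parentheses": re.compile(r"\(\s*\)"),        # "()"
    "dangling_separator": re.compile(r"\(\s*[;,]"),     # "(;" or "(,"
    "space_after_open_paren": re.compile(r"\(\s+\S"),   # "( BY-dən"
}

def find_artifacts(abstract):
    """Return the names of all heuristics that match the abstract."""
    return [name for name, pattern in ARTIFACT_PATTERNS.items()
            if pattern.search(abstract)]

if __name__ == "__main__":
    for text in [
        "Berlin (; German: [bɛʁˈliːn] ())",
        "Joseph Robinette Biden Jr. ( BY-dən; born November 20, 1942)",
    ]:
        print(find_artifacts(text), text)
```

Running such a check over the abstracts of all three extractors would give a rough count of affected pages, though the heuristics will miss other kinds of damage.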

kurzum, Apr 21 '21 09:04

Added Joe Biden to Minidump.

  • The only difference is that data extracted with AbstractExtractor contains some additional data, such as <normalized><n from=\"Joe_Biden\" to=\"Joe Biden\" /></normalized>.
  • Besides that, all three extractors produce:
Joseph Robinette Biden Jr. ( BY-dən; born November 20, 1942) is an American politician who is the 46th president of the United States.

The missing text is related to the wiki page version (the one used to produce the 2020.07 text):

  • Current Joe Biden wikitext page: https://en.wikipedia.org/w/index.php?title=Joe_Biden&action=edit
  • 31st of July Joe Biden wikitext page: https://en.wikipedia.org/w/index.php?title=Joe_Biden&action=edit&oldid=970531146 (edited)

Discovered that the wikitext uses two templates (the IPAc-en template and the respell template):

  • The IPAc-en template is ignored
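
To check which templates a lead sentence uses, a small scan over the wikitext is enough. This is a minimal sketch, not the framework's parser: the regular expression only handles non-nested templates, `find_templates` is a hypothetical helper, and the sample wikitext is an assumption reconstructed for illustration rather than copied from the Joe Biden page.

```python
import re

# Matches a non-nested template call such as {{name|arg1|arg2}}.
TEMPLATE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\|([^{}]*))?\}\}")

def find_templates(wikitext):
    """Return (name, [args]) for each non-nested template in the wikitext."""
    found = []
    for match in TEMPLATE.finditer(wikitext):
        name = match.group(1)
        raw_args = match.group(2)
        args = raw_args.split("|") if raw_args is not None else []
        found.append((name, args))
    return found

# Assumed wikitext for illustration only.
SAMPLE_WIKITEXT = (
    "'''Joseph Robinette Biden Jr.''' "
    "({{IPAc-en|ˈ|b|aɪ|d|ən}} {{respell|BY|dən}}; born November 20, 1942)"
)

if __name__ == "__main__":
    for name, args in find_templates(SAMPLE_WIKITEXT):
        print(name, args)
```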

Vehnem, Apr 23 '21 10:04

I think I found a way to include the data from the IPAc-en template. First, let me explain a few points about NifExtractor:

As I understand it, NifExtractor transforms the XML wikitext into HTML, so we get markup with tags and CSS. Several transformation methods then clean up the HTML document. One of them is getJsoupDoc in HtmlNifExtractor, which removes certain elements from the HTML. The elements to remove are described in the nifextractionconfig.json file, in a list called nif-remove-elements. This list names the HTML elements (by class or similar) that must be removed from the HTML wiki page, and one of those classes is noexcerpt. The noexcerpt class is used for IPA templates in Wikipedia's HTML pages. Here is an example of how it looks:

<p><b>Joseph Robinette Biden Jr.</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="&#39;b&#39; in &#39;buy&#39;">b</span><span title="/aɪ/: &#39;i&#39; in &#39;tide&#39;">aɪ</span><span title="&#39;d&#39; in &#39;dye&#39;">d</span><span title="/ən/: &#39;on&#39; in &#39;button&#39;">ən</span></span>/</a></span></span> <a href="/wiki/Help:Pronunciation_respelling_key" title="Help:Pronunciation respelling key"><i title="English pronunciation respelling"><span style="font-size:90%">BY</span>-dən</i></a>; born November 20, 1942) 

As I understand it, this is the line of code where it is removed: https://github.com/dbpedia/extraction-framework/blob/3a96d901d873bd144456db51e1170685cd16afff/core/src/main/scala/org/dbpedia/extraction/nif/HtmlNifExtractor.scala#L347

The link to the list of elements that are removed: https://github.com/dbpedia/extraction-framework/blob/3a96d901d873bd144456db51e1170685cd16afff/core/src/main/resources/nifextractionconfig.json#L14
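
The effect of that removal step can be imitated outside the framework. The sketch below is an approximation in Python with the standard library's `html.parser`, not the Jsoup-based code in HtmlNifExtractor; `ClassStrippingTextExtractor` and `extract_text` are hypothetical names, and the sample HTML is a shortened version of the snippet above. It assumes well-formed markup in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassStrippingTextExtractor(HTMLParser):
    """Collect text, skipping every element that carries a removed class."""

    def __init__(self, removed_classes):
        super().__init__(convert_charrefs=True)
        self.removed_classes = set(removed_classes)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & self.removed_classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def extract_text(html, removed_classes):
    parser = ClassStrippingTextExtractor(removed_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

# Shortened version of the Joe Biden snippet from this thread.
SAMPLE_HTML = (
    '<p><b>Joseph Robinette Biden Jr.</b> ('
    '<span class="rt-commentedText nowrap">'
    '<span class="IPA nopopups noexcerpt">'
    '<a href="/wiki/Help:IPA/English">/ˈbaɪdən/</a></span></span> '
    '<i>BY-dən</i>; born November 20, 1942)</p>'
)

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML, {"noexcerpt"}))  # IPA span dropped
    print(extract_text(SAMPLE_HTML, set()))          # IPA span kept
```

With `noexcerpt` in the removed set, the text keeps the space that followed the IPA span, which would explain the "( BY-dən" gap in the current abstracts.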

So I tried extracting the data without removing the noexcerpt elements, and now the extracted triples include the transcriptions from the IPA templates (/bɜːrˈlɪn/, /ˈbaɪdən/):

<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/abstract> "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] ()) is the capital and largest

<http://dbpedia.org/resource/Joe_Biden> <http://dbpedia.org/ontology/abstract> "Joseph Robinette Biden Jr. (/ˈbaɪdən/ BY-dən; born November 20, 1942) 

At the moment, though, I don't know whether the extracted data now includes unwanted content, because removing noexcerpt from the nif-remove-elements list may have other consequences.
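
One way to estimate those consequences would be to list everything in a rendered page that carries the noexcerpt class, grouped by the element's other classes. The following is a hedged sketch in Python with the standard library, not part of the framework; `ClassAudit` and `audit` are hypothetical names, and the second span in the sample (class `hypothetical-note`) is invented purely to show a non-IPA hit.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassAudit(HTMLParser):
    """Record the text of each outermost element that carries target_class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0           # nesting depth inside the current match
        self.current_key = None  # sorted class list of the current match
        self.buffer = []
        self.hits = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.current_key = " ".join(sorted(classes))
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.hits[self.current_key].append("".join(self.buffer))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

def audit(html, target_class):
    parser = ClassAudit(target_class)
    parser.feed(html)
    parser.close()
    return dict(parser.hits)

# The second span is a made-up example of a non-IPA noexcerpt element.
SAMPLE_PAGE = (
    '<p>Berlin (<span class="IPA nopopups noexcerpt">/bɜːrˈlɪn/</span>) '
    '<span class="noexcerpt hypothetical-note">[note]</span> is the capital.</p>'
)

if __name__ == "__main__":
    for classes, texts in audit(SAMPLE_PAGE, "noexcerpt").items():
        print(classes, texts)
```

If such an audit over a sample of pages showed only IPA spans, dropping noexcerpt from the list would look safe; any other class combinations would point to content that might need a narrower rule.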

jlareck, Apr 24 '21 22:04

Open todos:

  • [ ] changes need to be applied to marvin-config: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config in particular:
  • [ ] -> the actual config
  • [ ] -> documentation update for the artifacts, see the md files here: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config/-/tree/master/databus-poms/dbpedia/text
  • [x] tests need to link to this issue

kurzum, Nov 16 '21 11:11