extraction-framework

don't eat space after link

Open VladimirAlexiev opened this issue 10 years ago • 5 comments

A minor problem with text extraction: space after a link is eaten up.

For example, https://bg.wikipedia.org/w/index.php?title=Джон_Кенеди&action=edit includes:

| description = [[Ich bin ein Berliner|Речта]] от Ратхаус Шьонеберг на Джон Кенеди, 26 юни 1963. Продължителност 9:01.

This is extracted (http://mappings.dbpedia.org/server/extraction/bg/extract?title=Джон+Кенеди&revid=&format=turtle-triples&extractors=custom) as:

description "Речтаот Ратхаус Шьонеберг на Джон Кенеди, 26 юни 1963. Продължителност 9:01."@bg .

VladimirAlexiev Feb 16 '15 09:02

Hi, if I understand correctly, all that needs to be done for the example in the summary is to output the following, with the space before "от" preserved?

description "Речта от Ратхаус Шьонеберг на Джон Кенеди, 26 юни 1963. Продължителност 9:01."@bg .

nurav Feb 06 '16 09:02

Exactly.

jimkont Feb 06 '16 14:02

All right, working on it :+1:

nurav Feb 07 '16 10:02

In what file can I find the logic that generates the extraction?

nurav Feb 18 '16 19:02

It should be org.dbpedia.extraction.dataparser.StringParser in the core module. The nodeToString function should probably take care of LinkNodes (org/dbpedia/extraction/wikiparser/LinkNode.scala).

jimkont Feb 19 '16 11:02
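
For anyone picking this up, here is a minimal sketch of the behaviour the thread asks for. It does not use the framework's real classes: the Node, TextNode and LinkNode types below are simplified, hypothetical stand-ins, and the real StringParser.nodeToString may be structured quite differently. The sketch also assumes, without confirmation from the code, that the space gets lost when each node's text is trimmed separately, so it concatenates the pieces first and trims only the final string.

```scala
// Hypothetical, simplified stand-ins for the wiki AST nodes.
// The real classes in org.dbpedia.extraction.wikiparser differ in detail.
sealed trait Node
final case class TextNode(text: String) extends Node
final case class LinkNode(destination: String, children: List[Node]) extends Node

object NodeText {
  // Return the visible text of one node without trimming it,
  // so whitespace next to a link is not lost.
  def nodeToString(node: Node): String = node match {
    case TextNode(text) => text
    case LinkNode(destination, children) =>
      // Use the link label if present, otherwise the destination title.
      if (children.isEmpty) destination
      else children.map(nodeToString).mkString
  }

  // Concatenate all nodes first, then trim only the outer whitespace.
  def nodesToString(nodes: List[Node]): String =
    nodes.map(nodeToString).mkString.trim
}

object Demo {
  def main(args: Array[String]): Unit = {
    // [[Ich bin ein Berliner|Речта]] от Ратхаус Шьонеберг ...
    val description = List(
      LinkNode("Ich bin ein Berliner", List(TextNode("Речта"))),
      TextNode(" от Ратхаус Шьонеберг на Джон Кенеди, 26 юни 1963. Продължителност 9:01.")
    )
    println(NodeText.nodesToString(description))
  }
}
```

With this input the link label and the following text are joined as "Речта от ...", which matches the expected output quoted earlier in the thread.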