extraction-framework

Escaped '\n' in extracted URIs

Open Integer-Ctrl opened this issue 8 months ago • 3 comments

Issue validity

Examples contain '\n' in URIs:

Error Description

The extracted triples contain URIs where '\n' (escaped newline characters) appear within the URI string, which violates URI syntax and leads to broken links.

Pinpointing the source of the error

The DIEF extractor includes newline characters in some URIs. The extraction process itself completes without error, but the resulting triples contain these invalid URIs.

Details

Here is an example from the extraction of Berlin. Not all lines that contain '\n' are included.

Wrong triples

<http://de.dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:\n_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://de.dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .
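Since the extraction itself does not fail, lines like the ones above only show up if the output is checked. The following is a minimal sketch of such a check, not part of DIEF: the object name `IriCheck` and the method `suspiciousIris` are hypothetical. It relies on the fact that N-Triples permits only numeric `\u` / `\U` escapes inside an IRI reference, so a `\n`, `\r`, or `\t` escape between angle brackets is always invalid.

```scala
object IriCheck {
  // Matches an IRI reference in angle brackets, as written in N-Triples.
  private val IriRef = "<([^<>]*)>".r

  // Two-character escape sequences (backslash + letter) that must not occur in an IRI.
  private val ForbiddenEscapes = List("\\n", "\\r", "\\t")

  /** Returns every IRI on the line that contains one of the forbidden escapes. */
  def suspiciousIris(line: String): List[String] =
    IriRef.findAllMatchIn(line)
      .map(_.group(1))
      .filter(iri => ForbiddenEscapes.exists(esc => iri.contains(esc)))
      .toList
}
```

Applied to the first depiction triple above, this would return the single object IRI containing `\n`; applied to the corrected triples, it would return an empty list.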

Expected / corrected outcome snippet

<http://de.dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:_Berlin_U-Bahn_IK_at_Olympia-Stadion_(3).jpg> .
<http://de.dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://xmlns.com/foaf/0.1/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Image> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .
<http://commons.wikimedia.org/wiki/Special:FilePath/_Baureihe_483-484_der_S-Bahn_Berlin.jpg?width=300> <http://purl.org/dc/elements/1.1/rights> <http://de.wikipedia.org/wiki/Datei:_Baureihe_483-484_der_S-Bahn_Berlin.jpg> .

Integer-Ctrl, Apr 14 '25 14:04

This also occurs in English: https://dief.tools.dbpedia.org/server/extraction/en/extract?title=Iceland&revid=&format=turtle-triples&extractors=custom

kurzum, Apr 15 '25 07:04

Hi! I’d like to help with this issue.

From the examples, it looks like the extractor is emitting literal “\n” escape sequences inside URIs (instead of stripping or normalizing them), which results in syntactically invalid IRIs and broken triples. I tested a few samples in the DIEF extraction path and traced the problem to newline characters leaking into file paths / titles during template processing.

Before I start working on a fix, I want to confirm the expected behaviour:

• Should all escaped '\n' sequences inside URIs be removed entirely (i.e., replaced with an empty string), or normalized to “_”?
• Or should any whitespace/newline inside a URI be percent-encoded as %0A?
• Is the preferred solution to sanitize titles/filenames early in the template/image extraction step, or later during URI construction?

My plan would be to add a normalisation step that strips or encodes newline characters before URI generation, and add tests covering both the incorrect “\n_xxx.jpg” pattern and valid output.
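The two options from the questions above (strip or percent-encode) could be sketched as follows. This is only an illustration under assumptions: `FileNameCleanup`, `stripLineBreaks`, and `encodeLineBreaks` are hypothetical names, not existing DIEF code, and the input is assumed to be the raw file name containing real line-break characters, before the serializer turns them into the '\n' escape.

```scala
object FileNameCleanup {
  // Option 1: drop carriage returns and line feeds from the file name.
  def stripLineBreaks(name: String): String =
    name.filterNot(c => c == '\n' || c == '\r')

  // Option 2: keep the characters but percent-encode them (%0D for CR, %0A for LF).
  def encodeLineBreaks(name: String): String =
    name.replace("\r", "%0D").replace("\n", "%0A")
}
```

With option 1, "\n_Baureihe_483-484_der_S-Bahn_Berlin.jpg" becomes "_Baureihe_483-484_der_S-Bahn_Berlin.jpg", which matches the expected outcome snippet in the issue description. Option 2 would yield a syntactically valid URI, but one that probably does not point to an existing file.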

Let me know if this direction makes sense — I’m happy to take this on.

arnavsharma990, Nov 27 '25 15:11

Hi @Integer-Ctrl @kurzum,

I’ve been working on this issue for the past 5 days, and I concluded that the bug occurs due to regex matching in ImageExtractor.scala. The regex does not exclude newline characters, which causes problems when wiki markup is written like this:

[[File: Berlin_Map.png]]

In contrast, in ImageExtractorNew.scala, this issue does not occur because it does not use regex. The following line explicitly excludes newline characters:

// ImageLink-Content
} else if (iterator.hasNext) {
  if (c != ':' && c != '=' && c != '|' && c != '\n') {
    sb.append(c)
  }
}

This ensures that \n is never included in the extracted filename. Therefore, this bug only affects pipelines that are still using the old ImageExtractor.
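To illustrate the effect, here is a small sketch with two hypothetical patterns. They are not the actual regexes from ImageExtractor.scala; `ImagePatternDemo`, `Permissive`, `Strict`, and `fileName` are made-up names. The point is only that a negated character class in Java/Scala regexes also matches line breaks unless '\n' is excluded explicitly.

```scala
import scala.util.matching.Regex

object ImagePatternDemo {
  // Hypothetical pattern: the file name is everything up to '|' or ']'.
  // The negated class does not exclude '\n', so a line break is captured too.
  val Permissive: Regex = """\[\[File:([^|\]]+)""".r

  // Hypothetical fix: skip leading whitespace and exclude '\n' from the name.
  val Strict: Regex = """\[\[File:\s*([^|\]\n]+)""".r

  /** Returns the first captured file name, if the pattern matches. */
  def fileName(pattern: Regex, markup: String): Option[String] =
    pattern.findFirstMatchIn(markup).map(_.group(1))
}
```

For the markup "[[File:\nBerlin_Map.png|thumb]]" (with a real line break after the colon), `Permissive` captures the line break together with the name, while `Strict` captures only "Berlin_Map.png".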

Thank you!

Aman-Baliyan, Dec 10 '25 06:12

Might be related to #748.

jmkeil, Jan 05 '26 14:01