extraction-framework

invalid hexadecimal characters in short_abstracts_en

Open • jbenton-adc opened this issue 11 years ago • 5 comments

The short_abstracts_en file of the DBpedia 2014 dataset, downloaded from http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2 on 9/29/2014, fails to parse:

wget http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/en/short_abstracts_en.nt.bz2
bunzip2 short_abstracts_en.nt.bz2
head -n 1263475 short_abstracts_en.nt | tail > parse_error.nt
arq --strict --data parse_error.nt --query query.rq
08:53:18 ERROR riot :: [line: 8, col: 122] Not a hexadecimal character:
Failed to load data

  • I am using Jena ARQ version 2.11.2.

This seems to be the triple that is causing the problem:

<http://dbpedia.org/resource/Taiwanese_kana> <http://www.w3.org/2000/01/rdf-schema#comment> "Taiwanese kana (\u30BF\u30A \u30F2\u30A1\u30CC \u30AE\u30A \u30AB\u30A \u30D3\u30A7\u30F ) is a katakana-based writing system once used to write Holo Taiwanese, when Taiwan was ruled by Japan. It functioned as a phonetic guide to hanzi, much like furigana in Japanese or Zhuyin fuhao in Chinese. There were similar systems for other languages in Taiwan as well, including Hakka and Formosan languages.The system was imposed by Japan at the time, and used in a few dictionaries, as well as textbooks."@en .

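"\u30A " is not a valid Unicode escape: in N-Triples, \u must be followed by exactly four hexadecimal digits, and here the fourth character is a space. The same applies to "\u30F " later in the literal.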

jbenton-adc, Oct 02 '14 14:10
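
To locate every line with this kind of defect without running a full parser, the escape rule described above can be checked directly. The following is a minimal sketch, not part of the original report: the function names `bad_escape_columns` and `find_bad_lines` are hypothetical, and it assumes that checking backslash escapes over the whole line is good enough (it does not distinguish IRIs from literals, and the columns it reports are not guaranteed to match the ones Jena prints).

```python
import re

# A backslash followed by a valid escape is consumed as one token:
#   \uXXXX (4 hex digits), \UXXXXXXXX (8 hex digits), or one of \t \b \n \r \f \" \' \\
# If no valid escape follows, only the bare backslash matches and group 1 is None.
TOKEN = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|[tbnrf"\'\\])?')


def bad_escape_columns(line):
    """Return the 1-based columns of backslashes that start an invalid escape."""
    return [m.start() + 1 for m in TOKEN.finditer(line) if m.group(1) is None]


def find_bad_lines(path):
    """Yield (line_number, columns) for each line containing an invalid escape."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            columns = bad_escape_columns(line)
            if columns:
                yield number, columns


# Example use (the file name is taken from the report above):
#   for number, columns in find_bad_lines("short_abstracts_en.nt"):
#       print(number, columns)
```

Matching escapes as whole tokens means an escaped backslash followed by a literal "u" is not mistaken for a broken \u escape.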

+1

Hronom, Apr 25 '15 15:04

The same error is reported here: http://stackoverflow.com/questions/26415922/why-do-i-get-not-a-hexadecimal-character-when-using-tdbloader2

mgns, Jul 14 '15 07:07

Other syntax errors in this file are on lines 1947033, 2245904, 2305615 and 4391674. An easy workaround is to comment out each offending line, e.g. with variations on sed -i -e '4391674s/^/#/' short_abstracts_en.nt, one per line number.

pasky, Nov 12 '15 18:11
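
For anyone who prefers to apply the workaround in one pass and keep the original file untouched, the sed approach above can be sketched in Python. This is an illustrative sketch only: the names `comment_out_lines`, `comment_out_file` and `KNOWN_BAD_LINES` are hypothetical, and line 1263473 is not stated in the thread but derived from the reproduction commands (`head -n 1263475 | tail` keeps lines 1263466 to 1263475, so line 8 of parse_error.nt is line 1263473 of the full file).

```python
# Line numbers reported in this thread; 1263473 is derived from the
# head/tail reproduction commands (an inference, not stated explicitly).
KNOWN_BAD_LINES = [1263473, 1947033, 2245904, 2305615, 4391674]


def comment_out_lines(lines, bad_line_numbers):
    """Yield each line, prefixed with '#' if its 1-based number is listed."""
    bad = set(bad_line_numbers)
    for number, line in enumerate(lines, start=1):
        yield ("#" + line) if number in bad else line


def comment_out_file(src_path, dst_path, bad_line_numbers=KNOWN_BAD_LINES):
    """Copy src_path to dst_path, turning the listed lines into N-Triples comments."""
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(comment_out_lines(src, bad_line_numbers))


# Example use (the output file name is an assumption):
#   comment_out_file("short_abstracts_en.nt", "short_abstracts_en.fixed.nt")
```

Because a line starting with "#" is a comment in N-Triples, the affected triples are dropped rather than repaired; the abstracts on those lines will be missing from the loaded data.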

Starting with the next release, we will switch to the TTL files, which do not have this problem.

jimkont, Nov 18 '15 06:11

@Vehnem, has this been fixed and validated in the recent releases?

m1ci, May 15 '20 15:05