handling of special chars
Is ROBOT able to handle special characters within template annotations?
We had a case of an input robot.tsv (UTF-8) containing chāt
But the robot.ofn (UTF-8) output after running ROBOT had converted it to ch�t
FYI, the ROBOT template string for that column was AL oboInOwl:hasExactSynonym@en SPLIT=|
I would guess the problem is here: https://github.com/ontodev/robot/blob/810cc837fd157e572a1f143e740fce394b64742d/robot-core/src/main/java/org/obolibrary/robot/TemplateHelper.java#L850-L854
We need to use an input constructor that accepts a character set, rather than using the default platform character set.
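A minimal sketch of the idea, not the actual TemplateHelper code: passing a charset explicitly to the reader means the result no longer depends on the platform default. The class name `CharsetAwareReader` and the method `readLines` are hypothetical, and ISO-8859-1 is used here only as a stand-in for a non-UTF-8 platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of ROBOT.
public class CharsetAwareReader {

  // Read every line of the file, decoding bytes with the charset given by the caller
  // instead of relying on the JVM's platform default.
  public static List<String> readLines(Path path, Charset charset) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    // Write a one-row TSV as UTF-8 bytes; \u0101 is the character "a with macron".
    Path tsv = Files.createTempFile("robot-template", ".tsv");
    Files.write(tsv, "obo:chat\tch\u0101t\n".getBytes(StandardCharsets.UTF_8));

    // Decoding as UTF-8 recovers the original 13 characters.
    String good = readLines(tsv, StandardCharsets.UTF_8).get(0);
    // Decoding the same bytes as ISO-8859-1 turns the two-byte sequence into two
    // unrelated characters, so the line is one character longer and no longer matches.
    String bad = readLines(tsv, StandardCharsets.ISO_8859_1).get(0);

    System.out.println(good.length() + " vs " + bad.length());
    Files.delete(tsv);
  }
}
```

The same bytes give different strings depending on the decoder, which is consistent with the file reading correctly on macOS and incorrectly on a machine whose default charset is not UTF-8.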
Thanks @balhoff. I was not able to replicate on macOS. I made my own little TSV:
ID	Synonym
ID	AL oboInOwl:hasExactSynonym@en SPLIT=|
obo:chat	chāt
and ran robot template -t test.tsv -o test.ofn, and I see chāt in the output file.
@cmrn-rhi What operating system are you using?
@jamesaoverton Windows 10 (version 21H2). Sorry, should've mentioned it in the first place.
Thanks @cmrn-rhi!
@balhoff Do you see a way to do this in a backwards-compatible way?
Not offhand, but personally I think we can just call it a bug that needs to be fixed. I know you like to be more conservative than that!
This article says that UTF-8 is the default charset on all platforms (including Windows) starting with JDK 18: https://medium.com/@andbin/jdk-18-and-the-utf-8-as-default-charset-8451df737f90. These are the notes about that change for OpenJDK: https://openjdk.org/jeps/400.
It also says that you can use this option to override the default file encoding: java -Dfile.encoding=UTF-8.
@cmrn-rhi Could you please try running that same template with that option set, and see if it resolves this issue? Something like java -Dfile.encoding=UTF-8 -jar robot.jar template ...
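To see which default a particular JVM is actually using, a small diagnostic along these lines may help. This is only a sketch: the class name `DefaultCharsetCheck` is hypothetical, and it uses nothing beyond `Charset.defaultCharset()` and the `file.encoding` system property.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic for illustration; not part of ROBOT.
public class DefaultCharsetCheck {

  // Report the charset the JVM falls back to when code does not pass one explicitly,
  // together with the raw value of the file.encoding system property.
  public static String describe() {
    return "default charset: " + Charset.defaultCharset().name()
        + ", file.encoding: " + System.getProperty("file.encoding");
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```

Running it once as `java DefaultCharsetCheck` and once as `java -Dfile.encoding=UTF-8 DefaultCharsetCheck` on the affected machine should show whether the option changes the default there.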
Just gave it a go and that did work!