ChEBI
Feature Request: Provide at least one version of the raw data files with UTF-8 encoding
First, I just wanted to say -- thanks so much for the incredible resource! Being able to access the raw database dumps that back ChEBI is fantastic.
As mentioned in the ChEBI FTP site README file:
- The data has been exported in an ASCII format.
Because of this conversion, it is difficult to display the exported data in a user-friendly way. For example, the definition of the deuterium atom (CHEBI:29237) is shown as follows on the ChEBI website:
The stable isotope of hydrogen with relative atomic mass 2.014102 and a natural abundance of 0.0115 atom percent (from Greek δευτερον, second).
However, in the tab-separated flat file dump, it is stored as:
The stable isotope of hydrogen with relative atomic mass 2.014102 and a natural abundance of 0.0115 atom percent (from Greek deltaepsilonupsilontauepsilonrhoomicronnu, second).
In general, the conversion from Unicode characters to ASCII is lossy: once a Greek letter has been replaced by its spelled-out name, there is no reliable way to convert back to the Greek characters, because the names can also appear as (or inside) ordinary ASCII text.
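To illustrate the ambiguity, here is a rough Python sketch that lists every way an ASCII string could be read back as Greek letter names and/or literal characters. The `GREEK` table and the `decodings` helper are my own illustration (I am assuming the export simply spells out each lowercase letter name, which is what the deuterium example suggests); they are not part of ChEBI.

```python
from functools import lru_cache

# Assumed mapping: lowercase Greek letter names as they seem to appear in the dump.
GREEK = {
    "alpha": "\u03b1", "beta": "\u03b2", "gamma": "\u03b3", "delta": "\u03b4",
    "epsilon": "\u03b5", "zeta": "\u03b6", "eta": "\u03b7", "theta": "\u03b8",
    "iota": "\u03b9", "kappa": "\u03ba", "lambda": "\u03bb", "mu": "\u03bc",
    "nu": "\u03bd", "xi": "\u03be", "omicron": "\u03bf", "pi": "\u03c0",
    "rho": "\u03c1", "sigma": "\u03c3", "tau": "\u03c4", "upsilon": "\u03c5",
    "phi": "\u03c6", "chi": "\u03c7", "psi": "\u03c8", "omega": "\u03c9",
}


@lru_cache(maxsize=None)
def decodings(text):
    """Return every reading of `text` as letter names and/or literal characters."""
    if not text:
        return frozenset({""})
    results = set()
    # Option 1: the text starts with a spelled-out Greek letter name.
    for name, char in GREEK.items():
        if text.startswith(name):
            results.update(char + rest for rest in decodings(text[len(name):]))
    # Option 2: the first character is just a literal ASCII character.
    results.update(text[0] + rest for rest in decodings(text[1:]))
    return frozenset(results)


exported = "deltaepsilonupsilontauepsilonrhoomicronnu"
candidates = decodings(exported)
print(len(candidates), "possible readings of the exported word")

# An ordinary English word is ambiguous too, since it contains "eta".
print(sorted(decodings("metal")))
```

The original Greek word is among the candidates, but so are many others, and nothing in the ASCII text says which one was intended.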
This could be solved by providing the flat file (or other dump files) in the UTF-8 format, which removes all ambiguity.
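For comparison, a small sketch of the same kind of tab-separated row round-tripping through UTF-8 with nothing lost. The two-column row is made up for the example and the data stays in memory; I am not assuming anything about the real file names or column layout.

```python
import csv
import io

# Hypothetical two-column row (ID, definition fragment), for illustration only.
row = ["29237", "from Greek \u03b4\u03b5\u03c5\u03c4\u03b5\u03c1\u03bf\u03bd, second"]

# Write the row as UTF-8 encoded, tab-separated bytes (standing in for a dump file).
buffer = io.BytesIO()
text_out = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
csv.writer(text_out, delimiter="\t").writerow(row)
text_out.flush()
raw_bytes = buffer.getvalue()

# Read the bytes back, decoding explicitly as UTF-8.
text_in = io.StringIO(raw_bytes.decode("utf-8"), newline="")
round_tripped = next(csv.reader(text_in, delimiter="\t"))

print(round_tripped == row)
```

With a real file, the equivalent would be passing `encoding="utf-8"` to `open()` on both the writing and the reading side.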
Is this something that could be considered for some future data release?
Thanks, Nate
Thanks. We are currently redeveloping ChEBI and this is something we can definitely take a look at and fix in the new system.