ChEBI icon indicating copy to clipboard operation
ChEBI copied to clipboard

TSV exports are broken

Open pbordron opened this issue 4 years ago • 5 comments

Hi,

While trying importing the compound_origins.tsv flat file with R, I get the following error:

Warning: 25 parsing failures.
 row col   expected                                                                             actual                                                                                   file
7963  -- 11 columns 10 columns                                                                         'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compound_origins.tsv'
7964  ID an integer Tine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De Coen 'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compound_origins.tsv'
7964  -- 11 columns 1 columns                                                                          'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compound_origins.tsv'
7965  ID an integer Environment International 36 (2010) 254-268                                        'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compound_origins.tsv'
7965  -- 11 columns 2 columns [... truncated]

The import script is the following:

library('tidyverse')

COMPOUND_ORIGIN_FILE <- 'ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/compound_origins.tsv'
compound_origins <- read_tsv(
   COMPOUND_ORIGIN_FILE,
   col_types = cols(
       ID = col_integer(),
       COMPOUND_ID = col_integer(),
       SPECIES_TEXT = col_character(),
       SPECIES_ACCESSION = col_character(),
       COMPONENT_TEXT = col_character(),
       COMPONENT_ACCESSION = col_character(),
       STRAIN_TEXT = col_character(),
       STRAIN_ACCESSION = col_character(),
       SOURCE_TYPE = col_character(),
       SOURCE_ACCESSION = col_character(),
       COMMENTS = col_character()
       ),
   na = c("", "NA", "null"))

When looking at the data with a cat -A, I get things like that:

8033^I27570^IDaphnia magna^INCBI:txid35525^Inull^Inull^Inull^Inull^IArticle^IMixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyond$
Tine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De Coen$
Environment International 36 (2010) 254-268^Inull$

Please, could you use a correct csv/tsv exporter, or at least quote fields that contain line break as defined in the csv rfc(tsv is a csv dialect) https://tools.ietf.org/html/rfc4180.html#page-3

When looking at the compound_origins.sql file as an alternative to tsv file, I found many problematic artifacts. For example duplicated strings:

insert into compound_origins (id, compound_id, species_text, species_accession, component_text, component_accession, strain_text, source_type, source_accession, comments) values ('8043','18186','Daphnia magna','NCBI:txid35525','','','','Article','Mixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyondTine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De CoenEnvironment International 36 (2010) 254-268Mixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyondTine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De CoenEnvironment International 36 (2010) 254-268','');

I am also able to found my problematic entry (8033) with some "\n" in it, whereas other entry for the same publication doesn't have "\n".

insert into compound_origins (id, compound_id, species_text, species_accession, component_text, component_accession, strain_text, source_type, source_accession, comments) values ('8031','15356','Daphnia magna','NCBI:txid35525','','','','Article','Mixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyondTine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De CoenEnvironment International 36 (2010) 254-268','');
insert into compound_origins (id, compound_id, species_text, species_accession, component_text, component_accession, strain_text, source_type, source_accession, comments) values ('8033','27570','Daphnia magna','NCBI:txid35525','','','','Article','Mixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyond\nTine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De Coen\nEnvironment International 36 (2010) 254-268','');

Could you fix those problems? Thanks a lot

Philippe

pbordron avatar Feb 01 '21 08:02 pbordron

nobody to solve this issue?

pbordron avatar Jul 06 '21 10:07 pbordron

Hi, are you able to provide me with a list of ChEBI ID's which have these issues and i will try and fix them manually. For example, i have replaced the article name in the problematic entry 8033 (CHEBI:27570) with the PMID of the article which will solve most of the issues.

amalik01 avatar Jul 06 '21 11:07 amalik01

Unfortunately, we are a little understaffed at ChEBI right now, and I work for other projects as well. I know the files should be properly formatted, but we do not have the resources right now to perform new development/fixes of code often. However, some of this issues might be fixed by our curator. I found that the problematic records right now are:

cat -A compound_origins.tsv | grep -v -B 1 "^[0-9]"
ID^ICOMPOUND_ID^ISPECIES_TEXT^ISPECIES_ACCESSION^ICOMPONENT_TEXT^ICOMPONENT_ACCESSION^ISTRAIN_TEXT^ISTRAIN_ACCESSION^ISOURCE_TYPE^ISOURCE_ACCESSION^ICOMMENTS$
--
2834^I67776^ILaurencia marilzae^INCBI:txid99905^Inull^Inull^Inull^Inull^IPubMed Id^I21338119^I1.CH2Cl2/MeOH(1:1) extract of fresh alga.2.Bromoallene stereo is 'R'$
$
--
3123^I67967^IConoideocrella tenuis^INCBI:txid1105321^Imycelium^IBTO:0001436^IBCC 18627^Inull^IPubMed Id^I21473608^I1.Filtrate of mycelia macerated with methanol2.atropisomer of ES-242-2$
$
--
8033^I27570^IDaphnia magna^INCBI:txid35525^Inull^Inull^Inull^Inull^IArticle^IMixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyond$
Tine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De Coen$
Environment International 36 (2010) 254-268^Inull$
--
8077^I28860^IDaphnia magna^INCBI:txid35525^Inull^Inull^Inull^Inull^IArticle^IMixtures of similarly acting compounds in Daphnia magna: From gene to metabolite and beyond$
Tine Vandenbrouck, Oliver A.H. Jones, Nathalie Dom, Julian L. Griffin, Wim De Coen$
Environment International 36 (2010) 254-268^Inull$
--
8082^I50464^IDaphnia galeata^INCBI:txid27404^Inull^Inull^Inull^Inull^IArticle^IIs the fatty acid composition of Daphnia galeata determined by the fatty acid composition of the ingested diet?$
Weers P.M.M., Siewertsen K., and Gulati R.D.$
Freshwater Biology (1997), 38, 731-738^Inull$

@amalik01 might be able to help us with this removing the linefeed characters, and in our next release they should be ok

juanfmx2 avatar Jul 06 '21 11:07 juanfmx2

I have fixed the entries mentioned by @juanfmx2 but if there are other entries which require fixing, then please let me know.

amalik01 avatar Jul 06 '21 12:07 amalik01

Hi, I found that compound_origins_3star.tsv has a wrong header. Not only the name of the columns but also the number of them. I think it is the same header as compounds*.tsv.gz. This causes unexpected results when relying on the header for selecting columns. I checked an old version from a previous release and they have the same error. The full version, compound_origins.tsv have a correct header though.

jlopez-ibanez avatar Apr 06 '22 08:04 jlopez-ibanez