gbm_cptac_2021 data issue

Open alexsigaras opened this issue 3 years ago • 3 comments

Cohort location on repo: public/gbm_cptac_2021
Latest commit hash: 490bba9ec1b5f41c386932836d531a5ba07bc01a
cBioPortal version: 3.7.22
Import attempted on: 01/20/2022
Validation status: Failed
Import status: Failed
Error: _csv.Error: line contains NUL

Validation fails when validating file data_methylation_epic.txt

Attaching the log gbm_cptac_2021_import_01_22_2022.log
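Since the error message complains about a NUL byte, a quick check of the local copy may help narrow things down. This is a minimal sketch, not part of the cBioPortal scripts: `find_nul_bytes` is a hypothetical helper that reads the file in binary mode and reports the byte offsets of the first few NUL (0x00) bytes it finds.

```python
def find_nul_bytes(path, chunk_size=1 << 20, max_hits=10):
    """Return up to max_hits byte offsets of NUL (0x00) bytes in the file.

    Hypothetical helper for illustration; reads in chunks so that large
    data files do not have to fit in memory.
    """
    offsets = []
    position = 0  # absolute offset of the start of the current chunk
    with open(path, "rb") as handle:
        while len(offsets) < max_hits:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            start = 0
            while len(offsets) < max_hits:
                index = chunk.find(b"\x00", start)
                if index == -1:
                    break
                offsets.append(position + index)
                start = index + 1
            position += len(chunk)
    return offsets
```

Calling `find_nul_bytes("data_methylation_epic.txt")` on the local copy returns an empty list if the file has no NUL bytes; a non-empty list would point at where the copy differs from a plain text file.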

alexsigaras, Jan 24 '22 14:01

@alexsigaras I'm not sure why the validator throws a csv error when data_methylation_epic.txt is a text file. We have CircleCI running the validator (validateData.py) on all studies every week, and I just checked that this study has been passing validation. This is the report from last week: https://app.circleci.com/pipelines/github/cBioPortal/datahub/2735/workflows/a7b0de6e-31ae-4386-a49a-017861f792ba/jobs/8597/steps

yichaoS, Feb 08 '22 22:02

@n1zea144
I ran the docker import of this study and got the import error below instead (validation passed). At the same time, the study imported to triage successfully.

Reading data from:  /study/gbm_cptac_2021/data_mirna.txt
Recaching...
Finished recaching...
--> profile id:  11
--> profile name:  miRNA expression (FPKM uq)
--> genetic alteration type:  MRNA_EXPRESSION
--> total number of samples: 98
--> total number of data lines:  2883
--> records inserted into `sample_profile` table: 98
--> total number of data entries skipped (see table below):  2883
org.mskcc.cbio.portal.dao.DaoException: Something has gone wrong!  I did not save any records to the database!
at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:307)
at org.mskcc.cbio.portal.scripts.ImportProfileData.run(ImportProfileData.java:125)
at org.mskcc.cbio.portal.scripts.ConsoleRunnable.runInConsole(ConsoleRunnable.java:145)
at org.mskcc.cbio.portal.scripts.ImportProfileData.main(ImportProfileData.java:150)

Warnings / Errors:
-------------------
0.  Entrez_Id null not found. Record will be skipped for this gene.; 2883x

ABORTED!
java.lang.RuntimeException: org.mskcc.cbio.portal.dao.DaoException: Something has gone wrong!  I did not save any records to the database!
at org.mskcc.cbio.portal.scripts.ImportProfileData.run(ImportProfileData.java:130)
at org.mskcc.cbio.portal.scripts.ConsoleRunnable.runInConsole(ConsoleRunnable.java:145)
at org.mskcc.cbio.portal.scripts.ImportProfileData.main(ImportProfileData.java:150)
Caused by: org.mskcc.cbio.portal.dao.DaoException: Something has gone wrong!  I did not save any records to the database!
at org.mskcc.cbio.portal.scripts.ImportTabDelimData.importData(ImportTabDelimData.java:307)
at org.mskcc.cbio.portal.scripts.ImportProfileData.run(ImportProfileData.java:125)
... 2 more
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error occurred during data loading step. Please fix the problem and run this again to make sure study is completely loaded.
Traceback (most recent call last):
  File "/usr/local/bin/metaImport.py", line 202, in <module>
    cbioportalImporter.main(args)
  File "/cbioportal/core/src/main/scripts/importer/cbioportalImporter.py", line 533, in main
    process_directory(jvm_args, study_directory, args.update_generic_assay_entity)
  File "/cbioportal/core/src/main/scripts/importer/cbioportalImporter.py", line 368, in process_directory
    import_study_data(jvm_args, meta_filename, data_filename, update_generic_assay_entity, study_meta_dictionary[meta_filename])
  File "/cbioportal/core/src/main/scripts/importer/cbioportalImporter.py", line 162, in import_study_data
    run_java(*args)
  File "/cbioportal/core/src/main/scripts/importer/cbioportal_common.py", line 990, in run_java
    raise RuntimeError('Aborting due to error while executing step.')
RuntimeError: Aborting due to error while executing step.
ERROR: 1

yichaoS, Feb 16 '22 17:02

Hi @yichaoS @alexsigaras, I started to look at this issue this afternoon. @alexsigaras, from your log it looks like you are running metaImport.py from a clone of the cbioportal GitHub repository and validating against metadata in cbioportal.org. I attempted to recreate a similar setup:

  • I grabbed the latest datahub
  • I grabbed the latest master branch
  • I ran the following: ./portal/core/src/main/scripts/importer/metaImport.py -u https://www.cbioportal.org -s ~/prgs/cbio/cbio-portal-data/datahub/public/gbm_cptac_2021 -v

I was able to successfully validate all the files. Here is my output for the methylation file in question:

DEBUG: data_methylation_epic.txt: Starting validation of file
WARNING: data_methylation_epic.txt: line 1: The recommended column Entrez_Gene_Id was not found. Using Hugo_Symbol for all gene parsing.
WARNING: data_methylation_epic.txt: lines [5, 6, 13, (238223 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['IARS', 'NA', 'HHLA1;RP11-240B13.2', '(27837 more)']
WARNING: data_methylation_epic.txt: lines [120, 301, 311, (261753 more)]: Duplicate line for a previously listed feature/gene, this line will be ignored.; values encountered: ['54897 (already defined on line 118)', '23281 (already defined on line 245)', '55048 (already defined on line 232)', '(16553 more)']
WARNING: data_methylation_epic.txt: lines [17132, 30758, 106954, (14 more)]: Hugo Symbol is not in gene or alias table and starts with a number. This can be caused by unintentional gene conversion in Excel.; values encountered: ['7SK;RP11-141O11.2', '5S_RRNA', '7SK;AP000233.2', '(6 more)']
WARNING: data_methylation_epic.txt: lines [518308, 518310, 518313, (110422 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['AL645941.1;HLA-DMB', 'RP11-50B3.4;RPUSD4', 'WNT9B;WNT3', '(21994 more)']
WARNING: data_methylation_epic.txt: lines [518309, 518311, 518312, (127262 more)]: Duplicate line for a previously listed feature/gene, this line will be ignored.; values encountered: ['171017 (already defined on line 10948)', '10133 (already defined on line 45933)', '23263 (already defined on line 1213)', '(16286 more)']
WARNING: data_methylation_epic.txt: lines [520714, 543447, 582130, (10 more)]: Hugo Symbol is not in gene or alias table and starts with a number. This can be caused by unintentional gene conversion in Excel.; values encountered: ['7SK;PLCG2', '5S_RRNA;FMR1NB', '5S_RRNA', '(4 more)']
INFO: data_methylation_epic.txt: Validation of file complete
INFO: data_methylation_epic.txt: Read 756641 lines. Lines with warning: 737673. Lines with error: 0

When @yichaoS first described the issue to me, I thought maybe it was because our validators were pointing to different cbioportal instances, but that is not the case. Given your validator output (_csv.Error: line contains NUL), I wonder if something is corrupt in your copies of the study data. Below are md5sums of my copies of the data files.

Just for grins, here are md5sums for some of my scripts (maybe some behavior was changed):
MD5 (./portal/core/src/main/scripts/importer/metaImport.py) = 2864b26ee158bda569e01b1b59a2ee1d
MD5 (./portal/core/src/main/scripts/importer/validateData.py) = 6f8ab6e02a64f178be4952f16239e176

gbm_cptac_2021/data_acetylprotein_quantification.txt) = a330f41c381b9381c34f4e569f2c9044
gbm_cptac_2021/data_circular_rna.txt) = 184450ab1c465d91a290621ad394d1d6
gbm_cptac_2021/data_clinical_patient.txt) = 4a4903fc2aa45aacd5f71babea09e926
gbm_cptac_2021/data_clinical_sample.txt) = ba5d440c689b39776a14b5829db4e2b7
gbm_cptac_2021/data_cna.txt) = c6deb4565662056a19c1ab97f40bdb82
gbm_cptac_2021/data_cna_hg19.seg) = a77b597551dd368e0332743d75383938
gbm_cptac_2021/data_lipidome_negative_quantification.txt) = 1f87f3d788701b8ff94f91f824fb68c0
gbm_cptac_2021/data_lipidome_positive_quantification.txt) = f8b2995ddd0ebd6085d1692af5bd7606
gbm_cptac_2021/data_log2_cna.txt) = 1d01c3f637a868d50cabb43b8732ecd2
gbm_cptac_2021/data_metabolome_quantification.txt) = 444cd93fca42a9e2d03fc9c85e584520
gbm_cptac_2021/data_methylation_epic.txt) = f86ddc840759a371c8460fc33372ce30
gbm_cptac_2021/data_mirna.txt) = 94c9523e6de388660c9b2e281f11306d
gbm_cptac_2021/data_mirna_zscores.txt) = 3650fb06a61b50dc116696154e67d7aa
gbm_cptac_2021/data_mrna_seq_fpkm.txt) = ebae8e41f149106c77b8bb646d413c35
gbm_cptac_2021/data_mrna_seq_fpkm_zscores_ref_all_samples.txt) = 55011052c5b71511d7ff4a94e668dec2
gbm_cptac_2021/data_mrna_seq_fpkm_zscores_ref_diploid_samples.txt) = d38c37946b65baa9d529e0b1fbf401a0
gbm_cptac_2021/data_mutations.txt) = 7c69eb0d90828326c6bbcaab548ef3bd
gbm_cptac_2021/data_phosphoprotein_quantification.txt) = 3cf27ea0f0eea59e987d9b6f54ac125e
gbm_cptac_2021/data_protein_quantification.txt) = 231a971fcfb2c4c7102bdd7579b2a275
gbm_cptac_2021/data_protein_quantification_zscores.txt) = 8cc4da8b28247afdf0f6416f4003874e
gbm_cptac_2021/data_single_cell_cycle_phases.txt) = 3afa57b90e12ed18d022c5e4febcfe58
gbm_cptac_2021/data_single_cell_type_fractions.txt) = 02678f6d8f6415e5f586fe5d02399590
gbm_cptac_2021/meta_acetylprotein_quantification.txt) = 9b326765dafe66b49f2d2b97d043255f
gbm_cptac_2021/meta_circular_rna.txt) = b5303583a0116a02e89f4b360a75ec69
gbm_cptac_2021/meta_clinical_patient.txt) = e21fc677997aa2d8339f31612183951c
gbm_cptac_2021/meta_clinical_sample.txt) = 18ae06fba1845d63b911bcfeb1675404
gbm_cptac_2021/meta_cna.txt) = ca7ac7c88b1828087d5087dd9d2a3d91
gbm_cptac_2021/meta_cna_hg19_seg.txt) = b16130f3b49bf61169dc6fe463bbc527
gbm_cptac_2021/meta_lipidome_negative_quantification.txt) = 31bd16fee25b6ba6ee0434174312c18d
gbm_cptac_2021/meta_lipidome_positive_quantification.txt) = de76896c69ca94f28be968ff4832102b
gbm_cptac_2021/meta_log2_cna.txt) = ce02a2397a4a2fc00413a440cc3f5528
gbm_cptac_2021/meta_metabolome_quantification.txt) = 34ae32b8821be1dde43cb8ea20e36fc7
gbm_cptac_2021/meta_methylation_epic.txt) = 47f34057fb96bceb88998417aabc85a4
gbm_cptac_2021/meta_mirna.txt) = 3d10052b127145aa8688f123775742d4
gbm_cptac_2021/meta_mirna_zscores.txt) = 4c2bd703a399892ffad8f90547340fa4
gbm_cptac_2021/meta_mrna_seq_fpkm.txt) = 47463a7f4a376b29c0b183f055375cc2
gbm_cptac_2021/meta_mrna_seq_fpkm_zscores_ref_all_samples.txt) = 3c023de0835a2fccab44119d27941849
gbm_cptac_2021/meta_mrna_seq_fpkm_zscores_ref_diploid_samples.txt) = 5689a11402156765f4c94ec743ca8b72
gbm_cptac_2021/meta_mutations.txt) = 37058307cffc53e4ce3f00df4801b71a
gbm_cptac_2021/meta_phosphoprotein_quantification.txt) = 69442dcf964eb79c1dd1e085d7663f30
gbm_cptac_2021/meta_protein_quantification.txt) = 30260caa5fec52bf2a337a1844e17ff4
gbm_cptac_2021/meta_protein_quantification_zscores.txt) = efeecfb8c7f0c8eaa783224186737357
gbm_cptac_2021/meta_single_cell_cycle_phases.txt) = f16e4a4b4842ee3a1631baf636488bd8
gbm_cptac_2021/meta_single_cell_type_fractions.txt) = 9df781f160ee20878ae6f8a0c346467b
gbm_cptac_2021/meta_study.txt) = f80d0e2c4995f509cd4015dbd764c5bd
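To compare a local copy against the checksums above without relying on a platform-specific `md5` command, something like the following should work. It is only a sketch: `md5_of_file` and `find_mismatches` are hypothetical helper names, and the `expected` mapping has to be filled in by hand from the list above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(study_dir, expected):
    """Compare files in study_dir against expected {filename: md5} pairs.

    Returns {filename: actual_md5} for every file whose digest differs;
    the value is None when the file is missing.
    """
    mismatches = {}
    for name, expected_md5 in expected.items():
        path = Path(study_dir) / name
        actual = md5_of_file(path) if path.is_file() else None
        if actual != expected_md5.lower():
            mismatches[name] = actual
    return mismatches
```

For example, `find_mismatches("public/gbm_cptac_2021", {"data_methylation_epic.txt": "f86ddc840759a371c8460fc33372ce30"})` returns an empty dict if the local methylation file matches my copy, and the differing digest otherwise.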

n1zea144, Feb 18 '22 21:02