cBioPortalData
brca_tcga_pan_can_atlas_2018 is failing with `cBioDataPack`
Hi, I'm seeing a parsing error for `brca_tcga_pan_can_atlas_2018`: `utils::read.table` chokes too easily on malformed files. Is it worth considering switching to readr/vroom or data.table here to harden against malformed files in the tarballs?
> packageVersion("cBioPortalData")
[1] ‘2.12.0’
> brca <- cBioPortalData::cBioDataPack("brca_tcga_pan_can_atlas_2018", ask = FALSE)
Warning: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘AnnotationDbi’
Warning in .service_validate_md5sum(api_reference_url, api_reference_md5sum, :
service version differs from validated version
service url: https://www.cbioportal.org/api/v2/api-docs
observed md5sum: 008be96361f24a5c8d1cfb7f10ae9c97
expected md5sum: 07ceb76cc5afcf54a9cf2e1a689b18f7
Calls: <Anonymous> ... initialize -> initialize -> Service -> .service_validate_md5sum
Downloading study file: brca_tcga_pan_can_atlas_2018.tar.gz
|======================================================================| 100%
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_armlevel_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_cna_hg19.seg
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_gene_panel_matrix.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_log2_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_microbiome.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mutations.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_phosphoprotein_quantification.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_protein_quantification_zscores.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_protein_quantification.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_rppa_zscores.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_rppa.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_sv.txt
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Calls: <Anonymous> ... <Anonymous> -> .preprocess_data -> <Anonymous> -> read.table
Backtrace:
▆
1. └─cBioPortalData::cBioDataPack(...)
2. └─cBioPortalData::loadStudy(exdir, names.field, cleanup)
3. └─cBioPortalData:::.loadExperimentsFromFiles(...)
4. └─base::Map(...)
5. └─base::mapply(FUN = f, ..., SIMPLIFY = FALSE)
6. └─cBioPortalData (local) `<fn>`(y = dots[[1L]][[18L]], x = dots[[2L]][[18L]])
7. └─cBioPortalData:::.preprocess_data(...)
8. └─utils::read.delim(...)
9. └─utils::read.table(...)
Hi @mjsteinbaugh,
Switching to another reader might only alleviate the symptoms. It would be better to report data errors at https://github.com/cbioportal/cbioportal
I will take a look at the details.
Best, Marcel
Thanks @LiNk-NY, I'll file a bug there too.
Following up, I agree that it's better to fix the upstream source, but it does look like readr handles this file OK.
library(pipette)
con <- "https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt"
## Errors, as expected.
sv_base <- import(
    con = con,
    format = "tsv",
    colnames = TRUE
)
## Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", :
## more columns than column names
## Calls: import ... import -> .local -> do.call -> do.call -> <Anonymous>
## Munges the number of columns and names, not great.
sv_dt <- import(
    con = con,
    format = "tsv",
    engine = "data.table",
    colnames = TRUE
)
print(dim(sv_dt))
## [1] 5335 17
## The readr/vroom engine seems to parse OK.
sv_readr <- import(
    con = con,
    format = "tsv",
    engine = "readr",
    colnames = TRUE
)
print(dim(sv_readr))
## [1] 5336 13
print(colnames(sv_readr))
## [1] "Sample_Id" "Site1_Hugo_Symbol"
## [3] "Site1_Chromosome" "Site1_Position"
## [5] "Site2_Hugo_Symbol" "Site2_Chromosome"
## [7] "Site2_Position" "Site2_Effect_On_Frame"
## [9] "Tumor_Split_Read_Count" "Tumor_Paired_End_Read_Count"
## [11] "SV_Status" "NCBI_Build"
## [13] "Event_Info"
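For anyone wanting to locate the offending rows before filing upstream, here is a minimal base R sketch. It is not part of cBioPortalData; it assumes the file is plain tab-separated text and uses a small made-up file (the rows and the temporary path are hypothetical) in place of the real `data_sv.txt`. `utils::count.fields` reports the number of fields per line, so any line whose count differs from the header is a candidate for the "more columns than column names" error.

```r
## Hypothetical toy file: 3-column header, with one data row carrying an extra field.
path <- tempfile(fileext = ".txt")
writeLines(
    text = c(
        "Sample_Id\tSite1_Hugo_Symbol\tSV_Status",
        "S1\tGENE1\tSOMATIC",
        "S2\tGENE2\tSOMATIC\tEXTRA",
        "S3\tGENE3\tSOMATIC"
    ),
    con = path
)
## Count tab-separated fields on every line, with quoting and comments disabled
## so that stray quote or hash characters do not change the counts.
n_fields <- utils::count.fields(
    file = path,
    sep = "\t",
    quote = "",
    comment.char = "",
    blank.lines.skip = FALSE
)
## Line numbers (1-based, header is line 1) whose field count differs from the header.
bad_lines <- which(n_fields != n_fields[[1L]])
print(bad_lines)
## [1] 3
```

Pointing `path` at a downloaded copy of the real file should give the line numbers worth quoting in an upstream report, though that has not been run here.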
OK, an issue has been filed with the cBioPortal datahub team here: https://github.com/cBioPortal/datahub/issues/1820
I can confirm that fixing the `data_sv.txt` file resolves this issue:
https://github.com/cBioPortal/datahub/issues/1820#issuecomment-1540690313
## First, replace the `data_sv.txt` file in the extracted directory.
object <- cBioPortalData::loadStudy("brca_tcga_pan_can_atlas_2018", cleanup = FALSE)
## A MultiAssayExperiment object of 18 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 18:
## [1] armlevel_cna: SummarizedExperiment with 39 rows and 1084 columns
## [2] cna_hg19.seg: RaggedExperiment with 210376 rows and 1068 columns
## [3] cna: SummarizedExperiment with 25128 rows and 1070 columns
## [4] log2_cna: SummarizedExperiment with 25128 rows and 1070 columns
## [5] methylation_hm27_hm450_merged: SummarizedExperiment with 22601 rows and 1066 columns
## [6] microbiome: SummarizedExperiment with 1406 rows and 1070 columns
## [7] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 20531 rows and 1082 columns
## [8] mrna_seq_v2_rsem_zscores_ref_diploid_samples: SummarizedExperiment with 20471 rows and 1082 columns
## [9] mrna_seq_v2_rsem_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 1082 columns
## [10] mrna_seq_v2_rsem: SummarizedExperiment with 20531 rows and 1082 columns
## [11] mutations: RaggedExperiment with 130495 rows and 1009 columns
## [12] phosphoprotein_quantification: SummarizedExperiment with 18806 rows and 105 columns
## [13] protein_quantification_zscores: SummarizedExperiment with 9733 rows and 105 columns
## [14] protein_quantification: SummarizedExperiment with 9733 rows and 105 columns
## [15] rppa_zscores: SummarizedExperiment with 198 rows and 876 columns
## [16] rppa: SummarizedExperiment with 198 rows and 876 columns
## [17] mrna_seq_v2_rsem_normal_samples_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
## [18] mrna_seq_v2_rsem_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Seeing parsing issues for:
- blca_plasmacytoid_mskcc_2016/data_sv.txt
- brca_tcga_pan_can_atlas_2018/data_sv.txt
- coadread_tcga_pan_can_atlas_2018/data_sv.txt
- ov_tcga_pan_can_atlas_2018/data_sv.txt
- sarc_tcga_pan_can_atlas_2018/data_gene_panel_matrix.txt
Thanks for putting this together, @mjsteinbaugh. We will take a look at the data and file issues at cBioPortal.
@LiNk-NY I put together a pretty nifty script that attempts to process all of the datasets at cBioPortal. I'll update the list of failures here once it finishes running.
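The general shape of such a batch run can be sketched as below. This is not the script mentioned above, only an illustration under stated assumptions: `collect_failures`, `loader`, and `toy_loader` are hypothetical names, and `loader` stands in for whatever function loads a single study. Each call is wrapped in `tryCatch` so that one failing study does not stop the loop, and the error messages are collected by study ID.

```r
## `loader` is a hypothetical stand-in for a function that loads one study by ID.
collect_failures <- function(study_ids, loader) {
    messages <- vapply(
        X = study_ids,
        FUN = function(id) {
            tryCatch(
                expr = {
                    loader(id)
                    ## No error: mark this study as successful.
                    NA_character_
                },
                ## On error: keep the message instead of aborting the whole run.
                error = function(e) conditionMessage(e)
            )
        },
        FUN.VALUE = character(1L),
        USE.NAMES = TRUE
    )
    ## Return only the failures, as a character vector named by study ID.
    messages[!is.na(messages)]
}

## Toy loader for demonstration only: fails for one made-up study ID.
toy_loader <- function(id) {
    if (identical(id, "bad_study")) {
        stop("more columns than column names")
    }
    invisible(TRUE)
}

failures <- collect_failures(
    study_ids = c("good_study", "bad_study"),
    loader = toy_loader
)
print(names(failures))
## [1] "bad_study"
```

In a real run, `loader` could be something like `function(id) cBioPortalData::cBioDataPack(id, ask = FALSE)`, matching the call at the top of this thread.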
@mjsteinbaugh Have you taken a look at the long tests folder? https://github.com/waldronlab/cBioPortalData/tree/devel/longtests/testthat
OK, here's an updated list of datasets with processing issues:
brca_tcga_pan_can_atlas_2018
ccrcc_utokyo_2013
coadread_tcga_pan_can_atlas_2018
gbm_cptac_2021
ihch_msk_2021
ihch_mskcc_2020
luad_mskimpact_2021
mbl_dkfz_2017
mbn_mdacc_2013
mixed_msk_tcga_2021
mixed_selpercatinib_2020
mpnst_mskcc
ov_tcga_pan_can_atlas_2018
pan_origimed_2020
pcpg_tcga_pub
sarc_tcga_pan_can_atlas_2018
stad_tcga_pub
ucec_ccr_msk_2022