brca_tcga_pan_can_atlas_2018 data_sv.txt is not formatted correctly

Open mjsteinbaugh opened this issue 1 year ago • 4 comments

Hi cBioPortal team,

I noticed a parsing issue with data_sv.txt from the brca_tcga_pan_can_atlas_2018 dataset.

Here's a reproducible example in R:

## using pipette package
## https://github.com/acidgenomics/r-pipette
con <- "https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt"
tsv <- pipette::import(con = con, format = "tsv", engine = "base")
→ Importing <https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt> using base::`read.table()`.
Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",  :
  more columns than column names
Calls: <Anonymous> ... import -> .local -> do.call -> do.call -> <Anonymous>

This file is not correctly formatted: the header defines 13 column names, but data rows contain 17 columns. I'm happy to provide a pull request to fix this issue.

Best, Mike

mjsteinbaugh, May 03 '23 20:05

Here's a slightly more informative message from data.table's fread engine:

> tsv <- pipette::import(con = con, format = "tsv", engine = "data.table")
→ Importing <https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt> using data.table::`fread()`.
Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL,  :
  Detected 13 column names but the data has 17 columns (i.e. invalid file). Added 4 extra default column names at the end.
Calls: <Anonymous> ... import -> .local -> do.call -> do.call -> <Anonymous>
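
To see which lines disagree with the header, one option is to tabulate the number of tab-separated fields per line with base R's `utils::count.fields()`. This is only a minimal sketch: `field_count_table()` is a hypothetical helper (not part of pipette or cBioPortalData), and it assumes the first line of the file is the header and that fields are never quoted.

```r
## Hypothetical helper: summarize tab-separated field counts per line.
## Assumes line 1 is the header and that the file uses no quoting.
field_count_table <- function(file) {
    n <- utils::count.fields(
        file = file,
        sep = "\t",
        quote = "",              # treat quote characters as ordinary text
        comment.char = "",       # do not drop anything after '#'
        blank.lines.skip = TRUE
    )
    list(
        header = n[[1L]],        # number of column names in the header
        body = table(n[-1L])     # how many data lines have each field count
    )
}

## Example usage (downloads the file first; URL taken from the report above):
## tmp <- tempfile(fileext = ".txt")
## download.file(con, destfile = tmp)
## field_count_table(tmp)
```

If `header` differs from the names of `body`, the file has the same kind of mismatch that `read.table()` and `fread()` complain about above.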

mjsteinbaugh, May 03 '23 20:05

Here's a fixed version of the file that now works with cBioPortalData in Bioconductor:

data_sv.txt

mjsteinbaugh, May 09 '23 18:05

Thanks for noticing that and sharing the fix @mjsteinbaugh. We will update it in the portal.

rmadupuri, May 09 '23 21:05

@rmadupuri I noticed that a few other files aren't formatted correctly either. @LiNk-NY and I are putting together a list, and I'll update the thread here. Thanks!!

mjsteinbaugh, May 09 '23 21:05