GenomicDataCommons

gdc_rnaseq.R: gdc_rnaseq() on workflows other than "HTSeq - Counts" produces errors

Open lyijin opened this issue 5 years ago • 2 comments

sorry for raising two successive issues with this R script.

i previously used the function to cache and return a SummarizedExperiment of HTSeq - Counts from TCGA data, and it worked fine without any hitches (well, i do get HTTP 429 errors when i tried to cache the entirety of TCGA, but i solved it by caching the individual projects of TCGA).

however, when i wanted to test something on the FPKM level, the function produces the same errors on both machines that i tried the command on.

i ran tcga_se <- gdc_rnaseq("TCGA-CHOL", "HTSeq - FPKM")

the caching ran fine, but it dies with:

Error in names(x) <- value :
  'names' attribute [1] must be the same length as the vector [0]

i'm not really sure what the error means, sorry--if you guys can replicate the error, could you look into why this error was produced? thanks!

(and btw if i could request a nitpicky improvement, could you please suppress the output

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_double()
)

that floods my screen every time i run the function gdc_rnaseq()? thanks!)
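Until the function itself is changed, the message can be silenced from the caller's side. This is a minimal sketch, assuming the column specification is emitted through R's message mechanism (readr does this when no `col_types` is supplied). `noisy_import()` is a hypothetical stand-in for the file reader and is not part of gdc_rnaseq.R.

```r
# Hypothetical stand-in for the reader inside gdc_rnaseq(): it emits the
# same kind of message and then returns a small data frame.
noisy_import <- function() {
  message("Parsed with column specification:\ncols(\n  X1 = col_character(),\n  X2 = col_double()\n)")
  data.frame(X1 = "ENSG01", X2 = 1.5, stringsAsFactors = FALSE)
}

# suppressMessages() muffles everything sent via message() while the
# wrapped expression runs; the return value is passed through unchanged.
quiet_result <- suppressMessages(noisy_import())
print(quiet_result)
```

The same wrapper should work around the real call, e.g. `tcga_se <- suppressMessages(gdc_rnaseq("TCGA-CHOL", "HTSeq - Counts"))`, at the cost of hiding every other message too. Inside the script, passing an explicit `col_types` (for a character plus double file, `"cd"`) to the readr call would likely remove the output at the source.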

lyijin, Jul 17 '18 07:07

i think i found out what was wrong with the script: lines 146-148 are crashing it.

    mat_qc = data.frame(t(mat[qc_idx, -1]))
    colnames(mat_qc) = paste0('qc',mat[qc_idx,1])
    coldata = dplyr::bind_cols(coldata,mat_qc)

"HTSeq - Counts" files contain three extra lines at the bottom that start with "__" (double underscores), and from what i can tell, the code moves these lines into the colData of the SummarizedExperiment. FPKM / FPKM-UQ files do not contain lines with double underscores in them, so there is nothing to move and this block crashes.
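The failure can be reproduced without any GDC data. This is a minimal sketch using a toy data frame in place of real FPKM output. The definition of `qc_idx` in gdc_rnaseq.R is not shown in this thread, so the `grep()` call below is a guess, and the real `mat` may be a tibble rather than a base data frame.

```r
# Toy stand-in for an FPKM table: no rows whose ID starts with "__".
mat <- data.frame(
  X1 = c("ENSG01", "ENSG02"),
  sample_a = c(1.5, 0.0),
  sample_b = c(2.5, 3.0),
  stringsAsFactors = FALSE
)

# Assumed definition of qc_idx; it is empty for FPKM-style input.
qc_idx <- grep("^__", mat[[1]])

# Selecting zero rows and transposing gives a data frame with zero columns.
mat_qc <- data.frame(t(mat[qc_idx, -1]))
print(dim(mat_qc))

# paste0() still returns a single string when its second argument is empty.
new_names <- paste0("qc", mat[qc_idx, 1])
print(length(new_names))

# Assigning one name to zero columns fails; capture the message to inspect it.
err <- tryCatch(
  {
    colnames(mat_qc) <- new_names
    NULL
  },
  error = function(e) conditionMessage(e)
)
print(err)
```

In this sketch it is the `colnames()` assignment, the second of the three lines, that raises the `'names' attribute` error, because one name is being assigned to a data frame with no columns.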

i've monkey-patched my version to completely drop these three lines. the effect is that i've avoided the crash when i ask for HTSeq - FPKM, but i do lose some information when the same function works on HTSeq - Counts files. i guess one could wrap these three lines in an if block that only gets executed when workflow_type == 'HTSeq - Counts', but i didn't need the info for now, hence i didn't mind the trade-off.
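A guard of that kind could look roughly like the following. This is a sketch under assumptions, not the patched script: `attach_qc()` is a hypothetical helper, the `grep()` definition of `qc_idx` is a guess, `cbind()` stands in for `dplyr::bind_cols()` so the example runs with base R only, and the guard tests whether any "__" rows exist instead of comparing `workflow_type`.

```r
# Hypothetical helper: attach the "__" QC rows to coldata only if present.
attach_qc <- function(mat, coldata) {
  qc_idx <- grep("^__", mat[[1]])  # assumed definition of qc_idx
  if (length(qc_idx) == 0) {
    return(coldata)  # e.g. FPKM / FPKM-UQ input: nothing to attach
  }
  # One row per sample, one column per QC line.
  mat_qc <- data.frame(t(mat[qc_idx, -1, drop = FALSE]))
  colnames(mat_qc) <- paste0("qc", mat[[1]][qc_idx])
  cbind(coldata, mat_qc)  # base-R stand-in for dplyr::bind_cols()
}

# Toy inputs: a counts-style table with two "__" rows and an FPKM-style
# table without any.
counts <- data.frame(
  X1 = c("ENSG01", "__no_feature", "__ambiguous"),
  sample_a = c(10, 3, 1),
  sample_b = c(20, 4, 2),
  stringsAsFactors = FALSE
)
fpkm <- counts[1, ]
coldata <- data.frame(
  sample = c("sample_a", "sample_b"),
  stringsAsFactors = FALSE
)

with_qc <- attach_qc(counts, coldata)
without_qc <- attach_qc(fpkm, coldata)
print(with_qc)
print(without_qc)
```

Checking `length(qc_idx)` keeps the QC information for counts files and avoids hard-coding workflow names, so any workflow whose files lack "__" lines is handled the same way.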

happy to share my code if others are facing the same problem while waiting for the code to be patched.

lyijin, Aug 01 '18 07:08

Thank you for sharing your troubleshooting! I experienced the same problem.

In my case, I tried making a custom function that resembles gdc_rnaseq to fix it without re-installation.

The .htseq_importer function also needed to be defined in this case.

hd00ljy, Aug 02 '21 07:08