TCGAbiolinks

error in GDCprepare

Open ch8316f5eyu opened this issue 3 years ago • 11 comments

I encountered an error in GDCprepare. Here is the code:

query.exp <- GDCquery(
    project = 'CPTAC-3',
    legacy = FALSE,
    data.category = "Transcriptome Profiling",
    data.type = 'Gene Expression Quantification',
    workflow.type = 'HTSeq - Counts',
    experimental.strategy = "RNA-Seq"
)
GDCdownload(query.exp)
x <- GDCprepare(
    query = query.exp,
    save = TRUE,
    save.filename = paste0('~/project/cancer/TCGA/exp_CPTAC-3.rda')
)

The error occurs after GDCprepare:

|==================================================================================================================================|100% Completed after 1 m
Starting to add information to samples
 => Add clinical information to samples
Error in xj[i] : invalid subscript type 'list'

Thanks.

ch8316f5eyu avatar Aug 31 '21 14:08 ch8316f5eyu

GDCprepare breaks for CPTAC-3 due to mixed samples.

query.exp <- GDCquery(
    project = 'CPTAC-3', 
    legacy = F,
    data.category = "Transcriptome Profiling", 
    data.type = 'Gene Expression Quantification',
    workflow.type = 'HTSeq - Counts', 
    experimental.strategy = "RNA-Seq"
)

# keep only the first 100 files so the error reproduces quickly
query.exp$results[[1]] <- query.exp$results[[1]][1:100, ]
GDCdownload(query.exp, files.per.chunk = 100)
x <- GDCprepare(query = query.exp, save = FALSE)

tiagochst avatar Oct 25 '21 00:10 tiagochst

Same bug in CPTAC-3 when using GDCprepare. Please help!

query.exp = GDCquery(
    project = "CPTAC-3",
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    data.format = "TSV",
    workflow.type = "STAR - Counts"
)
GDCdownload(query.exp, method = "api", files.per.chunk = 10)

Downloading data for project CPTAC-3
Of the 1883 files for download 1883 already exist.
All samples have been already downloaded

pre.exp = GDCprepare(query = query.exp)

|===============================================================================================================================|100% Completed after 1 m
Error in `levels<-`(`*tmp*`, value = as.character(levels)) :
  factor level [81] is duplicated

huiyijiangling avatar Jul 31 '22 17:07 huiyijiangling

@huiyijiangling Thank you for reporting this bug. It seems that CPTAC-3 barcodes do not distinguish replicates the way barcodes in other projects do, but I need to double-check. For example, C3N-02765-02 has 4 files with counts.
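
One way to check how many files share a sample ID is to tabulate the query results. The following is a minimal sketch on a toy data frame, assuming the results table has a `sample.submitter_id` column (the column used elsewhere in this thread). The file names and every ID other than C3N-02765-02 are invented for illustration.

```r
# Toy stand-in for query.exp$results[[1]]; file names and the C3L IDs are invented.
results <- data.frame(
  file_name = paste0("file_", 1:6, ".tsv"),
  sample.submitter_id = c(
    "C3N-02765-02", "C3N-02765-02", "C3N-02765-02", "C3N-02765-02",
    "C3L-00001-01", "C3L-00002-01"
  ),
  stringsAsFactors = FALSE
)

# Count files per sample ID and keep only the IDs that occur more than once.
files_per_sample <- table(results$sample.submitter_id)
replicated <- files_per_sample[files_per_sample > 1]
print(replicated)
```

On a real query, replacing the toy data frame with `query.exp$results[[1]]` should list the sample IDs that have more than one file.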

tiagochst avatar Aug 01 '22 12:08 tiagochst

@huiyijiangling There are 28 duplicated samples. For the moment, you can remove those samples and the code should work. I need to think more about how to handle this case without breaking the other projects and other parts of the code. I will probably need to concatenate the sample and analyte IDs for CPTAC-3 (e.g., C3N-02765-02_CPT0184450060 instead of C3N-02765-02).

query.exp <- GDCquery(
    project = 'CPTAC-3',
    legacy = FALSE,
    data.category = "Transcriptome Profiling",
    data.type = 'Gene Expression Quantification',
    workflow.type = "STAR - Counts"
)
# remove duplicated samples, keeping the first file for each sample ID
query.exp$results[[1]] <- query.exp$results[[1]][!duplicated(query.exp$results[[1]]$sample.submitter_id), ]

GDCdownload(query.exp, files.per.chunk = 40)
se <- GDCprepare(
    query = query.exp,
    save = FALSE
)
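
To illustrate what the `duplicated()` filter does, and what a concatenated sample/analyte ID might look like, here is a toy sketch. The `analyte_id` column name is hypothetical, as are all IDs other than C3N-02765-02 and CPT0184450060. This is not how TCGAbiolinks builds its barcodes.

```r
# Toy stand-in for query.exp$results[[1]].
# `analyte_id` is a hypothetical column name; most IDs are invented.
results <- data.frame(
  sample.submitter_id = c("C3N-02765-02", "C3N-02765-02", "C3L-00001-01"),
  analyte_id = c("CPT0184450060", "CPT0000000001", "CPT0000000002"),
  stringsAsFactors = FALSE
)

# Option 1 (the workaround): keep only the first file for each sample ID.
deduplicated <- results[!duplicated(results$sample.submitter_id), ]

# Option 2 (the idea under consideration): a composite ID that stays unique.
results$composite_id <- paste(
  results$sample.submitter_id, results$analyte_id, sep = "_"
)

print(nrow(deduplicated))
print(results$composite_id)
```

The first option discards replicate files. The second keeps every file but changes the identifier, which is why it could affect other parts of the code.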

tiagochst avatar Aug 01 '22 13:08 tiagochst

Thank you for the workaround for removing duplicated samples. CPTAC-3 often uses mixed samples in RNA-seq and protein expression quantification, either for QC or to increase tissue content. These samples have different filenames, but their barcode/submitted_case_id/submitted_sample_id values are not unique. I will use your workaround for now, and I look forward to seeing the problem fixed in the next version. Thank you again!

huiyijiangling avatar Aug 01 '22 15:08 huiyijiangling

Hello! When I use the same code to download the TARGET-AML dataset, which also has duplicated samples, I get the same error.

query.exp <- GDCquery(
    project = 'TARGET-AML',
    legacy = FALSE,
    data.category = "Transcriptome Profiling",
    data.type = 'Gene Expression Quantification',
    workflow.type = "STAR - Counts"
)
# remove duplicated
query.exp$results[[1]] <- query.exp$results[[1]][!duplicated(query.exp$results[[1]]$sample.submitter_id), ]

GDCdownload(query.exp, files.per.chunk = 40)
se <- GDCprepare(
    query = query.exp,
    save = FALSE
)

yiyisun682 avatar Oct 27 '22 13:10 yiyisun682

The error messages are as follows:

yiyisun682 avatar Oct 27 '22 13:10 yiyisun682

 => Add clinical information to samples
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length

yiyisun682 avatar Oct 27 '22 13:10 yiyisun682

I have the same error with the same project. Did you find a solution?

itscarolnunes avatar Jan 31 '23 00:01 itscarolnunes

Apologies in advance for my speculating - I don't have the most experience with code!

Also having this issue, here's a traceback:

Starting to add information to samples
Adding description to TARGET samples
Warning: Expected 5 pieces. Additional pieces discarded in 187 rows [57, 74, 77, 90, 95, 215, 240, 244, 279, 296, 313, 406, 411, 445, 453, 492, 498, 505, 507, 529, ...].
 => Add clinical information to samples
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
> traceback()
9: stop("invalid 'row.names' length")
8: `.rowNamesDF<-`(x, value = value)
7: `row.names<-.data.frame`(`*tmp*`, value = value)
6: `row.names<-`(`*tmp*`, value = value)
5: `rownames<-`(`*tmp*`, value = barcode)
4: colDataPrepare(cases)
3: makeSEfromTranscriptomeProfilingSTAR(data = df, cases = cases)
2: readTranscriptomeProfiling(files = files, data.type = ifelse(!is.na(query$data.type), 
       as.character(query$data.type), unique(query$results[[1]]$data_type)), 
       workflow.type = unique(query$results[[1]]$analysis_workflow_type), 
       cases = cases, summarizedExperiment)
1: GDCprepare(query)

Following this back through the GDCprepare source code: the colDataPrepare function correctly identifies the samples as TARGET samples and calls colDataPrepareTARGET, as shown by the "Adding description to TARGET samples" output from within that function. Somewhere within that function the code expects 5 pieces and drops the extras (as seen in the warning), then proceeds through the remainder of colDataPrepare (as shown by the "Add clinical information to samples" output).

Running debug(colDataPrepare), I can see that the DFrame 'ret' returned by colDataPrepareTARGET has rows of NAs where the warning indicates data was dropped. Execution then reaches the last line of colDataPrepare, where the data frame's row.names are set to the sample barcodes. The problem is that you still have the original X barcodes that were passed to colDataPrepareTARGET, which returned (X-187) samples, so the code tries to set the row.names of an (X-187)-row data frame with a vector of length X. That throws the "invalid 'row.names' length" error.
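
The length mismatch described above can be reproduced in isolation. This is a minimal sketch with invented data, independent of TCGAbiolinks: a 3-row data frame stands in for the shortened table and 5 invented names stand in for the original barcode vector.

```r
# A data frame with 3 rows stands in for the shortened sample table.
ret <- data.frame(value = c(1, 2, 3))

# Five invented names stand in for the original, longer barcode vector.
barcode <- paste0("sample_", 1:5)

# Assigning 5 row names to 3 rows fails; capture the message instead of stopping.
msg <- tryCatch(
  {
    rownames(ret) <- barcode
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message should be the same kind of row.names length error shown in the traceback, which is consistent with the barcode vector and the returned table having different lengths.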

I believe this is happening because of the following code in colDataPrepareTARGET:

regex <- paste0("[:alnum:]{5}-[:alnum:]{2}-[:alnum:]{6}", "-[:alnum:]{3}-[:alnum:]{3}")
samples <- str_match(barcode, regex)[,1]

Here the sample IDs are screened for the TARGET format: 5 alphanumeric characters, followed by 2, then 6, then 3, then 3, each group separated by a hyphen. This is indeed the format of TARGET IDs such as "TARGET-20-PARNFZ-03A-01R". However, the TARGET-AML dataset also includes some samples formatted like "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R", which do not match the regex. There happen to be 187 of these in my query, matching the 187 discarded rows in the warning.
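
The mismatch can be checked directly on the two example barcodes. This sketch uses base R `grepl()` with the POSIX class written as `[[:alnum:]]`. That is an assumed base-R equivalent of the stringr pattern quoted above, not the package's own code.

```r
# Base-R rendering of the quoted pattern (assumption: equivalent to the stringr one).
regex <- paste0(
  "[[:alnum:]]{5}-[[:alnum:]]{2}-[[:alnum:]]{6}",
  "-[[:alnum:]]{3}-[[:alnum:]]{3}"
)

# The two barcode formats mentioned in this thread.
barcodes <- c(
  "TARGET-20-PARNFZ-03A-01R",
  "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R"
)

# TRUE where the barcode contains the expected five-part structure.
matches <- grepl(regex, barcodes)
print(setNames(matches, barcodes))
```

The standard barcode matches and the "Sorted-leukemic" one does not, because the extra hyphenated words break the 6-3-3 tail that the pattern requires.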

@tiagochst I assume these functions were written when all TARGET-AML samples matched that regex, so this error didn't exist. I'm not sure how to get around this issue short of dropping the 187 samples (which I'd rather not do!), and in all honesty I'm not 100% sure how to drop those 187 specific samples from the query before attempting to prepare it. Any hope for a solution?

jaygamma avatar Jul 11 '23 22:07 jaygamma

@yiyisun682 @itscarolnunes you can bypass the error you're experiencing by running the following code immediately before GDCprepare():

library(dplyr)  # provides %>% and filter()
query$results[[1]] <- query$results[[1]] %>% filter(nchar(cases) == 24)

This pulls your query results out, keeps only the case IDs that are exactly 24 characters long (and are therefore correctly formatted to pass the regex check), and writes the filtered results back into the query.

Doing this reduced my query size from 3064 cases to 2809, but GDCprepare() then completes without error.
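
Before applying the filter, it may help to see which cases it would remove. Below is a small base-R sketch on a toy data frame. It assumes only that the results table has a `cases` column, as in the snippet above; the two barcodes are the examples quoted earlier in this thread.

```r
# Toy stand-in for query$results[[1]]; only the `cases` column matters here.
results <- data.frame(
  cases = c(
    "TARGET-20-PARNFZ-03A-01R",
    "TARGET-20-PAYHMK-Sorted-leukemic-09A-01R"
  ),
  stringsAsFactors = FALSE
)

# Base-R equivalent of filter(nchar(cases) == 24), plus the complement.
is_standard <- nchar(results$cases) == 24
kept <- results[is_standard, , drop = FALSE]
dropped <- results[!is_standard, , drop = FALSE]

print(kept$cases)
print(dropped$cases)
```

Inspecting `dropped` on the real query shows exactly which samples are lost, which matters if the non-standard samples are needed for the analysis.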

jaygamma avatar Jul 17 '23 17:07 jaygamma