datahub
update CCLE dataset
We have got permission from CCLE to update our data to their latest dataset.
- [ ] Create an issue on datahub before curating a study (one issue per study) and copy this checklist to the issue tracker
- [ ] List information of the dataset/paper in the issue, e.g. pmid, paper link, suppl file link
- [ ] Document the curation process, e.g. how and by whom the data was transformed
- [ ] Follow the data checklist
- [ ] Create a pull request to datahub once the data is curated
- [ ] Push to triage portal
- [ ] Import into msk and public portal database
- [ ] Update cBioPortal news
- Download data from https://portals.broadinstitute.org/ccle/data
- Copy number is missing from the latest dataset. Let's use the old one, e.g. CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct as the discrete CNA data, CCLE_copynumber_byGene_2013-12-03.txt as the linear one, and CCLE_copynumber_2013-12-03.seg.txt as the seg.
- The license of this dataset should be different. Please refer to CCLE's original terms of access: https://portals.broadinstitute.org/ccle/about#terms
@pieterlukasse they also have drug profiling data. It may be useful for the feature you are developing.
- [x] CCLE_DepMap_18q3_maf_20180718.txt -> data_mutations_extended.txt
- [ ] CCLE_DepMap_18q3_RNAseq_RPKM_20180718.gct -> meta_RNA_Seq_expression_median.txt & meta_RNA_Seq_mRNA_median_Zscores.txt
- [ ] CCLE_miRNA_20180525.gct -> data_expression_miRNA.txt
- [x] CCLE_RRBS_TSS_1kb_20180614.txt -> data_methylation.txt (what file name should we use here?)
- [ ] CCLE_RPPA_20180123.csv -> data_rppa.txt
- [x] CCLE_copynumber_2013-12-03.seg.txt -> data_cna_hg19.seg
- [x] CCLE_copynumber_byGene_2013-12-03.txt -> data_linear_CNA.txt
- [x] CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct -> data_CNA.txt (?)
- [ ] ? -> data_clinical.txt
- [ ] CCLE_NP24.2009_Drug_data_2015.02.24.csv -> ?
Question: for mutation data, how many samples are covered in the new dataset? Should we include the old mutation data for the samples that are not covered?
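One way to answer the coverage question is to compare the sample IDs in the old and new mutation files. The following is a minimal sketch, assuming both files are tab-delimited MAF-style tables with the standard `Tumor_Sample_Barcode` column (the actual CCLE files may name this column differently). The function names and the sample IDs in the usage lines are hypothetical.

```python
import csv
import io


def maf_sample_ids(handle):
    """Return the set of sample IDs found in a MAF-style TSV.

    Assumes the sample column is named Tumor_Sample_Barcode.
    """
    # Drop comment lines such as "#version 2.4" that may precede the header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["Tumor_Sample_Barcode"] for row in reader}


def uncovered_samples(old_handle, new_handle):
    """Return samples present in the old MAF but absent from the new one."""
    return maf_sample_ids(old_handle) - maf_sample_ids(new_handle)


# Tiny in-memory example with made-up sample IDs.
old = io.StringIO(
    "Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tA_LUNG\nKRAS\tB_SKIN\n"
)
new = io.StringIO(
    "#version 2.4\nHugo_Symbol\tTumor_Sample_Barcode\nTP53\tA_LUNG\n"
)
print(sorted(uncovered_samples(old, new)))
```

With real data, the two handles would be the opened old and new MAF files. The size of the returned set shows how many samples would lose mutation data if the old file were dropped.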
@sandertan do we have scripts that can help with parsing the new CCLE data?
@pieterlukasse Yes we have code to parse the old CNA data.
@jjgao I created a gist with our code to do that. It also includes a step to run GISTIC for discrete CNA, but if you are using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct for that, that step can be ignored.
If any code to transform/remap the data is used, I think it would be nice to include it in the staging files, to let users and our future selves know how we processed it.
https://gist.github.com/sandertan/904874cb8d6b78076cdffb927412d0fe
Thanks, @sandertan. Agreed. We should document data processing steps and ideally link them in the profile description.
testing on triage.
- [ ] replace the study ccle_broad instead of creating a new one
- [ ] removing potential germline variants
- [ ] adding oncotree code (not mixed) to all samples
- [ ] rna-seq data missing
- [ ] methylation data missing
- [ ] rppa data missing
- [ ] the drug profiling data would be interesting to have. @pieterlukasse is there a data format you are following?
@jjgao thanks for the update. Yes, we have already defined a data format for drug (or treatment, where treatment is a combination of two or more drugs) profiling. See https://github.com/thehyve/cbioportal/blob/treatment_study_implementation_rebase/docs/File-Formats.md#treatment-data by @pvannierop (PR to follow soon).
Here is an updated to-do list now that CCLE has new data.
- [x] CCLE_DepMap_18q3_maf_20180718.txt -> data_mutations_extended.txt
- [x] CCLE_Fusions_20181130.txt -> data_fusions.txt
- [x] CCLE_RNAseq_genes_rpkm_20180929.gct.gz -> data_RNA_Seq_expression_median.txt & data_RNA_Seq_mRNA_median_Zscores.txt
- [ ] CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz -> data_RNA_Seq_v2_expression_median.txt & data_RNA_Seq_v2_mRNA_median_Zscores.txt
- [ ] CCLE_miRNA_20181103.gct -> data_expression_miRNA.txt
- [ ] CCLE_RRBS_TSS1kb_20181022.txt -> data_methylation.txt (what file name should we use here?)
- [ ] CCLE_RPPA_20181003.csv -> data_rppa.txt
- [ ] CCLE_RPPA_Ab_info_20181226.csv -> gene panel for rppa
- [x] ? -> data_cna_hg19.seg (CCLE_copynumber_2013-12-03.seg.txt is too old)
- [ ] ? -> data_linear_CNA.txt (CCLE_copynumber_byGene_2013-12-03.txt is too old)
- [x] ? -> data_CNA.txt (gene level discrete copy number data)
- [ ] ? -> data_clinical.txt
- [ ] CCLE_NP24.2009_Drug_data_2015.02.24.csv -> ?
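Several of the source files above are in GCT format, while the target data files are plain tab-delimited tables. As a rough sketch of that conversion, the function below reads a GCT 1.2 file (version line, dimensions line, then a header of `Name`, `Description`, and sample columns) and writes a table keyed by `Hugo_Symbol`. It assumes the `Description` column holds the gene symbol, which should be checked against each CCLE file. The function name is hypothetical, and this is not the code from the gist linked earlier.

```python
import csv
import io


def gct_to_table(gct_handle, out_handle):
    """Convert a GCT 1.2 file into a tab-delimited gene-by-sample table.

    Assumption: the Description column contains the gene symbol.
    Returns the number of data rows written.
    """
    version = gct_handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")

    # Second line: number of data rows and number of sample columns.
    n_rows, n_cols = (int(x) for x in gct_handle.readline().split()[:2])

    reader = csv.reader(gct_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")

    header = next(reader)
    samples = header[2:]  # skip the Name and Description columns
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the GCT dimensions line")
    writer.writerow(["Hugo_Symbol"] + samples)

    written = 0
    for row in reader:
        if not row:
            continue  # tolerate trailing blank lines
        writer.writerow([row[1]] + row[2:])
        written += 1

    if written != n_rows:
        raise ValueError("row count does not match the GCT dimensions line")
    return written
```

Checking the counts against the dimensions line is a cheap guard against truncated downloads. Any further steps, such as de-duplicating gene symbols or computing z-scores, are left out here.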
@jjgao @ritikakundra any updates on this one?
@jjgao @ritikakundra I think we need to add the mRNA data before rolling out the treatments feature - need to see what treatments and mRNA look like side by side in the Heatmap menu.
Can we also copy over the seg file from the old study?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am reopening this. @ritikakundra could you check if everything is done and then close this? If not, maybe create another issue for the remaining items?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Working on releasing the DepMap 23Q4 version.