update CCLE dataset

Open jjgao opened this issue 6 years ago • 17 comments

We have got permission from CCLE to update our data to their latest dataset.

  • [ ] Create an issue on datahub before curating a study (one issue per study) and copy this checklist to the issue tracker

  • [ ] List information of the dataset/paper in the issue, e.g. pmid, paper link, suppl file link

  • [ ] Document the curation process, e.g. how and by whom the data was transformed

  • [ ] Follow the data checklist

  • [ ] Create a pull request to datahub once the data is curated

  • [ ] Push to triage portal

  • [ ] Import into msk and public portal database

  • [ ] Update cBioPortal news

  • download data from https://portals.broadinstitute.org/ccle/data

  • copy number is missing from the latest dataset. Let's use the old one. e.g. CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct as the discrete cna data, CCLE_copynumber_byGene_2013-12-03.txt as the linear one, and CCLE_copynumber_2013-12-03.seg.txt as the seg.

  • the license of this dataset should be different. Please refer to CCLE's original terms of access: https://portals.broadinstitute.org/ccle/about#terms

jjgao avatar Aug 04 '18 13:08 jjgao

@pieterlukasse they also have drug profiling data. It may be useful for the feature you are developing.

jjgao avatar Aug 04 '18 13:08 jjgao

  • [x] CCLE_DepMap_18q3_maf_20180718.txt -> data_mutations_extended.txt
  • [ ] CCLE_DepMap_18q3_RNAseq_RPKM_20180718.gct -> meta_RNA_Seq_expression_median.txt & meta_RNA_Seq_mRNA_median_Zscores.txt
  • [ ] CCLE_miRNA_20180525.gct -> data_expression_miRNA.txt
  • [x] CCLE_RRBS_TSS_1kb_20180614.txt -> data_methylation.txt (what file name should we use here?)
  • [ ] CCLE_RPPA_20180123.csv -> data_rppa.txt
  • [x] CCLE_copynumber_2013-12-03.seg.txt -> data_cna_hg19.seg
  • [x] CCLE_copynumber_byGene_2013-12-03.txt -> data_linear_CNA.txt
  • [x] CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct -> data_CNA.txt (?)
  • [ ] ? -> data_clinical.txt
  • [ ] CCLE_NP24.2009_Drug_data_2015.02.24.csv -> ?
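Several of the inputs above are `.gct` files. A minimal sketch of reshaping one into a tab-delimited table with a `Hugo_Symbol` column is below. It assumes the standard GCT 1.2 layout (a `#1.2` line, a dimensions line, then a `Name`/`Description` header) and assumes that the gene symbol sits in the `Description` column, which should be checked per file. The function name `gct_to_rows` is hypothetical and is not part of any existing datahub script.

```python
import csv


def gct_to_rows(gct_text, id_column="Description"):
    """Convert GCT 1.2 text into rows for a tab-delimited data file.

    GCT 1.2 layout: a "#1.2" version line, a "<n_rows>\t<n_cols>" line,
    a header "Name\tDescription\t<sample>...", then one row per feature.
    Assumption: `id_column` holds the gene symbol; verify this per file.
    """
    lines = gct_text.splitlines()
    if not lines or not lines[0].startswith("#1.2"):
        raise ValueError("not a GCT 1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])

    reader = csv.reader(lines[2:], delimiter="\t")
    header = next(reader)
    samples = header[2:]
    if len(samples) != n_cols:
        raise ValueError("sample count does not match the GCT header")

    id_index = header.index(id_column)
    rows = [["Hugo_Symbol"] + samples]
    for record in reader:
        if record:  # skip blank lines
            rows.append([record[id_index]] + record[2:])
    if len(rows) - 1 != n_rows:
        raise ValueError("row count does not match the GCT header")
    return rows
```

The returned rows can be written out with `csv.writer(handle, delimiter="\t")`. Values are passed through as strings, so no rounding or filtering happens in this step.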

Question: for mutation data, how many samples were covered in the new dataset? Should we include the old mutation data for the samples that are not covered?

jjgao avatar Aug 04 '18 13:08 jjgao

@sandertan do we have scripts that can help with parsing the new CCLE data?

pieterlukasse avatar Aug 07 '18 10:08 pieterlukasse

@pieterlukasse Yes we have code to parse the old CNA data.

@jjgao I created a gist with our code to do that. It also includes a step to run GISTIC for discrete CNA, but if you are using CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct for that, that step can be ignored.

If any code to transform/remap the data is used, I think it would be nice to include it in the staging files, to let users and our future selves know how we processed it.

https://gist.github.com/sandertan/904874cb8d6b78076cdffb927412d0fe

sandertan avatar Aug 08 '18 07:08 sandertan

thanks, @sandertan. Agreed. We should document data processing steps and ideally link them in the profile description.

jjgao avatar Aug 08 '18 21:08 jjgao

testing on triage.

  • [ ] replace the study ccle_broad instead of creating a new one
  • [ ] removing potential germline variants
  • [ ] adding oncotree code (not mixed) to all samples
  • [ ] rna-seq data missing
  • [ ] methylation data missing
  • [ ] rppa data missing
  • [ ] the drug profiling data would be interesting to have. @pieterlukasse is there a data format you are following?
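For the oncotree item in the list above, a rough sketch of a check that flags samples still lacking a specific code. It assumes a tab-delimited clinical sample file with `SAMPLE_ID` and `ONCOTREE_CODE` columns and `#`-prefixed metadata lines, and it reads "not mixed" as "the code must not be `MIXED`", which is an interpretation of the item and not something confirmed in this thread. The function name `samples_needing_oncotree` is hypothetical.

```python
import csv


def samples_needing_oncotree(clinical_text):
    """Return SAMPLE_IDs whose ONCOTREE_CODE is empty, NA, or MIXED.

    Assumption: lines starting with "#" are metadata and the first
    remaining line is the column header.
    """
    lines = [
        line
        for line in clinical_text.splitlines()
        if line and not line.startswith("#")
    ]
    flagged = []
    for row in csv.DictReader(lines, delimiter="\t"):
        code = (row.get("ONCOTREE_CODE") or "").strip().upper()
        if code in ("", "NA", "MIXED"):
            flagged.append(row["SAMPLE_ID"])
    return flagged
```

An empty result would mean every sample carries a specific code. Whether each code is a valid OncoTree entry is not checked here.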

jjgao avatar Nov 21 '18 15:11 jjgao

@jjgao thanks for the update. Yes, we have already defined a data format for drug (or treatment, where treatment is a combination of two or more drugs) profiling. See https://github.com/thehyve/cbioportal/blob/treatment_study_implementation_rebase/docs/File-Formats.md#treatment-data by @pvannierop (PR to follow soon).

pieterlukasse avatar Dec 11 '18 08:12 pieterlukasse

Here is an updated to-do list now that CCLE has new data.

  • [x] CCLE_DepMap_18q3_maf_20180718.txt -> data_mutations_extended.txt
  • [x] CCLE_Fusions_20181130.txt -> data_fusions.txt
  • [x] CCLE_RNAseq_genes_rpkm_20180929.gct.gz -> data_RNA_Seq_expression_median.txt & data_RNA_Seq_mRNA_median_Zscores.txt
  • [ ] CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz -> data_RNA_Seq_v2_expression_median.txt & data_RNA_Seq_v2_mRNA_median_Zscores.txt
  • [ ] CCLE_miRNA_20181103.gct -> data_expression_miRNA.txt
  • [ ] CCLE_RRBS_TSS1kb_20181022.txt -> data_methylation.txt (what file name should we use here?)
  • [ ] CCLE_RPPA_20181003.csv -> data_rppa.txt
  • [ ] CCLE_RPPA_Ab_info_20181226.csv -> gene panel for rppa
  • [x] ? -> data_cna_hg19.seg (CCLE_copynumber_2013-12-03.seg.txt is too old)
  • [ ] ? -> data_linear_CNA.txt (CCLE_copynumber_byGene_2013-12-03.txt is too old)
  • [x] ? -> data_CNA.txt (gene level discrete copy number data)
  • [ ] ? -> data_clinical.txt
  • [ ] CCLE_NP24.2009_Drug_data_2015.02.24.csv -> ?
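In line with the earlier suggestion to keep processing steps next to the staging files, the list above could also be kept as data. The sketch below copies the entries as written: `None` stands for a `?` source, an empty tuple for a `?` target, and the flag mirrors the checkbox. The names `CHECKLIST`, `pending` and `unresolved` are hypothetical, not an existing datahub convention.

```python
# (source file, target file(s), done?) copied from the checklist above.
# None = source still "?", () = target still "?".
CHECKLIST = [
    ("CCLE_DepMap_18q3_maf_20180718.txt", ("data_mutations_extended.txt",), True),
    ("CCLE_Fusions_20181130.txt", ("data_fusions.txt",), True),
    ("CCLE_RNAseq_genes_rpkm_20180929.gct.gz",
     ("data_RNA_Seq_expression_median.txt",
      "data_RNA_Seq_mRNA_median_Zscores.txt"), True),
    ("CCLE_RNAseq_rsem_genes_tpm_20180929.txt.gz",
     ("data_RNA_Seq_v2_expression_median.txt",
      "data_RNA_Seq_v2_mRNA_median_Zscores.txt"), False),
    ("CCLE_miRNA_20181103.gct", ("data_expression_miRNA.txt",), False),
    ("CCLE_RRBS_TSS1kb_20181022.txt", ("data_methylation.txt",), False),
    ("CCLE_RPPA_20181003.csv", ("data_rppa.txt",), False),
    # The target here is a description in the thread, not a file name.
    ("CCLE_RPPA_Ab_info_20181226.csv", ("gene panel for rppa",), False),
    (None, ("data_cna_hg19.seg",), True),
    (None, ("data_linear_CNA.txt",), False),
    (None, ("data_CNA.txt",), True),
    (None, ("data_clinical.txt",), False),
    ("CCLE_NP24.2009_Drug_data_2015.02.24.csv", (), False),
]


def pending(checklist):
    """Entries whose checkbox is still unticked."""
    return [entry for entry in checklist if not entry[2]]


def unresolved(checklist):
    """Entries where the source or the target is still a "?"."""
    return [entry for entry in checklist if entry[0] is None or not entry[1]]


if __name__ == "__main__":
    for source, targets, _ in pending(CHECKLIST):
        print(source or "?", "->", " & ".join(targets) or "?")
```

Running the module prints the open items in the same `source -> target` form used in the list.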

jjgao avatar Jan 31 '19 21:01 jjgao

@jjgao @ritikakundra any updates on this one?

pieterlukasse avatar May 14 '19 09:05 pieterlukasse

@jjgao @ritikakundra I think we need to add the mRNA data before rolling out the treatments feature - need to see what treatments and mRNA look like side by side in the Heatmap menu.

schultzn avatar Aug 30 '19 18:08 schultzn

Can we also copy over the seg file from the old study?

schultzn avatar Aug 30 '19 18:08 schultzn

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 11 '20 09:08 stale[bot]

I am reopening this. @ritikakundra could you check if everything is done and then close this? If not, maybe create another issue for the remaining items?

jjgao avatar Aug 25 '20 21:08 jjgao

Working on releasing depmap 23Q4 Version.

sbabyanusha avatar Jan 31 '24 19:01 sbabyanusha