cbioportal icon indicating copy to clipboard operation
cbioportal copied to clipboard

CNA long format (RFC 65)

Open BasLee opened this issue 2 years ago • 1 comments

Add long data format for CNA as described in RFC 65.

Looking for feedback on the new importer and functionality that should be shared between the new ImportCnaDiscreteLongData and ImportTabDelimData.

image Sample of new data format

In the current CNA data format the events of each gene x sample are stored in a tabular format, with a column for each sample and a row for each gene.

The new CNA 'long' data format contains a row for each gene x sample.

Thie new data format will import data using a new importer, the old data format is kept as is. Shared functionality should be extracted as much as possible.

Changes

  • Add new importer ImportCnaDiscreteLongData for CNA long format
  • Extract shared functionality of ImportTabDelimData and ImportCnaDiscreteLongData into CnaUtil

image Database schema change

Tests

  • Added new tests for ImportCnaDiscreteLongData in TestImportCnaDiscreteLongData

BasLee avatar Oct 12 '22 14:10 BasLee

There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

dippindots avatar Oct 18 '22 04:10 dippindots

There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

Good point! The validator should be updated. I discussed it with @pvannierop @oplantalech and we would like to do this in a separate PR.

BasLee avatar Oct 20 '22 12:10 BasLee

@averyniceday Yes, we restrict a study to have only one format. I do not see how the recognition of the new format will differ from any other data types. The meta file for discrete CNA will reference the format used (long or legacy wide). Or am I overlooking a complication here?

pvannierop avatar Oct 27 '22 08:10 pvannierop

@pvannierop This might be specific to how we import on our end and how the importer is implemented (in conjunction with how we organize datatypes in google sheets). We currently have a one to one mapping of meta to data files (e.g. meta_cna to data_cna). We also have a step inside the importer that merges records for genes (I imagine this is implemented to be specific to the legacy format).

If we are not planning on introducing a new meta/data pairing for the long format (e.g. data_cna_long/meta_cna_long) and instead only differentiating by a field inside the metafile, the importer will need to be updated to be able to differentiate between the two based on the metafile contents.

averyniceday avatar Oct 31 '22 17:10 averyniceday

@averyniceday The current wide format is still supported by this change (there is 100% backward compatibility). Therefore, I fail to see how this PR could possible interfere with your current processes. This PR merely adds a format. If you internally decide not to use this long format for the time being you should be fine.

And on a note of process, I think it is not really proper procedure to have merging of PRs to cBioPortal codebase to depend on update of internal tooling.

pvannierop avatar Nov 02 '22 14:11 pvannierop

@pvannierop @BasLee Agreed, this doesn't need to hold up merging into the cbioportal codebase. 👍

Could you also add documentation for the new format in the File Formats section?

averyniceday avatar Nov 02 '22 15:11 averyniceday

@averyniceday The documentation will be added for sure! We will do this in the PR that updates the python validator script.

pvannierop avatar Nov 02 '22 16:11 pvannierop

@BasLee CNA long format validator merged, please feel free to rebase this pr.

dippindots avatar Nov 14 '22 21:11 dippindots

@BasLee CNA long format validator merged, please feel free to rebase this pr.

Rebased and fixed some validation issues, most (relevant?) checks seem to run now

BasLee avatar Nov 16 '22 11:11 BasLee

After some discussion with @pvannierop:

The new DISCRETE_LONG format should probably not be passed to the front end, because the old 'wide' and the new 'long' CNA format should both result in exactly the same DISCRETE data after the import.

Only the importer should have knowledge of the DISCRETE_LONG format, and the importer should update it into DISCRETE. All other CNA logic can just use the old DISCRETE format.

These changes can be found in https://github.com/cBioPortal/cbioportal/pull/9847/commits/9c925246154d91e59103becab727746088fe978a

BasLee avatar Nov 17 '22 13:11 BasLee