cbioportal CNA long format (RFC 65)

Add long data format for CNA as described in RFC 65.

Looking for feedback on the new importer and functionality that should be shared between the new ImportCnaDiscreteLongData and ImportTabDelimData.

Sample of new data format

In the current CNA data format the events of each gene x sample are stored in a tabular format, with a column for each sample and a row for each gene.

The new CNA 'long' data format contains a row for each gene x sample.

Thie new data format will import data using a new importer, the old data format is kept as is. Shared functionality should be extracted as much as possible.

Changes

Add new importer ImportCnaDiscreteLongData for CNA long format
Extract shared functionality of ImportTabDelimData and ImportCnaDiscreteLongData into CnaUtil

Database schema change

Tests

Added new tests for ImportCnaDiscreteLongData in TestImportCnaDiscreteLongData

Oct 12 '22 14:10 BasLee

There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

Oct 18 '22 04:10 dippindots

There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py

Good point! The validator should be updated. I discussed it with @pvannierop @oplantalech and we would like to do this in a separate PR.

Oct 20 '22 12:10 BasLee

@averyniceday Yes, we restrict a study to have only one format. I do not see how the recognition of the new format will differ from any other data types. The meta file for discrete CNA will reference the format used (long or legacy wide). Or am I overlooking a complication here?

Oct 27 '22 08:10 pvannierop

@pvannierop This might be specific to how we import on our end and how the importer is implemented (in conjunction with how we organize datatypes in google sheets). We currently have a one to one mapping of meta to data files (e.g. meta_cna to data_cna). We also have a step inside the importer that merges records for genes (I imagine this is implemented to be specific to the legacy format).

If we are not planning on introducing a new meta/data pairing for the long format (e.g. data_cna_long/meta_cna_long) and instead only differentiating by a field inside the metafile, the importer will need to be updated to be able to differentiate between the two based on the metafile contents.

Oct 31 '22 17:10 averyniceday

@averyniceday The current wide format is still supported by this change (there is 100% backward compatibility). Therefore, I fail to see how this PR could possible interfere with your current processes. This PR merely adds a format. If you internally decide not to use this long format for the time being you should be fine.

And on a note of process, I think it is not really proper procedure to have merging of PRs to cBioPortal codebase to depend on update of internal tooling.

Nov 02 '22 14:11 pvannierop

@pvannierop @BasLee Agreed, this doesn't need to hold up merging into the cbioportal codebase. 👍

Could you also add documentation for the new format in the File Formats section?

Nov 02 '22 15:11 averyniceday

@averyniceday The documentation will be added for sure! We will do this in the PR that updates the python validator script.

Nov 02 '22 16:11 pvannierop

SonarCloud Quality Gate failed.

2 Bugs
0 Vulnerabilities
0 Security Hotspots
24 Code Smells

0.0% Coverage
0.0% Duplication

Nov 03 '22 09:11 sonarqubecloud[bot]

@BasLee CNA long format validator merged, please feel free to rebase this pr.

Nov 14 '22 21:11 dippindots

@BasLee CNA long format validator merged, please feel free to rebase this pr.

Rebased and fixed some validation issues, most (relevant?) checks seem to run now

Nov 16 '22 11:11 BasLee

After some discussion with @pvannierop:

The new DISCRETE_LONG format should probably not be passed to the front end, because the old 'wide' and the new 'long' CNA format should both result in exactly the same DISCRETE data after the import.

Only the importer should have knowledge of the DISCRETE_LONG format, and the importer should update it into DISCRETE. All other CNA logic can just use the old DISCRETE format.

These changes can be found in https://github.com/cBioPortal/cbioportal/pull/9847/commits/9c925246154d91e59103becab727746088fe978a

Nov 17 '22 13:11 BasLee

SonarCloud Quality Gate failed.

3 Bugs
0 Vulnerabilities
0 Security Hotspots
24 Code Smells

0.0% Coverage
0.0% Duplication

Nov 17 '22 13:11 sonarqubecloud[bot]

cbioportal cbioportal copied to clipboard

CNA long format (RFC 65)

Changes

Tests

cbioportal
cbioportal copied to clipboard