cbioportal
cbioportal copied to clipboard
CNA long format (RFC 65)
Add long data format for CNA as described in RFC 65.
Looking for feedback on the new importer and functionality that should be shared between the new ImportCnaDiscreteLongData and ImportTabDelimData.
Sample of new data format
In the current CNA data format the events of each gene x sample are stored in a tabular format, with a column for each sample and a row for each gene.
The new CNA 'long' data format contains a row for each gene x sample.
Thie new data format will import data using a new importer, the old data format is kept as is. Shared functionality should be extracted as much as possible.
Changes
- Add new importer
ImportCnaDiscreteLongData
for CNA long format - Extract shared functionality of
ImportTabDelimData
andImportCnaDiscreteLongData
intoCnaUtil
Database schema change
Tests
- Added new tests for
ImportCnaDiscreteLongData
inTestImportCnaDiscreteLongData
There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py
There is one more suggestion about this pull request, since this pr introduces a new data format, we should also add a new method to validate this new data format in validateData.py
Good point! The validator should be updated. I discussed it with @pvannierop @oplantalech and we would like to do this in a separate PR.
@averyniceday Yes, we restrict a study to have only one format. I do not see how the recognition of the new format will differ from any other data types. The meta file for discrete CNA will reference the format used (long or legacy wide). Or am I overlooking a complication here?
@pvannierop This might be specific to how we import on our end and how the importer is implemented (in conjunction with how we organize datatypes in google sheets). We currently have a one to one mapping of meta to data files (e.g. meta_cna to data_cna). We also have a step inside the importer that merges records for genes (I imagine this is implemented to be specific to the legacy format).
If we are not planning on introducing a new meta/data pairing for the long format (e.g. data_cna_long/meta_cna_long) and instead only differentiating by a field inside the metafile, the importer will need to be updated to be able to differentiate between the two based on the metafile contents.
@averyniceday The current wide format is still supported by this change (there is 100% backward compatibility). Therefore, I fail to see how this PR could possible interfere with your current processes. This PR merely adds a format. If you internally decide not to use this long format for the time being you should be fine.
And on a note of process, I think it is not really proper procedure to have merging of PRs to cBioPortal codebase to depend on update of internal tooling.
@pvannierop @BasLee Agreed, this doesn't need to hold up merging into the cbioportal codebase. 👍
Could you also add documentation for the new format in the File Formats section?
@averyniceday The documentation will be added for sure! We will do this in the PR that updates the python validator script.
@BasLee CNA long format validator merged, please feel free to rebase this pr.
@BasLee CNA long format validator merged, please feel free to rebase this pr.
Rebased and fixed some validation issues, most (relevant?) checks seem to run now
After some discussion with @pvannierop:
The new DISCRETE_LONG
format should probably not be passed to the front end, because the old 'wide' and the new 'long' CNA format should both result in exactly the same DISCRETE
data after the import.
Only the importer should have knowledge of the DISCRETE_LONG
format, and the importer should update it into DISCRETE
. All other CNA logic can just use the old DISCRETE
format.
These changes can be found in https://github.com/cBioPortal/cbioportal/pull/9847/commits/9c925246154d91e59103becab727746088fe978a