Clarification needed on how to handle missing gene panel identifiers in data files

Open sheridancbio opened this issue 8 months ago • 1 comments

This relates to cases where a study contains a sample which appears to be part of a genetic profile, but the sample is not present in data_gene_matrix.txt, or the gene panel id value is 'NA' or missing for a sample which is present in data_gene_matrix.txt.

Import into the raw cBioPortal data tables (i.e. sample_profile)

During import into the cBioPortal database, the values from data_gene_matrix.txt are loaded into the sample_profile table. The file format documentation (https://docs.cbioportal.org/file-formats/#gene-panel-matrix-file) gives this direction: "When the sample is not profiled on a gene panel, or if the sample is not profiled at all, use NA as value. If the sample is profiled for mutations, make sure it is also in the _sequenced case list." I think this specification should be clarified. My reading of it is that:

  • every sample in the study should appear in data_gene_matrix.txt
  • for each genetic_profile (mutations, cna, sv) there is a column to hold gene panel identifiers
  • every cell should be populated, either with the gene panel stable id of a known (or importable) gene panel, or with 'NA'
  • the value 'NA' should be used for non-profiled samples
  • the value 'NA' should be used for mutation profiles when the sequencing was not targeted/paneled (WGS/WES)
  • a sample counts as sequenced for mutations if and only if it appears in the _sequenced case list (case_lists/cases_sequenced.txt)

I would expect these conditions to be flagged as errors during validation:

  • a sample identifier is present in data_clinical_sample.txt but not in data_gene_matrix.txt
  • any cell is empty in data_gene_matrix.txt
  • any cell in any profile column contains a value other than:
    • 'NA'
    • the stable identifier of a gene panel already loaded into the database
    • the stable identifier of a gene panel which is available for importation from the files which make up this cancer study
  • any cell in the "mutations" profile column has a value other than 'NA' for any sample whose stable id is not listed in case_lists/cases_sequenced.txt. (perhaps a similar rule applies to CNA)
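The checks listed above could be sketched roughly as follows. This is only an illustration of the proposed rules, not the existing cBioPortal validator: the function name validate_gene_matrix and its parameters are hypothetical, and the sketch assumes a tab-separated matrix with a SAMPLE_ID column followed by one column per profile, with the mutation profile column named "mutations". The known_panel_ids argument stands for the union of panels already in the database and panels importable from the study files.

```python
import csv
import io


def validate_gene_matrix(matrix_text, clinical_sample_ids, known_panel_ids, sequenced_sample_ids):
    """Return error messages for the proposed rules (hypothetical helper)."""
    errors = []
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    # Assumption: every column other than SAMPLE_ID holds gene panel ids for one profile.
    profile_columns = [c for c in (reader.fieldnames or []) if c != "SAMPLE_ID"]
    seen = set()
    for row in reader:
        sample_id = row["SAMPLE_ID"]
        seen.add(sample_id)
        for column in profile_columns:
            value = (row.get(column) or "").strip()
            if value == "":
                errors.append(f"{sample_id}: empty cell in column '{column}'")
            elif value != "NA" and value not in known_panel_ids:
                errors.append(f"{sample_id}: unknown gene panel '{value}' in column '{column}'")
            elif column == "mutations" and value != "NA" and sample_id not in sequenced_sample_ids:
                errors.append(f"{sample_id}: has a mutations panel but is not in cases_sequenced")
    # Every clinical sample must appear in the matrix.
    for sample_id in sorted(set(clinical_sample_ids) - seen):
        errors.append(f"{sample_id}: in clinical sample file but missing from gene matrix")
    return errors
```

Called with the file contents and three sets of identifiers, it returns an empty list when the study satisfies all four rules.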

If my understanding is correct, I think the documentation should be made more specific so that it asserts these rules clearly. Additionally, the importer codebase should be tested. Currently it appears to be permissible for samples to be absent from data_gene_matrix.txt, and import can still succeed. The result for a sample which is not mentioned in data_gene_matrix.txt seems to depend on whether or not detected mutation events are present in data_mutations_extended.txt: samples which are listed in case_lists/cases_sequenced.txt, but which have no detected mutation events and are not listed in data_gene_matrix.txt, are imported into the database (sample_profile) without a recorded gene panel and appear to be unsequenced in certain contexts. Importer unit tests should be written for all condition combinations (presence/absence in data_gene_matrix.txt; mutations column value (NA / valid_panel / invalid_panel_id); presence/absence in case_lists/cases_sequenced.txt; samples with/without detected importable (non-silent) mutations in data_mutations_extended.txt), and the business logic should be adjusted to handle each test case properly. The validator should also be updated to enforce these requirements.
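As a rough sketch of how the test matrix could be enumerated (the names TestCase and enumerate_cases are hypothetical and not part of the importer), the four conditions can be combined with itertools.product. The sketch assumes that the mutations column value is only meaningful when the sample is present in data_gene_matrix.txt, so absent samples get a value of None.

```python
from itertools import product
from typing import NamedTuple, Optional


class TestCase(NamedTuple):
    in_gene_matrix: bool             # sample has a row in data_gene_matrix.txt
    mutations_value: Optional[str]   # None when the sample has no row
    in_cases_sequenced: bool         # listed in case_lists/cases_sequenced.txt
    has_mutations: bool              # importable events in data_mutations_extended.txt


def enumerate_cases():
    """List every combination of the four conditions (illustrative only)."""
    cases = []
    for in_matrix, in_sequenced, has_mutations in product((True, False), repeat=3):
        # A cell value can only be tested when the sample has a row.
        values = ("NA", "valid_panel", "invalid_panel_id") if in_matrix else (None,)
        for value in values:
            cases.append(TestCase(in_matrix, value, in_sequenced, has_mutations))
    return cases
```

Under these assumptions the list has 16 entries: 12 for samples present in the matrix and 4 for samples absent from it. Each entry could drive one parameterized importer test.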

Another thing to be specified is what representation should be present (if any) in the database table sample_profile for a sample which was:

  • not sequenced, or
  • sequenced with WGS/WES

The PANEL_ID field is an integer. If a sample was not sequenced, should it be present in or absent from sample_profile, and if present, should the PANEL_ID value be null? If a sample was sequenced with WGS/WES, should it be present, and if so, what value should PANEL_ID hold?

sheridancbio avatar Jun 27 '24 19:06 sheridancbio