Additional review needed for the identification of WGS/WES samples in clickhouse table development

Open sheridancbio opened this issue 8 months ago • 1 comments

This relates to cases where a study contains a sample that appears to be part of a genetic profile, but either the sample is absent from data_gene_matrix.txt, or the sample is present in data_gene_matrix.txt with a gene panel id that is 'NA' or missing.
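To make the two cases concrete, here is a minimal sketch that separates them for one profile column. It assumes data_gene_matrix.txt is tab-delimited with a SAMPLE_ID column and one column per profile; the function name, the sample ids, and the panel id are hypothetical, and this is not the importer's actual logic.

```python
import csv
import io


def find_unassigned_samples(matrix_text, profile_column, profiled_samples):
    """Split profiled samples into the two problem cases described above.

    Returns (absent, na_or_missing):
      absent        - samples with no row at all in the gene matrix
      na_or_missing - samples with a row whose panel id is 'NA' or empty
    """
    reader = csv.DictReader(io.StringIO(matrix_text), delimiter="\t")
    # Map each sample id to its (possibly empty) panel id for this profile.
    panel_by_sample = {
        row["SAMPLE_ID"]: (row.get(profile_column) or "").strip()
        for row in reader
    }
    absent = sorted(s for s in profiled_samples if s not in panel_by_sample)
    na_or_missing = sorted(
        s for s in profiled_samples
        if s in panel_by_sample and panel_by_sample[s] in ("", "NA")
    )
    return absent, na_or_missing


# Hypothetical gene matrix content: S2 is 'NA', S3 is empty, S4 has no row.
matrix_text = "SAMPLE_ID\tmutations\nS1\tIMPACT468\nS2\tNA\nS3\t\n"
absent, na_or_missing = find_unassigned_samples(
    matrix_text, "mutations", {"S1", "S2", "S3", "S4"}
)
```

Keeping the two lists separate matters here because the open question is whether these cases should end up with the same representation in sample_profile.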

Translation of raw cbioportal database tables into derived clickhouse tables (e.g. sample_to_gene_panel_derived)

Scripts have been developed to produce flattened tables and views for the clickhouse development effort currently underway. See: https://github.com/cBioPortal/cbioportal/blob/79d36e73f1aeff6d0ab4697e77aa210752772ad6/src/main/resources/db-scripts/clickhouse/clickhouse.sql#L17

These scripts attempt to connect the PANEL_ID field from the sample_profile table to the panels present in the gene_panel table. If there is no connecting gene panel, the value 'WES' is used in place of the (missing) gene panel stable id. This logic should be considered together with the discussion around #10871: 'NA' values might or might not be present in data_gene_matrix.txt, and the resulting import might or might not introduce a record into sample_profile, depending on whether detected non-silent mutations were imported into the mutations table for the sample.
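The fallback described above can be sketched as a left join with a default value. This is an illustration only, run against an in-memory SQLite database rather than ClickHouse; the table and column names are simplified assumptions, the rows are made up, and the query is not copied from clickhouse.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the real tables (assumed columns).
    CREATE TABLE gene_panel (internal_id INTEGER PRIMARY KEY, stable_id TEXT);
    CREATE TABLE sample_profile (
        sample_id TEXT,
        genetic_profile_id TEXT,
        panel_id INTEGER
    );
    INSERT INTO gene_panel VALUES (1, 'IMPACT468');
    -- S1 has a panel; S2 has a sample_profile row but no panel.
    INSERT INTO sample_profile VALUES
        ('S1', 'study_mutations', 1),
        ('S2', 'study_mutations', NULL);
    """
)

# Left join: a row with no matching gene panel falls back to 'WES'.
rows = conn.execute(
    """
    SELECT sp.sample_id, COALESCE(gp.stable_id, 'WES') AS gene_panel_id
    FROM sample_profile sp
    LEFT JOIN gene_panel gp ON sp.panel_id = gp.internal_id
    ORDER BY sp.sample_id
    """
).fetchall()
```

In this sketch S2 is labeled 'WES' only because its panel id is null, so a genuinely WGS/WES sample and a sample whose panel was recorded as 'NA' would be indistinguishable. A sample with no sample_profile row at all would not appear in the result.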

Once the expected data representation in sample_profile is determined and specified for WGS/WES and for non-profiled samples, the logic in these scripts should be examined and updated if necessary.

sheridancbio, Jun 27 '24 20:06