
Sample profile count disparity

Open alisman opened this issue 4 months ago • 0 comments

The legacy query uses gene panel data. When there is no gene panel (WES?), the join still produces a row per sample, even though both sample_id and panel_id are NULL in that row. Perhaps we always get a row per sample, in which case the join does not limit the returned set. This sometimes differs from a direct query of the sample_profile table, which returns a subset of the samples.

    SELECT sample_id, sample_profile.panel_id
    FROM sample
        INNER JOIN patient ON sample.patient_id = patient.internal_id
        INNER JOIN cancer_study ON patient.cancer_study_id = cancer_study.cancer_study_id
        LEFT JOIN genetic_profile ON cancer_study.cancer_study_id = genetic_profile.cancer_study_id
        LEFT JOIN sample_profile ON sample_profile.genetic_profile_id = genetic_profile.genetic_profile_id
                                 AND sample.internal_id = sample_profile.sample_id
        LEFT JOIN gene_panel ON sample_profile.panel_id = gene_panel.internal_id
    WHERE genetic_profile.stable_id='brain_cptac_2020_mutations'

For example, the direct query of sample_profile for the same profile:

    SELECT * FROM sample_profile
        JOIN genetic_profile gp ON sample_profile.genetic_profile_id = gp.genetic_profile_id
    WHERE gp.stable_id='brain_cptac_2020_mutations'

The question is: which is correct as a measure of whether a given sample is profiled? The legacy query discards any information in sample_profile. What I don't understand is how the legacy query could ever return a subset. Since it uses left joins, it would seem there will always be a row per sample, whether or not there is a matching gene panel. And yet some profiles return a subset.
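
A minimal sketch of the disparity, using Python's sqlite3 with a toy schema. The table and column names follow the two queries above, but the schema is trimmed to the columns those queries touch, and the data (three samples, two sample_profile rows, a made-up stable id toy_study_mutations) is invented for illustration. It is not the real cBioPortal schema or data, and SQLite may behave differently from the production database in other respects.

```python
import sqlite3

# Toy, trimmed-down schema: an assumption for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cancer_study (cancer_study_id INTEGER PRIMARY KEY);
CREATE TABLE patient (internal_id INTEGER PRIMARY KEY, cancer_study_id INTEGER);
CREATE TABLE sample (internal_id INTEGER PRIMARY KEY, patient_id INTEGER);
CREATE TABLE genetic_profile (
    genetic_profile_id INTEGER PRIMARY KEY,
    stable_id TEXT,
    cancer_study_id INTEGER
);
CREATE TABLE gene_panel (internal_id INTEGER PRIMARY KEY);
CREATE TABLE sample_profile (
    sample_id INTEGER,
    genetic_profile_id INTEGER,
    panel_id INTEGER
);

INSERT INTO cancer_study VALUES (1);
INSERT INTO patient VALUES (10, 1), (11, 1), (12, 1);
INSERT INTO sample VALUES (100, 10), (101, 11), (102, 12);
INSERT INTO genetic_profile VALUES (1000, 'toy_study_mutations', 1);
-- Only two of the three samples are profiled, and neither has a gene panel.
INSERT INTO sample_profile VALUES (100, 1000, NULL), (101, 1000, NULL);
""")

STABLE_ID = "toy_study_mutations"  # hypothetical profile id

# Same join shape as the legacy query; sample.internal_id is added to the
# SELECT list so the unprofiled sample is visible in the output.
legacy_rows = conn.execute("""
    SELECT sample.internal_id, sample_profile.sample_id, sample_profile.panel_id
    FROM sample
        INNER JOIN patient ON sample.patient_id = patient.internal_id
        INNER JOIN cancer_study ON patient.cancer_study_id = cancer_study.cancer_study_id
        LEFT JOIN genetic_profile ON cancer_study.cancer_study_id = genetic_profile.cancer_study_id
        LEFT JOIN sample_profile ON sample_profile.genetic_profile_id = genetic_profile.genetic_profile_id
                                 AND sample.internal_id = sample_profile.sample_id
        LEFT JOIN gene_panel ON sample_profile.panel_id = gene_panel.internal_id
    WHERE genetic_profile.stable_id = ?
    ORDER BY sample.internal_id
""", (STABLE_ID,)).fetchall()

# Direct query of sample_profile for the same profile.
direct_rows = conn.execute("""
    SELECT sample_profile.sample_id, sample_profile.panel_id
    FROM sample_profile
        JOIN genetic_profile gp ON sample_profile.genetic_profile_id = gp.genetic_profile_id
    WHERE gp.stable_id = ?
    ORDER BY sample_profile.sample_id
""", (STABLE_ID,)).fetchall()

print("legacy:", legacy_rows)
print("direct:", direct_rows)
```

In this toy data the left-join query returns three rows, one per sample in the study, with the unprofiled sample carrying NULL for both sample_id and panel_id, while the direct query returns only the two profiled samples. The sketch reproduces the "row per sample" behavior only; it does not explain how the legacy query could return fewer rows than there are samples.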

alisman, Oct 22 '24 02:10