sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Consider standardizing the API of pca and pc_relate

Open timothymillar opened this issue 2 years ago • 4 comments

These were implemented early on, before the schema/variables where standardized. I'm not sure if there's any appetite to update them, but here are a few observations:

  • Both use custom methods to calculate an equivalent to call_dosage.

    • pca creates an undocumented variable called "call_alternate_allele_count" from call_allele_count (line)
    • pc_relate internally creates alternate allele counts directly from the call_genotype (line)
    • They would likely be more flexible if defined in terms of the call_dosage variable
    • call_dosage could be automatically computed from call_genotype for backwards compatibility, see #282
  • pc_relate does its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?

I was looking at the implementation after considering having "pc-relate" as an estimator option in genomic_relationship. AFAICT this should work so long as a sample_pc parameter was added (similar to ancestral_frequency for the VanRaden estimator). This would return relationships (as opposed to pc_relate_phi which is an estimate of kinship).

timothymillar avatar Apr 27 '23 08:04 timothymillar

cc @ravwojdyla and @eric-czech in case they have thoughts

hammer avatar Apr 27 '23 12:04 hammer

pc_relate does its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?

@timothymillar the sgkit pc_relate was based of off the Hail implementation (https://github.com/pystatgen/sgkit/issues/24) also compared to GENESIS, which both handle MAF filter. +1 to standardizing, please let me know if I can help in any way.

ravwojdyla avatar May 01 '23 01:05 ravwojdyla

pca creates an undocumented variable called "call_alternate_allele_count" from call_allele_count

+1 to standardizing as much as possible as well, but it will be important to keep in mind that the default scaler for pca only works with true alt allele counts. The relevant docs for it: https://github.com/pystatgen/sgkit/blob/0bccf6a660d04c95c1ac73a2408cc0deb11f27a9/sgkit/stats/preprocessing.py#L15-L23

IIRC this was to replicate scikit-allel PCA and get close to parity with hwe_normalized_pca by default. If we wanted to make this more generic so that it could run on dosages, then I would suggest this:

  1. We leave pca as it is
  2. We add two overload functions like decompose_call_alternate_alleles and decompose_dosages (naming consistent with sgkit.stats.aggregation as well as familiar for scikit-learn users) that take away the choice for variable and scaler, and call pca with appropriate settings for them

please let me know if I can help in any way.

Ditto! Let me know if any of that isn't clear or you see a better way.

eric-czech avatar May 01 '23 12:05 eric-czech

Thanks @eric-czech, decompose_dosages is a good name. Just a thought, the call_dosage variable can optionally be an integer type. So, we could potentially still use it with the PattersonScaler and raise an error if it contains floats. Integer alternate allele counts for biallelic variants can already be computed using convert_call_to_index.

So, we could have a general API where the default variable is call_dosage. If absent, call_dosage is automatically computed as an integer array using convert_call_to_index. An error is raised If the patterson scalar is used and call_dosage is already an array of floats.

timothymillar avatar May 12 '23 02:05 timothymillar