Consider standardizing the API of pca and pc_relate
These were implemented early on, before the schema/variables where standardized. I'm not sure if there's any appetite to update them, but here are a few observations:
-
Both use custom methods to calculate an equivalent to
call_dosage.pcacreates an undocumented variable called"call_alternate_allele_count"fromcall_allele_count(line)- pc_relate internally creates alternate allele counts directly from the
call_genotype(line) - They would likely be more flexible if defined in terms of the
call_dosagevariable call_dosagecould be automatically computed fromcall_genotypefor backwards compatibility, see #282
-
pc_relatedoes its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?
I was looking at the implementation after considering having "pc-relate" as an estimator option in genomic_relationship. AFAICT this should work so long as a sample_pc parameter was added (similar to ancestral_frequency for the VanRaden estimator). This would return relationships (as opposed to pc_relate_phi which is an estimate of kinship).
cc @ravwojdyla and @eric-czech in case they have thoughts
pc_relate does its own filtering by MAF. This doesn't seem to be the approach taken elsewhere, but I'm not sure that we have a "standard" approach to filtering?
@timothymillar the sgkit pc_relate was based of off the Hail implementation (https://github.com/pystatgen/sgkit/issues/24) also compared to GENESIS, which both handle MAF filter. +1 to standardizing, please let me know if I can help in any way.
pca creates an undocumented variable called "call_alternate_allele_count" from call_allele_count
+1 to standardizing as much as possible as well, but it will be important to keep in mind that the default scaler for pca only works with true alt allele counts. The relevant docs for it: https://github.com/pystatgen/sgkit/blob/0bccf6a660d04c95c1ac73a2408cc0deb11f27a9/sgkit/stats/preprocessing.py#L15-L23
IIRC this was to replicate scikit-allel PCA and get close to parity with hwe_normalized_pca by default. If we wanted to make this more generic so that it could run on dosages, then I would suggest this:
- We leave
pcaas it is - We add two overload functions like
decompose_call_alternate_allelesanddecompose_dosages(naming consistent withsgkit.stats.aggregationas well as familiar for scikit-learn users) that take away the choice forvariableandscaler, and callpcawith appropriate settings for them
please let me know if I can help in any way.
Ditto! Let me know if any of that isn't clear or you see a better way.
Thanks @eric-czech, decompose_dosages is a good name. Just a thought, the call_dosage variable can optionally be an integer type. So, we could potentially still use it with the PattersonScaler and raise an error if it contains floats. Integer alternate allele counts for biallelic variants can already be computed using convert_call_to_index.
So, we could have a general API where the default variable is call_dosage. If absent, call_dosage is automatically computed as an integer array using convert_call_to_index. An error is raised If the patterson scalar is used and call_dosage is already an array of floats.