sgkit
Populate parent_id variable in `read_plink`
We can use the family information from plink (https://www.cog-genomics.org/plink/1.9/formats#fam) to populate the sgkit parent_id variable (https://pystatgen.github.io/sgkit/latest/generated/sgkit.variables.parent_id_spec.html).
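As a rough illustration of the idea, the sketch below pulls the sample and parent IDs out of `.fam`-style text using only the standard library. The helper name `read_fam_parents` and the sample data are hypothetical, and this is not how `read_plink` is implemented. It assumes the six whitespace-delimited columns described in the plink docs (family ID, within-family ID, father ID, mother ID, sex, phenotype) and that within-family IDs are unique across the whole file.

```python
import io


def read_fam_parents(handle):
    """Return (sample_id, parent_id) from whitespace-delimited .fam lines.

    Hypothetical helper for illustration only.
    """
    sample_id = []
    parent_id = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Columns: family ID, within-family ID, father ID, mother ID, ...
        _family, individual, father, mother = fields[:4]
        sample_id.append(individual)
        parent_id.append([father, mother])
    return sample_id, parent_id


# Made-up trio: two founders ("0" marks an unknown parent) and one child.
fam_text = """\
F1 S1 0 0 1 -9
F1 S2 0 0 2 -9
F1 S3 S1 S2 1 -9
"""

sample_id, parent_id = read_fam_parents(io.StringIO(fam_text))
```

The IDs are kept as strings here, so the `"0"` entries still need to be treated as the missing value when the pedigree is built.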
Need to be careful about potential zero and one-based differences here. What do we currently do with plink-like pedigrees, @timothymillar?
> What do we currently do with plink-like pedigrees
Currently we don't offer any pedigree IO, partly because there are a lot of formats! I've opened #1012 to document some generic examples, but it makes sense to have built-in support for our primary formats (plink and vcf). It looks like the plink definitions should be fine to use.
> careful about potential zero and one-based differences here
In theory this should already be handled correctly. The plink ids are 1-based and go into the sample_id and parent_id arrays. These are then used as generic hashable keys to generate the pedigree array of 0-based indices. This should only require specifying 0 as the 'missing' sample ID (which is translated to -1 in the parent array).
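To make that concrete, here is a minimal pure-Python sketch of the ID-to-index translation described above. The function name `ids_to_parent_indices` is made up for this example and is not sgkit's implementation. It assumes every non-missing parent ID also appears in `sample_id`.

```python
def ids_to_parent_indices(sample_id, parent_id, missing="0"):
    """Map parent IDs to 0-based sample indices, with -1 for missing parents.

    Hypothetical helper for illustration only.
    """
    # IDs are only used as hashable keys; their own numbering is irrelevant.
    index_of = {sid: i for i, sid in enumerate(sample_id)}
    if missing in index_of:
        raise ValueError(f"missing value {missing!r} is also a sample ID")
    return [
        # A parent ID that is not a sample raises KeyError in this sketch.
        [-1 if pid == missing else index_of[pid] for pid in row]
        for row in parent_id
    ]


# Plink-style IDs starting at "1", with "0" meaning "no parent recorded".
sample_id = ["1", "2", "3"]
parent_id = [["0", "0"], ["0", "0"], ["1", "2"]]

parent = ids_to_parent_indices(sample_id, parent_id, missing="0")
print(parent)
```

Sample `"3"` ends up with parents at indices 0 and 1, so the 1-based IDs never leak into the index array.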