Populate parent_id variable in `read_plink`

Open tomwhite opened this issue 2 years ago • 2 comments

We can use the family information from plink (https://www.cog-genomics.org/plink/1.9/formats#fam) to populate the sgkit parent_id variable (https://pystatgen.github.io/sgkit/latest/generated/sgkit.variables.parent_id_spec.html).

tomwhite, Jan 30 '23 11:01
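As a rough sketch of the idea (not sgkit's implementation), the father and mother columns of a .fam file could be collected into a per-sample list of parent IDs. The helper name `parse_fam` and the sample data are hypothetical; the column order follows the PLINK 1.9 .fam description linked above (family ID, within-family ID, father ID, mother ID, sex, phenotype).

```python
def parse_fam(text):
    """Hypothetical helper: split .fam text into sample IDs and parent IDs."""
    sample_id = []
    parent_id = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # .fam columns: FID, IID, father IID, mother IID, sex, phenotype
        _fid, iid, father, mother = fields[:4]
        sample_id.append(iid)
        # "0" in a parent column means that parent is not in the dataset
        parent_id.append([father, mother])
    return sample_id, parent_id


# Made-up trio: two founders (S1, S2) and their child (S3)
fam_text = """\
F1 S1 0 0 1 -9
F1 S2 0 0 2 -9
F1 S3 S1 S2 1 -9
"""
sample_id, parent_id = parse_fam(fam_text)
```

The IDs are kept as strings here, so no assumption is made about whether they are numeric.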

Need to be careful about potential zero and one-based differences here. What do we currently do with plink-like pedigrees @timothymillar?

jeromekelleher, Jan 30 '23 17:01

> What do we currently do with plink-like pedigrees

Currently we don't offer any pedigree IO, partly because there are a lot of formats! I've opened #1012 to document some generic examples, but it makes sense to have built-in support for our primary formats (plink and vcf). It looks like the plink definitions should be fine to use.

> careful about potential zero and one-based differences here

In theory this should already be handled correctly. The plink ids are 1-based and go into the sample_id and parent_id arrays. These are then used as generic hashable keys to generate the pedigree array of 0-based indices. This should only require specifying 0 as the 'missing' sample ID (which is translated to -1 in the parent array).

timothymillar, Jan 30 '23 22:01
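The mapping described in the last comment can be illustrated with a minimal pure-Python sketch. This is an assumption-laden illustration rather than sgkit's code: the function name `to_parent_indices` and the example data are made up, and IDs are treated as opaque hashable keys, with "0" as the missing-parent marker.

```python
def to_parent_indices(sample_id, parent_id, missing="0"):
    """Hypothetical helper: turn parent IDs into 0-based sample indices.

    IDs are only used as dictionary keys, so it does not matter whether
    they look like 1-based numbers or arbitrary strings.
    """
    # Position of each sample in the sample_id list (0-based)
    index_of = {sid: i for i, sid in enumerate(sample_id)}
    return [
        # The missing marker becomes -1; any other ID must be a known sample
        # (an ID that is not in sample_id raises KeyError in this sketch)
        [-1 if p == missing else index_of[p] for p in parents]
        for parents in parent_id
    ]


# Made-up trio using numeric-looking IDs, as in a typical .fam file
sample_id = ["1", "2", "3"]
parent_id = [["0", "0"], ["0", "0"], ["1", "2"]]
parent = to_parent_indices(sample_id, parent_id)
# parent == [[-1, -1], [-1, -1], [0, 1]]
```

Because the lookup goes through the dictionary, sample "1" maps to index 0 without any explicit subtraction, which is why an off-by-one error is not expected as long as "0" is reserved for missing parents.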