Inferring sex in 0.2.19
Hi Brent,
I'm having trouble inferring sex after 7740a33. It should be a high quality sample but n_hom_alt / n_het ~ 0.83, and sex is therefore not updated I think.
https://github.com/brentp/somalier/blob/7740a33a2a0ae35302d223e9dbf14f7e2c353cd5/src/somalierpkg/relate.nim#L460-L461
| family_id | sample_id | paternal_id | maternal_id | sex | phenotype | original_pedigree_sex | gt_depth_mean | gt_depth_sd | depth_mean | depth_sd | ab_mean | ab_std | n_hom_ref | n_het | n_hom_alt | n_unknown | p_middling_ab | X_depth_mean | X_n | X_hom_ref | X_het | X_hom_alt | Y_depth_mean | Y_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | - | 0 | 0 | 0 | 2 | unknown | 31.8 | 8.9 | 31.8 | 9.0 | 0.52 | 0.39 | 4691 | 6696 | 5556 | 441 | 0.010 | 17.26 | 349 | 168 | 0 | 181 | 18.73 | 15 |
I also tested another sample reaching a similar ratio (~0.83). Could you help me understand?
Thanks, Felix
Hi, that does seem like a lot of hom-alts. I suppose we could bump 0.7 to another arbitrary cutoff like 0.85, but I wonder if something else is going on with your sample. When you look at the somalier html output, is it an outlier in the plots? Perhaps I should separate out the relatedness inference from the sex inference since the sex inference is quite easy.
Thanks for getting back so quickly, I ran the same sample together with 11 others, they all have ratios ranging from 0.81 to 0.93 except one outlier with 1.06. I can't really see anything strange in the plots otherwise, but I might be missing something.
If possible I think that would be great. I'd like to be able to make a best guess for samples with unknown sex to be able to start variant calling and other downstream processes, then once finished the user can go back and look at the html and check whether the inference was correct or not.
were the counts for the sample extracted directly from the bam files? if not, how was the VCF called?
Directly from the bam files, using sites.hg38.vcf.gz and GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. It's PacBio long-read data.