somalier icon indicating copy to clipboard operation
somalier copied to clipboard

Inferring sex in 0.2.19

Open fellen31 opened this issue 1 year ago • 4 comments

Hi Brent,

I'm having trouble inferring sex after 7740a33. It should be a high quality sample but n_hom_alt / n_het ~ 0.83, and sex is therefore not updated I think.

https://github.com/brentp/somalier/blob/7740a33a2a0ae35302d223e9dbf14f7e2c353cd5/src/somalierpkg/relate.nim#L460-L461

family_id sample_id paternal_id maternal_id sex phenotype original_pedigree_sex gt_depth_mean gt_depth_sd depth_mean depth_sd ab_mean ab_std n_hom_ref n_het n_hom_alt n_unknown p_middling_ab X_depth_mean X_n X_hom_ref X_het X_hom_alt Y_depth_mean Y_n
- -  0 0 0 2 unknown 31.8 8.9 31.8 9.0 0.52 0.39 4691 6696 5556 441 0.010 17.26 349 168 0 181 18.73 15

I also tested another sample reaching a similar ratio (~0.83). Could you help me understand?

Thanks, Felix

fellen31 avatar Aug 13 '24 08:08 fellen31

Hi, that does seem like a lot of hom-alts. I suppose we could bump 0.7 to another arbitrary cutoff like 0.85, but I wonder if something else is going on with your sample. When you look at the somalier html output, is it an outlier in the plots? Perhaps I should separate out the relatedness inference from the sex inference since the sex inference is quite easy.

brentp avatar Aug 13 '24 14:08 brentp

Thanks for getting back so quickly, I ran the same sample together with 11 others, they all have ratios ranging from 0.81 to 0.93 except one outlier with 1.06. I can't really see anything strange in the plots otherwise, but I might be missing something.

If possible I think that would be great. I'd like to be able to make a best guess for samples with unknown sex to be able to start variant calling and other downstream processes, then once finished the user can go back and look at the html and check whether the inference was correct or not.

fellen31 avatar Aug 15 '24 09:08 fellen31

were the counts for the sample extracted directly from the bam files? if not, how was the VCF called?

brentp avatar Aug 15 '24 13:08 brentp

Directly from the bam files, using sites.hg38.vcf.gz and GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. It's PacBio long-read data.

fellen31 avatar Aug 15 '24 13:08 fellen31