gemini
release prep
It's time to make a new release. This issue will track what needs to be done. Please add any essential (and only essential) updates as comments to this issue:
- [x] update clinvar
- [x] update CADD to v1.4
- [x] update dbsnp
- [x] update gnomad and controls and non-neuro control AFs (gnomad 2.1)
- [x] update dgidb url
- [x] for x-linked recessive exclude sites where the male is het and parents are hom ref. (see #903)
- [x] #843 (don't report x-linked DN's when one parent is missing).
- [ ] #902 update gene_summary and gene_detailed
presumably issue #903 is included as well as issue #843
When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?
I updated the list to include those items. thanks for the reminders
Taking de novos into account in the comp_het inheritance model would be nice!
I think it would be great to add support for de novos when reporting candidate CHs. The challenge is that one cannot be confident that it causes a true CH without read-backed phasing in the common case of trios. As such, this would have to either be part of the relaxed priority CH searches or a specific mode.
Yeah, the variant could be on the same allele as the inherited one and thus not be a real comp_het, but perhaps a new lenient mode, or a flag for these specific cases, would help. Also, this applies to diploid organisms (human specifically, in my case); for other ploidies, things could get ugly fast.
Agree it'd be nice. I've used this hack to get at the results:
Run `comp_hets --max-priority 2`, which will give you all potential CH pairs, including those that are de_novo + 2nd allele. This yields a lot of false positives, but you can winnow down the results by cross-referencing the gene list with the results of `de_novo` and looking for overlapping genes.
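The cross-referencing step could look roughly like the sketch below. It assumes both tools' output has been saved as tab-delimited text with a header row that includes a `gene` column; the helper names and the sample rows are made up for illustration and are not part of gemini.

```python
import csv
import io


def genes_in(tsv_text, gene_column="gene"):
    """Return the set of non-empty gene names in a tab-delimited table with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[gene_column] for row in reader if row.get(gene_column)}


def overlapping_genes(comp_het_tsv, de_novo_tsv, gene_column="gene"):
    """Genes that appear in both the comp_hets output and the de_novo output."""
    shared = genes_in(comp_het_tsv, gene_column) & genes_in(de_novo_tsv, gene_column)
    return sorted(shared)


# Made-up example tables; real output has many more columns.
comp_het_tsv = "gene\tpriority\nGENE_A\t2\nGENE_B\t1\nGENE_C\t2\n"
de_novo_tsv = "gene\tvariant_id\nGENE_C\t17\nGENE_D\t42\n"

print(overlapping_genes(comp_het_tsv, de_novo_tsv))
```

Genes that survive the intersection still need manual review, since the de novo and the inherited variant cannot be phased from trio genotypes alone.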
@jxchong, thank you for that hint!
I'll see what I can do about comp-het + DN. Though it can't be phased, it could be a special category.
@brentp https://github.com/arq5x/gemini/issues/911 and https://github.com/arq5x/gemini/issues/910 are two regressions from the 0.20 version.
Looks like dgidb-related functionality only requires a URL update to get it working again.
@wm75 can you make a PR or let me know the URL? I'm not familiar with dgidb.
Sure, just need to get my laptop to look it up.
Ok, so you'd use:
dgidb_url = 'http://www.dgidb.org/api/v2/interactions.json?genes='
to re-enable things, at least, that's the minimal change.
What I'd prefer though would be if URLs (the dgidb, the install-data url, others?) became part of the config file. This would give, e.g., Galaxy Admins a single place to locate these URLs and update them if they break at some point. I realize that is a slightly bigger change, but if you have a little bit of time before the release, I think it would help a lot with long-term support of any gemini install.
@bgruening we've been talking about this before. The above line is the patch to https://github.com/arq5x/gemini/blob/8a3e7571b15cec31f701161f6efe38bc624be028/gemini/dgidb.py#L28 that you'd have to apply to gemini on usegalaxy.eu to make the actionable mutations and the --dgidb option of the query tool work again.
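For the config-file idea, here is a minimal sketch of how a lookup with a built-in fallback might work. The `urls` section, the `resolve_url` helper, and the override URL are hypothetical and not existing gemini settings; only the default dgidb URL comes from this thread, and joining gene names with commas is an assumption about how the query string is built.

```python
# Built-in defaults, used when the config does not override a URL.
DEFAULT_URLS = {
    "dgidb_url": "http://www.dgidb.org/api/v2/interactions.json?genes=",
}


def resolve_url(name, config=None):
    """Prefer a URL from a (hypothetical) `urls` section of the config mapping."""
    urls = (config or {}).get("urls", {})
    return urls.get(name, DEFAULT_URLS[name])


def dgidb_query_url(genes, config=None):
    """Build the query URL; comma-joined gene names are an assumption."""
    return resolve_url("dgidb_url", config) + ",".join(genes)


print(dgidb_query_url(["BRAF", "KRAS"]))

# An admin could then repair a broken endpoint without patching the code:
override = {"urls": {"dgidb_url": "https://example.org/dgidb?genes="}}
print(dgidb_query_url(["BRAF"], override))
```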
great. thank you! so the dgidb stuff is getting used? If so, I can keep it as long as it's this simple URL change to fix.
@wm75 fixed on our server.
@bgruening it's working :smile:
@brentp yeah, we have an ongoing project with clinicians, for whom being able to get at gene-drug interactions is a major selling point. So I'd rather like to see more such functionality than less.
When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?
@jxchong are these the only additions from gnomad? I'll put a list of what I plan to add for final verification, but an initial set would be helpful.
Any possibility of updating CADD to the new v1.4?
I have added the CADD update to the list.
All, I have implemented the comp-het + denovo (CHDN) in a specific way and I'm looking for problems I might not have foreseen. Currently, the highest priority is 1. I have made a "good" CHDN candidate have a priority of 1.5 so one can filter with that (I will change the gemini parameter to accept floats).
A good candidate for kid, mom, dad would be:
# het inherited from mom:
A/T, A/T, A/A
# DN
C/T, C/C, C/C
A somewhat arbitrary requirement that I have added is that the DN HET cannot occur in any unaffected sample in the family. So, if we add an unaffected sib to the above trio so it's kid, mom, dad, sib and the sib also has the de novo:
# het inherited from mom only in proband:
A/T, A/T, A/A, A/A
# DN in both kids
C/T, C/C, C/C, C/T
this is not reported as a candidate since an unaffected (sib) shares the DN. This could miss cases where it was a germ-line mosaic, but I think it will also remove a lot of false-positives.
Any thoughts on this approach? I want to avoid extra flags and arguments and just provide a 95% solution here.
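To make the rule above concrete, here is a rough sketch of the check as described in this comment. It is not the actual gemini implementation: the data layout and function names are invented for illustration, genotypes are plain unphased strings, and both sites are assumed to fall in the same gene.

```python
def is_het(gt):
    """True for an unphased heterozygous genotype such as 'A/T'."""
    a, b = gt.split("/")
    return a != b


def is_hom_ref(gt, ref):
    """True when both alleles equal the reference allele."""
    return gt.split("/") == [ref, ref]


def is_inherited_het(site, kid, mom, dad):
    """Kid is het, exactly one parent is het, and the other parent is hom ref."""
    g, ref = site["genotypes"], site["ref"]
    if not is_het(g[kid]):
        return False
    from_mom = is_het(g[mom]) and is_hom_ref(g[dad], ref)
    from_dad = is_het(g[dad]) and is_hom_ref(g[mom], ref)
    return from_mom or from_dad


def is_de_novo_het(site, kid, mom, dad):
    """Kid is het while both parents are hom ref."""
    g, ref = site["genotypes"], site["ref"]
    return is_het(g[kid]) and is_hom_ref(g[mom], ref) and is_hom_ref(g[dad], ref)


def is_chdn_candidate(inherited, de_novo, kid, mom, dad, unaffected=()):
    """One inherited het plus one de novo het, where no unaffected sample carries the DN."""
    if not is_inherited_het(inherited, kid, mom, dad):
        return False
    if not is_de_novo_het(de_novo, kid, mom, dad):
        return False
    return not any(is_het(de_novo["genotypes"][s]) for s in unaffected)


# The trio from the example: kid, mom, dad.
inherited = {"ref": "A", "genotypes": {"kid": "A/T", "mom": "A/T", "dad": "A/A"}}
de_novo = {"ref": "C", "genotypes": {"kid": "C/T", "mom": "C/C", "dad": "C/C"}}
print(is_chdn_candidate(inherited, de_novo, "kid", "mom", "dad"))

# Same family plus an unaffected sib who also carries the de novo.
inherited_sib = {"ref": "A", "genotypes": dict(inherited["genotypes"], sib="A/A")}
de_novo_sib = {"ref": "C", "genotypes": dict(de_novo["genotypes"], sib="C/T")}
print(is_chdn_candidate(inherited_sib, de_novo_sib, "kid", "mom", "dad", unaffected=["sib"]))
```

The first call accepts the trio; the second rejects the pair because the unaffected sib shares the DN, which is the germ-line mosaic trade-off mentioned above.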
Another possible addition is gnomad exomes update to 2.1: http://gnomad.broadinstitute.org/downloads
Update -- never mind, just saw you already mentioned gnomad exomes in your first list.
@oleraj that's part of the gnomad item above. I'm testing that now. The file is much larger so I'm trying to cull it a bit.
Will this be a separate genetic model, or will it be part of the existing comp_hets? The approach seems fine to me for now (others should weigh in too).
it will be part of comp-het and not require any change (other than bumping the max-priority to > 1.5)
@brentp Approach seems fine to me as well. If you are adding CHDN, it'd be nice to add simple multi-gen support to the de_novo model as requested in #885 (i.e. de novo in generation 1, and passed down as autosomal dominant in child in generation 2) as de novo->AD is more common than CHDN. I think it'd be relatively simple to add without requiring a ton of flags because I think you could simply change the requirements such that in a given family, the variant must be de novo at least once (two unaffected parents who are not carriers and their carrier child) while allowing for affected offspring of that child to be het.
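One way to read the relaxed requirement is sketched below. This is only an interpretation of the proposal, not gemini code: the pedigree layout, the field names, and the extra condition that every carrier be affected are assumptions made for the example.

```python
def is_de_novo_event(name, fam):
    """Carrier whose two parents are both present, unaffected, and non-carriers."""
    s = fam[name]
    mom, dad = fam.get(s["mom"]), fam.get(s["dad"])
    if mom is None or dad is None:
        return False
    return s["carrier"] and not any(p["carrier"] or p["affected"] for p in (mom, dad))


def has_carrier_parent(name, fam):
    """True if at least one known parent carries the variant."""
    s = fam[name]
    return any(p in fam and fam[p]["carrier"] for p in (s["mom"], s["dad"]))


def passes_multigen_de_novo(fam):
    """Variant is de novo at least once; every carrier is affected and explained."""
    carriers = [n for n, s in fam.items() if s["carrier"]]
    if not any(is_de_novo_event(n, fam) for n in carriers):
        return False
    return all(
        fam[n]["affected"] and (is_de_novo_event(n, fam) or has_carrier_parent(n, fam))
        for n in carriers
    )


def sample(carrier, affected, mom=None, dad=None):
    return {"carrier": carrier, "affected": affected, "mom": mom, "dad": dad}


# De novo in generation 1 ("parent"), passed on to generation 2 ("grandchild").
family = {
    "grandma": sample(False, False),
    "grandpa": sample(False, False),
    "parent": sample(True, True, mom="grandma", dad="grandpa"),
    "spouse": sample(False, False),
    "grandchild": sample(True, True, mom="parent", dad="spouse"),
}
print(passes_multigen_de_novo(family))

# If grandpa is a carrier, the variant is never de novo in this family.
inherited_family = dict(family, grandpa=sample(True, False))
print(passes_multigen_de_novo(inherited_family))
```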
@jxchong would you have a look at: https://github.com/arq5x/gemini/commit/513f53fe2a2e07c25af6e7dd27db74ac25388339 and verify it meets your needs for the non-neuro and controls AFs? I just took those 2 fields from the VCF and added them to the database.
@brentp commit 513f53f looks good. Do you think others might have a use for the non-cancer control set too? Perhaps yes, especially when analyzing mosaic variants? @oleraj
Is anyone against removing ESP? Given its size relative to newer resources, it would drop a few columns...