gatk
gatk copied to clipboard
New Tool: Reference Comparator
We need a tool to compare multiple references and spit out a TSV (or similar) detailing what the differences are. Additionally it should be able to spit out a liftover file that will properly move a variant from one reference to another.
We should first compare the sequence dictionaries in the references to see if they have equal lengths and checksums - the names may differ and we should track this so we can definitively say which contigs are equivalent. After this, we should walk the references and find out specifically which bases differ between contigs that have different checksums (with some limits on the number of differences between them so we don't get bogged down by hg19
vs hg38
comparisons).
Then it should create a liftover file from those comparisons so the data can be easily converted between the references given.
Additionally, it should be able to take a variant file and a set of references and say:
- whether the variant file "belongs" to one of the given references
- if it isn't exactly from one of the given references, which reference is closest
- optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)
This will finally lay to rest the questions raised by my blog post about "HG19".
I believe Adam Phillipy had created a perl script that does something similar to this, but a brief view of his github page doesn't show anything like that anymore (maybe it was called refdiff
or similar).
I created a bash script that does something similar to this (see attached), but it only looks at the sequence dictionaries. It produces a table similar to that in the above blog post. For example:
MD5 | HG38(Homo_sapiens_assembly38.dict) | HG38_WEIRD(genome.hg38rg.fa.dict) |
---|---|---|
1e95e047b98ed92148dd84d6c037158c | chr1_KI270708v1_random | 1_KI270708v1_random |
42f7a452b8b769d051ad738ee9f00631 | chr1_KI270714v1_random | 1_KI270714v1_random |
4e2db2933ea96aee8dab54af60ecb37d | chr1_KI270709v1_random | 1_KI270709v1_random |
62def1a794b3e18192863d187af956e6 | chr1_KI270706v1_random | 1_KI270706v1_random |
6aef897c3d6ff0c78aff06ac189178dd | chr1 | 1 |
78135804eb15220565483b7cdd02f3be | chr1_KI270707v1_random | 1_KI270707v1_random |
... |
compareTwoReferenceDictionaries.tar.gz
@droazen @bhanugandham - we can discuss what other features we would want for this.
Good idea. Another use case for this would be to take a bam/cram and set of references, and see if there is a reference suitable for the bam/cram, which would super useful for finding the reference for an existing CRAM, and also for finding a suitable reference to use to do a bam->cram conversion (which comes up a lot when adding cram tests).
This is great! Thanks for putting this together Jonn.
Could we also add the option to compare a vcf and a reference to see if the vcf was generated using that reference? Sometime users have vcfs from a previous study and don't know for sure if they are using the right ref. We see this often. Users want to use post-variantcalling tools and end up getting weird errors due to a wrong reference.
@KevinCLydon Reassigning this one to you, as discussed.
@jonn-smith I am working on this ticket and was looking for the script "compareTwoReferenceDictionaries.tar.gz" you uploaded on the ticket. Looks like this is not the actual file but it gives a soft link with a path on your laptop. Could you point me to the script? Thanks Kishori
@jonn-smith What is the requirement here exactly?
optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)
Is it to generate a liftover chain file from two arbitrary references and then do the liftover of the VCF? Would this handle big changes, such as hg19 to hg38? Or only for lifting over similar references, such as b37 to hg19?
I think the idea is to just generate the liftover file between two arbitrary references, not to actually do the liftover.
And yes, ideally it should handle things like hg19 -> hg38, but I'll defer to @jonn-smith on how feasible that is.
If the liftover aspect of the ticket is too difficult, it would be fine to have an initial version of the tool that just prints the differences between two references, and then we can add the liftover capability later.
@kishorikonwar Ah! My bad. I'll post it in a sec.
@LeeTL1220 @droazen My initial goal was to compare several "equivalent" genome FASTA files and produce two things:
- A table like the one I made here which has all the contigs that are the same and different between all input references
- For all contigs that differ, the actual base differences and positions at which they differ. I had envisioned this output to be a VCF file with metadata in the INFO field but now that I'm thinking about it another table with each reference as a column and each row as a position and the base at that position makes more sense (and would be easier to implement).
This would fix the issues that I've left as open questions for how certain versions of "hg19" actually differ. I have since looked into it with that script and I have an answer for some of the positions. I hope we all can handle IUPAC bases in our references! Creating a liftover file in this case would be really nice, and should take minimal effort.
If we're opening the tool to dissimilar references for the same organism, then there are some really tricky issues. What if Reference A has a frameshift relative to Reference B? What is the right way to display / report a pairwise comparison between all references together concisely (if it isn't a table)? Creating a liftover file in cases like this (e.g. hg19 -> hg38) is non-trivial.
@kishorikonwar I had checked in the script a while ago, so that sim link just points to the version in GATK:
https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/testing/compareTwoReferenceDictionaries.sh This only does the sequence dictionary check - not the base-level check, and it's not very sophistocated.
More recently I wrote a script for doing the nucleotide diffs in python. I put it in the jts_refdiff
branch here:
https://github.com/broadinstitute/gatk/blob/jts_refdiff/scripts/refDiff.py
@kishorikonwar It's worth noting that there are some instructions on how to create a chain file / liftover file from UCSC here:
http://genomewiki.ucsc.edu/index.php/LiftOver_Howto
@jonn-smith Thank you. I will be looking at the ucsc. I also found the following tool that implements the ucsc liftover file creation....the logic seems simple. https://github.com/konstantint/pyliftover
@kishorikonwar No problem. One quick note - I'm not sure that pyliftover
actually creates the chain
files to do the liftover. It looks like it can only retrieve and apply chain file liftovers to a given set of coordinates / VCFs (as opposed to creating the chain file which specifies the differences between references).
It's worth checking out for how to do the liftover once the chain file has been created, but to make the chain file itself I think you'll need to reference the UCSC howto or other documentation.
If we were to have this tool create chain files for liftovers then we could create chain files for arbitrary liftovers (e.g. hg19 -> CanFam3.1), which some people might find useful.
Also, another reference to take a quick look at is https://github.com/alshai/levioSAM - I haven't looked closely, but it does something similar.
@jonn-smith @LeeTL1220 @droazen Thanks for sharing the information above, and I looked at it. It seems to me that once we have a chain file for one reference and another reference, the remaining steps are straightforward. I also noticed the following Picard utility Picard LiftoverVcf that can Lift over a VCF file from one reference to another. Therefore, creating the chain file between a pair of references (and limiting ourselves to cases where both references are from the same species, mouse/human) is the key. To that end, according to the following post List of chain file creators most of the chain file creation tools are available as a web interface. However, the UCSC one seems to be more popular, and fortunately, they have the utilities as open source and to some degree explain their steps in the LiverOver_Howto link you sent. With this approach, they first BLAT the pairwise contigs in the reference files and then use the utility DoSameSpeciesLiftOver.pl.
Based on this, it appears to me I should think about the following steps: a) First, try out their code (UCSC) and make sure it works to produce chain files for two references successfully. b) Design/propose a solution putting the logic in DoSameSpeciesLiftOver.pl into GATK, which also might need a BLAT run
Let me know what you think of this or have any suggestions about how I should proceed.
As per discussion with Kishori, he is going to work on a different issue, since it is more inline with his work. He can come back to this if nobody has picked it up.
Re-assigning to @KevinCLydon
Re-assigning to @orlicohen