gatk icon indicating copy to clipboard operation
gatk copied to clipboard

New Tool: Reference Comparator

Open jonn-smith opened this issue 4 years ago • 16 comments

We need a tool to compare multiple references and spit out a TSV (or similar) detailing what the differences are. Additionally it should be able to spit out a liftover file that will properly move a variant from one reference to another.

We should first compare the sequence dictionaries in the references to see if they have equal lengths and checksums - the names may differ and we should track this so we can definitively say which contigs are equivalent. After this, we should walk the references and find out specifically which bases differ between contigs that have different checksums (with some limits on the number of differences between them so we don't get bogged down by hg19 vs hg38 comparisons). Then it should create a liftover file from those comparisons so the data can be easily converted between the references given.

Additionally, it should be able to take a variant file and a set of references and say:

  • whether the variant file "belongs" to one of the given references
  • if it isn't exactly from one of the given references, which reference is closest
  • optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

This will finally lay to rest the questions raised by my blog post about "HG19".

I believe Adam Phillipy had created a perl script that does something similar to this, but a brief view of his github page doesn't show anything like that anymore (maybe it was called refdiff or similar).

I created a bash script that does something similar to this (see attached), but it only looks at the sequence dictionaries. It produces a table similar to that in the above blog post. For example:

MD5 HG38(Homo_sapiens_assembly38.dict) HG38_WEIRD(genome.hg38rg.fa.dict)
1e95e047b98ed92148dd84d6c037158c chr1_KI270708v1_random 1_KI270708v1_random
42f7a452b8b769d051ad738ee9f00631 chr1_KI270714v1_random 1_KI270714v1_random
4e2db2933ea96aee8dab54af60ecb37d chr1_KI270709v1_random 1_KI270709v1_random
62def1a794b3e18192863d187af956e6 chr1_KI270706v1_random 1_KI270706v1_random
6aef897c3d6ff0c78aff06ac189178dd chr1 1
78135804eb15220565483b7cdd02f3be chr1_KI270707v1_random 1_KI270707v1_random
...

compareTwoReferenceDictionaries.tar.gz

@droazen @bhanugandham - we can discuss what other features we would want for this.

jonn-smith avatar Sep 21 '20 17:09 jonn-smith

Good idea. Another use case for this would be to take a bam/cram and set of references, and see if there is a reference suitable for the bam/cram, which would super useful for finding the reference for an existing CRAM, and also for finding a suitable reference to use to do a bam->cram conversion (which comes up a lot when adding cram tests).

cmnbroad avatar Sep 21 '20 18:09 cmnbroad

This is great! Thanks for putting this together Jonn.

Could we also add the option to compare a vcf and a reference to see if the vcf was generated using that reference? Sometime users have vcfs from a previous study and don't know for sure if they are using the right ref. We see this often. Users want to use post-variantcalling tools and end up getting weird errors due to a wrong reference.

bhanugandham avatar Sep 23 '20 05:09 bhanugandham

@KevinCLydon Reassigning this one to you, as discussed.

droazen avatar Mar 15 '21 18:03 droazen

@jonn-smith I am working on this ticket and was looking for the script "compareTwoReferenceDictionaries.tar.gz" you uploaded on the ticket. Looks like this is not the actual file but it gives a soft link with a path on your laptop. Could you point me to the script? Thanks Kishori

kishorikonwar avatar Mar 15 '21 20:03 kishorikonwar

@jonn-smith What is the requirement here exactly?

optionally: a lifted-over version of that VCF to the closest reference (with a bunch of warnings, if applicable)

Is it to generate a liftover chain file from two arbitrary references and then do the liftover of the VCF? Would this handle big changes, such as hg19 to hg38? Or only for lifting over similar references, such as b37 to hg19?

LeeTL1220 avatar Mar 15 '21 20:03 LeeTL1220

I think the idea is to just generate the liftover file between two arbitrary references, not to actually do the liftover.

droazen avatar Mar 15 '21 20:03 droazen

And yes, ideally it should handle things like hg19 -> hg38, but I'll defer to @jonn-smith on how feasible that is.

If the liftover aspect of the ticket is too difficult, it would be fine to have an initial version of the tool that just prints the differences between two references, and then we can add the liftover capability later.

droazen avatar Mar 15 '21 20:03 droazen

@kishorikonwar Ah! My bad. I'll post it in a sec.

@LeeTL1220 @droazen My initial goal was to compare several "equivalent" genome FASTA files and produce two things:

  1. A table like the one I made here which has all the contigs that are the same and different between all input references
  2. For all contigs that differ, the actual base differences and positions at which they differ. I had envisioned this output to be a VCF file with metadata in the INFO field but now that I'm thinking about it another table with each reference as a column and each row as a position and the base at that position makes more sense (and would be easier to implement).

This would fix the issues that I've left as open questions for how certain versions of "hg19" actually differ. I have since looked into it with that script and I have an answer for some of the positions. I hope we all can handle IUPAC bases in our references! Creating a liftover file in this case would be really nice, and should take minimal effort.

If we're opening the tool to dissimilar references for the same organism, then there are some really tricky issues. What if Reference A has a frameshift relative to Reference B? What is the right way to display / report a pairwise comparison between all references together concisely (if it isn't a table)? Creating a liftover file in cases like this (e.g. hg19 -> hg38) is non-trivial.

jonn-smith avatar Mar 15 '21 21:03 jonn-smith

@kishorikonwar I had checked in the script a while ago, so that sim link just points to the version in GATK:

https://github.com/broadinstitute/gatk/blob/master/scripts/funcotator/testing/compareTwoReferenceDictionaries.sh This only does the sequence dictionary check - not the base-level check, and it's not very sophistocated.

More recently I wrote a script for doing the nucleotide diffs in python. I put it in the jts_refdiff branch here: https://github.com/broadinstitute/gatk/blob/jts_refdiff/scripts/refDiff.py

jonn-smith avatar Mar 16 '21 14:03 jonn-smith

@kishorikonwar It's worth noting that there are some instructions on how to create a chain file / liftover file from UCSC here:

http://genomewiki.ucsc.edu/index.php/LiftOver_Howto

jonn-smith avatar Mar 16 '21 16:03 jonn-smith

@jonn-smith Thank you. I will be looking at the ucsc. I also found the following tool that implements the ucsc liftover file creation....the logic seems simple. https://github.com/konstantint/pyliftover

kishorikonwar avatar Mar 16 '21 16:03 kishorikonwar

@kishorikonwar No problem. One quick note - I'm not sure that pyliftover actually creates the chain files to do the liftover. It looks like it can only retrieve and apply chain file liftovers to a given set of coordinates / VCFs (as opposed to creating the chain file which specifies the differences between references).

It's worth checking out for how to do the liftover once the chain file has been created, but to make the chain file itself I think you'll need to reference the UCSC howto or other documentation.

If we were to have this tool create chain files for liftovers then we could create chain files for arbitrary liftovers (e.g. hg19 -> CanFam3.1), which some people might find useful.

Also, another reference to take a quick look at is https://github.com/alshai/levioSAM - I haven't looked closely, but it does something similar.

jonn-smith avatar Mar 16 '21 17:03 jonn-smith

@jonn-smith @LeeTL1220 @droazen Thanks for sharing the information above, and I looked at it. It seems to me that once we have a chain file for one reference and another reference, the remaining steps are straightforward. I also noticed the following Picard utility Picard LiftoverVcf that can Lift over a VCF file from one reference to another. Therefore, creating the chain file between a pair of references (and limiting ourselves to cases where both references are from the same species, mouse/human) is the key. To that end, according to the following post List of chain file creators most of the chain file creation tools are available as a web interface. However, the UCSC one seems to be more popular, and fortunately, they have the utilities as open source and to some degree explain their steps in the LiverOver_Howto link you sent. With this approach, they first BLAT the pairwise contigs in the reference files and then use the utility DoSameSpeciesLiftOver.pl.

Based on this, it appears to me I should think about the following steps: a) First, try out their code (UCSC) and make sure it works to produce chain files for two references successfully. b) Design/propose a solution putting the logic in DoSameSpeciesLiftOver.pl into GATK, which also might need a BLAT run

Let me know what you think of this or have any suggestions about how I should proceed.

kishorikonwar avatar Mar 22 '21 16:03 kishorikonwar

As per discussion with Kishori, he is going to work on a different issue, since it is more inline with his work. He can come back to this if nobody has picked it up.

LeeTL1220 avatar Mar 24 '21 16:03 LeeTL1220

Re-assigning to @KevinCLydon

droazen avatar Mar 24 '21 17:03 droazen

Re-assigning to @orlicohen

droazen avatar Jun 29 '22 16:06 droazen