
Finding a suitable reference for a set of genomes

Open MostafaYA opened this issue 3 years ago • 3 comments

Hello, thanks for this great tool. Just a question: I wonder how to select an appropriate reference for a set of (diverse) genomes.
When I run referenceseeker in this case, it gives a different reference for each genome.

MostafaYA avatar May 02 '22 12:05 MostafaYA

Hi @MostafaYA, thanks for this excellent question! This is indeed an interesting use case and we have already started working on a solution for it. However, this will still take a while. Maybe we can provide a solution by the end of this year.

oschwengers avatar May 02 '22 20:05 oschwengers

@oschwengers any update on that work? I'm wondering what the best approach would be here? Two passes, the first that finds all candidates for all samples and the second that computes distance to each of these candidates and finds the one with the lowest average distance?

pvanheus avatar Jan 15 '24 06:01 pvanheus

Thanks @pvanheus for bringing this up again. Actually, this just slipped down my priority list. But if there is still a need for and interest in that, I would try to work on this as a side-side project. Unfortunately, I cannot make any reliable commitments to this right now.

Regarding the workflow: as you mentioned, first we have to calculate approximate genome distances (for instance with Mash) as a rough estimate to select reference candidates. Then we have to compute ANI between all query genomes and reference candidates, and then rank and select these references. The main task we tried to work on is how best to rank the reference genomes, as ANI can of course differ a lot between a reference and the given query genomes. How to handle harsh outliers, for example? As a simple approach we played around with classic arithmetic/geometric/harmonic means...

oschwengers avatar Feb 01 '24 09:02 oschwengers
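
For anyone who wants to experiment with the ranking step discussed above, here is a minimal sketch. It is not ReferenceSeeker's implementation: it assumes the candidate references have already been chosen (e.g. by a Mash pre-screen) and that ANI values for every query/candidate pair are already available. The function name `rank_references`, the genome names, and the ANI values are all made up for illustration.

```python
from statistics import mean, geometric_mean, harmonic_mean


def rank_references(ani, aggregate=harmonic_mean):
    """Rank candidate references by an aggregate of their ANI to all queries.

    ani: dict mapping reference name -> dict mapping query name -> ANI (0 < ANI <= 1).
    aggregate: function that reduces a list of ANI values to one score.
    Returns a list of (reference, score) tuples, best score first.
    """
    scores = {
        ref: aggregate(list(per_query.values()))
        for ref, per_query in ani.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ANI values: ref_A is very close to two queries but far from
# a third (an outlier), while ref_B is moderately close to all three.
ani = {
    "ref_A": {"q1": 0.99, "q2": 0.99, "q3": 0.85},
    "ref_B": {"q1": 0.94, "q2": 0.94, "q3": 0.94},
}

for name, aggregate in [
    ("arithmetic", mean),
    ("geometric", geometric_mean),
    ("harmonic", harmonic_mean),
    ("worst case", min),
]:
    best_ref, best_score = rank_references(ani, aggregate)[0]
    print(f"{name:>10}: {best_ref} ({best_score:.4f})")
```

With these made-up numbers the arithmetic and geometric means prefer `ref_A`, while the harmonic mean and the worst-case (minimum) ANI prefer `ref_B`, because they penalize the outlier query more strongly. Which behaviour is desirable depends on whether one poorly covered genome is acceptable for the downstream analysis.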