How to improve performance for Illumina-based analysis of viral pangenome

Open dandaman opened this issue 3 years ago • 12 comments
Hi,

I'd like to use pandora to study a viral pangenome on the basis of Illumina data. Altogether I'm looking at 13 input MSAs, each comprising ~300 high-quality reference sequences. I'd like to study new samples using a pangenome graph.

As I am also looking at gramtools in parallel and wanted to assess the performance of both, I used two Sanger sequenced strains: one as reference and the other to simulate an Illumina sample at 100x and 300x coverage.
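For reference, the number of read pairs needed for a given depth follows from coverage = (pairs × 2 × read length) / genome length. The sketch below is only an illustration of that arithmetic, assuming paired-end reads. The 30 kb genome length and 150 bp read length are hypothetical values not stated in this thread, and `read_pairs_for_coverage` is an illustrative helper, not part of any read simulator.

```python
import math


def read_pairs_for_coverage(genome_length: int, coverage: float, read_length: int = 150) -> int:
    """Return the number of read pairs whose total bases give roughly
    `coverage` x depth over a genome of `genome_length` bases.

    Assumes paired-end reads, so each pair contributes 2 * read_length bases.
    """
    if genome_length <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("genome_length, read_length and coverage must be positive")
    return math.ceil(coverage * genome_length / (2 * read_length))


# Hypothetical 30 kb viral genome, 150 bp paired-end reads.
pairs_100x = read_pairs_for_coverage(30_000, 100)
pairs_300x = read_pairs_for_coverage(30_000, 300)
print(pairs_100x, pairs_300x)
```

The resulting pair counts would then be passed to whichever simulator is used; the thread does not name one.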

As it's a single sample, I've used pandora map --vcf-refs filename --kg --loci-vcf -M -I --clean --genotype

Following the suggested workflow (make_prg=0.1.1/bioconda with max_nesting: 5 and min_match_length: 7; pandora=0.9.1/bioconda), I was surprised to see that the returned personalised reference/pandora.consensus.fq.gz diverges substantially from the Sanger ref. In 7 of the 13 prgs, the primary "allele" diverges from the Sanger ref by an edit distance of up to 2. 4 of the prgs even generate secondary "alleles" (e.g. prg_name.12) with much higher edit distances (15-54).

Is this to be expected? Or am I doing something wrong? If not, what can I do to improve the performance?
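The divergence figures above are edit distances between each consensus sequence and the matching Sanger sequence. A minimal sketch of such a comparison is shown here, assuming plain Levenshtein distance (the thread does not say which tool or metric definition was used). `edit_distance` is an illustrative helper and the example sequences are made up; in practice the inputs would be the sequences read from pandora.consensus.fq.gz and the Sanger reference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-base substitutions,
    insertions and deletions needed to turn `a` into `b`."""
    # Keep the shorter sequence in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if base_a == base_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Made-up sequences: one substitution and one deletion relative to the reference.
reference = "ACGTACGTAC"
consensus = "ACGAACGAC"
print(edit_distance(reference, consensus))
```

This quadratic-time version is fine for individual loci of a few kilobases; for longer sequences a dedicated aligner would be a more practical choice.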

Best, Daniel

dandaman avatar Feb 23 '22 12:02 dandaman

This is super interesting. Would it be acceptable to share the data so we could take a look at what is going on? We could do it via email if you prefer not to share on GitHub.

iqbal-lab avatar Feb 23 '22 12:02 iqbal-lab

Thanks for your super-fast response! Of course, gladly - where can I find your email address?

dandaman avatar Feb 23 '22 12:02 dandaman

Hello @dandaman ,

could you send it to leandro [at] ebi [dot] ac [dot] uk? I will debug the execution with your data and try to understand it. If the data is too large to be sent by email, please tell me and I will provide you with a link to upload it.

Cheers

leoisl avatar Feb 23 '22 12:02 leoisl

Hi @leoisl , did you receive my email? Best, Daniel

dandaman avatar Mar 09 '22 20:03 dandaman

Hey @dandaman ,

yes, I did receive it and just replied! Sorry for the delay, I did not manage to access my mail during the day as I was focusing on finishing a PR!

Cheers

leoisl avatar Mar 10 '22 00:03 leoisl

Hi @leoisl, did you have a chance to look at the data I sent yet? Best, Daniel

dandaman avatar Mar 22 '22 08:03 dandaman

Hello @dandaman ,

Yes, but I was only able to take a quick look. I had to switch priorities to an urgent task last week, but will be able to take a detailed look by the end of next week. Sorry for the delay :(

Cheers

leoisl avatar Mar 22 '22 09:03 leoisl

Hi @leoisl ,

did you have time yet to look into this issue?

Best, Daniel

dandaman avatar Jun 01 '22 08:06 dandaman

Hi @dandaman , can't comment on the pandora side, but was wondering if you were able to run gramtools and if so how close its personalised ref was to your Sanger ref?

bricoletc avatar Jun 01 '22 12:06 bricoletc

Hi @bricoletc,

yes I've used gramtools as well in the same simulation experiment and it worked perfectly! I'd have to look up the details, but if I remember correctly it was 100% id to the simulated reference :-)

So for the time being, with viruses I've continued with gramtools only. But I am eager to work with pandora too, as I'd like to apply this to eukaryotic genomes as well. I'm not sure gramtools would scale to that natively. Or do you have experience with that?

Best, Daniel

dandaman avatar Jun 01 '22 14:06 dandaman

Good to hear! gramtools doesn't scale well to large eukaryotic genomes (e.g. human). I use it on P. falciparum, a small eukaryotic genome (23 Mbp), and that's fine (e.g. 1-5 hours runtime). However, I'm not sure how it would fare on more intermediate-size genomes, e.g. on the order of 100s of Mbp.

I also can't comment on size scaling for pandora, though I think it's mostly been tested on bacterial genomes on the order of 1-10 Mbp (@leoisl )

bricoletc avatar Jun 01 '22 15:06 bricoletc

Pandora has been extensively tested on E. coli, and some big plasmid databases, but IDK how much this second one would amount to...

Very sorry for my lack of updates @dandaman , the "quick" 2-week high-priority task I had to do became a 1-month task, and right after it I got allocated to another one that is still ongoing. Will take a look at this this week though, because otherwise I will have to postpone it even further, but I don't think I should postpone more.

Cheers and thanks for the reminder

leoisl avatar Jun 01 '22 15:06 leoisl