parsnp
Parsnp -d sometimes fails to recruit random files
Todd,
If I create a folder called 'fasta' containing 20 small identical FASTA files and run "parsnp -r '!' -d fasta", the resulting tree often has only 19 genomes in it, and at other times 20. The 'missing' genome varies from run to run and is absent from the RECRUITED GENOMES list. Running the same command over and over gives different results.
This bug has us confused, so I'm wondering whether it might be a non-deterministic race condition, even though I'm using the default -p 1.
Torsten
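For anyone trying to quantify the behaviour described above, here is a minimal sketch that compares the leaf names in a Newick tree with the file names in the input folder. It does not call Parsnp itself; you pass it the tree file from a finished run. The function names are hypothetical, and the way leaves are labelled (the input file name, optionally with a `.ref` suffix on the reference) is an assumption that may need adjusting for your version.

```python
import re
import sys
from pathlib import Path


def newick_leaves(newick: str) -> set:
    """Return the set of leaf labels found in a Newick string."""
    # A leaf label directly follows '(' or ','; internal node labels follow ')'.
    labels = re.findall(r"[(,]\s*([^(),:;\s]+)", newick)
    return {label.strip("'\"") for label in labels}


def missing_genomes(expected: set, newick: str) -> set:
    """Return the expected genome names that have no leaf in the tree."""
    # Assumption: a leaf is the input file name, optionally followed by ".ref".
    present = {
        leaf[:-4] if leaf.endswith(".ref") else leaf
        for leaf in newick_leaves(newick)
    }
    return set(expected) - present


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_tree.py <fasta_dir> <tree_file>
    fasta_dir, tree_file = Path(sys.argv[1]), Path(sys.argv[2])
    expected = {p.name for p in fasta_dir.iterdir() if p.is_file()}
    missing = missing_genomes(expected, tree_file.read_text())
    print(f"{len(expected) - len(missing)} of {len(expected)} genomes in tree")
    for name in sorted(missing):
        print("missing:", name)
```

Running this after each repeat of the same command gives a per-run list of dropped genomes, which may be easier to share than the full output.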
Torsten,
Thanks for pointing this out. Any chance you could share these 20 small files so I can take a closer look? It sounds like an issue in the MUMi distribution cutoff calc, not a race condition, but will need to debug. In the meantime, adding a -c to the command-line parameters should serve as a work-around (forcing all 20 genomes to be always included).
Hi! I have the same problem. It seems to be random, as Torsten points out, and it doesn't always exclude just one genome but sometimes more. The fasta files have all been generated the same way, with the same format and fasta headers of the form "B128_contig1". There is no error message or anything, so I don't know how to describe it in more detail... Best wishes, Kaisa
hi Kaisa,
Thanks for pointing this out; it is a known issue & will be fixed/addressed in the new release (appearing shortly). In the meantime, please use the (-c) option as a workaround.
best,
Todd
Sorry I never sent you any files. I look forward to the new version.
Hello, I recently started using parsnp and had the exact same problem, using either a gbk or a fasta file as the reference. Any news on the new version addressing this? Cheers, JAC
Thanks João,
I plan to post a new release that will address this issue. In the meantime, you could use '-c' as a workaround. Will keep you posted.
Hello,
I am getting the same exact issues even when using the -c flag. Is there a new release that we should download to work around this issue?
Thank you!
What is the status of this issue?
I've tested a couple directories multiple times and haven't been able to replicate the issue. Can someone please provide an example set?
I am copy/pasting this from our internal issue tracker as I believe it may help with this issue:
Bug was caused by Parsnp when one .fna file name was a number of repetitions of another. Example: 7.fna and 77.fna or 1.fna and 111.fna. This caused some runs of Parsnp to return a newick that was missing one or more of the genomes, leading to the visualization issues. When there were only two genomes and this issue occurred, Parsnp would fail because it could not recognise at least two files.
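To check whether an input directory contains the kind of name pairs described above, a small helper like the following could be run first. This is only a sketch: `repetition_pairs` is a hypothetical name, and it compares file stems (the name without its extension), which is my reading of the 7.fna / 77.fna example rather than anything confirmed about Parsnp's internals.

```python
from pathlib import Path


def repetition_pairs(filenames):
    """Return (short, long) stem pairs where long is short repeated 2+ times."""
    # Drop empty stems and sort by length, then alphabetically, for stable output.
    stems = sorted(
        {Path(name).stem for name in filenames if Path(name).stem},
        key=lambda s: (len(s), s),
    )
    pairs = []
    for i, short in enumerate(stems):
        for long in stems[i + 1:]:
            reps, rest = divmod(len(long), len(short))
            if rest == 0 and reps >= 2 and short * reps == long:
                pairs.append((short, long))
    return pairs


if __name__ == "__main__":
    names = ["7.fna", "77.fna", "1.fna", "111.fna", "12.fna"]
    for short, long in repetition_pairs(names):
        print(f"{long} is a repetition of {short}")
```

Renaming the flagged files before the run would be one way to test whether the names are really what triggers the dropped genomes.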
Issue was solved by making Parsnp retry when these errors occur. This works because the issue does not occur every time, as there is a degree of randomness to Parsnp's results.
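The retry approach could look roughly like the sketch below. It is not the code from our pipeline: `run_until_complete` and the `run_once` callable are hypothetical, and `run_once` is assumed to invoke Parsnp however you normally do and return the names of the genomes that made it into the tree.

```python
from typing import Callable, Iterable, Set, Tuple


def run_until_complete(
    run_once: Callable[[], Iterable[str]],
    expected: Set[str],
    max_attempts: int = 5,
) -> Tuple[int, Set[str]]:
    """Call run_once until every expected genome is returned.

    Returns (attempts_used, still_missing); still_missing is empty on success.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    missing = set(expected)
    for attempt in range(1, max_attempts + 1):
        recruited = set(run_once())
        missing = set(expected) - recruited
        if not missing:
            return attempt, missing
    # All attempts used; report what the last run still lacked.
    return max_attempts, missing
```

Capping the number of attempts matters here, since a genome that is excluded for a legitimate reason would otherwise cause an endless loop.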
@innovate-invent thanks for forwarding this to us! I was unable to replicate this issue with the following approach:
- Ran Parsnp against two files, 1.fna and 11.fna (as well as 1 vs 111, 11 vs 111 etc)
- Ran Parsnp against 1.fna, 11.fna, 111.fna and 111.fna
- Ran Parsnp against many files, which included 1.fna, 11.fna, 111.fna and 111.fna
Could you by any chance attach the output from one of the relevant runs with the --verbose flag? Particularly the runs that fail would be the most helpful. Are you selecting the reference at random? That would be my first guess for files non-deterministically being excluded.
Thanks,
Bryce
Yes, random reference selection was used. I believe this test would have to be run repeatedly as the error does not always occur. We modified our pipeline to avoid this issue. I'll have to find some time to set up another test bench for it.