Is there an issue with ska distance?
Thanks for this great tool! I am migrating my pipelines to ska2 and have noticed a discrepancy between the SNP distances reported by ska v1 and ska v2: on the same inputs, the distances reported by v2 are much greater (at least 10x).
While reading the documentation at https://docs.rs/ska/latest/ska/#ska-distance I noticed that it gives the default --min-freq value as 0.9. However, the command-line help shows the default as 0, not 0.9:
Calculate SNP distances and k-mer mismatches

Usage: ska distance [OPTIONS] <SKF_FILE>

Arguments:
  <SKF_FILE>  Split-kmer (.skf) file to operate on

Options:
  -o <OUTPUT>                Output filename (omit to output to stdout)
  -m, --min-freq <MIN_FREQ>  Minimum fraction of samples a k-mer has to appear in [default: 0]  <---
      --allow-ambiguous      Filter out ambiguous bases ('N' still a mismatch)
      --threads <THREADS>    Number of CPU threads [default: 1]
  -v, --verbose              Show progress messages
  -h, --help                 Print help
  -V, --version              Print version

Looking at the code, it looks like DEFAULT_MINFREQ is perhaps not set for the Distance command in cli.rs. Could this be the cause of what I am seeing?
I am going to set --min-freq explicitly from now on to keep my results consistent, but any further help would be great!
Thanks for noting this. We have also had reports on distance changes in #92, #81 and #69 which may be useful to look at.
Differences are due to different defaults from ska1. I thought we'd made this more compatible but it does seem like there's some inconsistency left around the min-freq parameter. Have you tried setting it as both 0.9 and 0 to see how your results change?
Can I also check which version you are running, what command you ran, and what the expected (ska1) and returned (ska2) numbers of SNPs were?
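To show why the --min-freq default can shift distances, here is a toy Python sketch. It is not ska's implementation. The data, the function names `filter_sites` and `pairwise_distances`, and the rule of skipping sites that are missing in either sample are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy data: each row is one split k-mer, each column one sample.
# A middle base of None means the split k-mer was not found in that sample.
SAMPLES = ["s1", "s2", "s3", "s4"]
SITES = [
    ["A", "A", "A", "C"],    # present in 4/4 samples
    ["G", "T", "G", "G"],    # present in 4/4 samples
    ["C", "T", None, None],  # present in 2/4 samples
    ["A", "G", None, "A"],   # present in 3/4 samples
]


def filter_sites(sites, min_freq):
    """Keep sites whose fraction of samples with a base is >= min_freq."""
    kept = []
    for site in sites:
        present = sum(base is not None for base in site)
        if present / len(site) >= min_freq:
            kept.append(site)
    return kept


def pairwise_distances(sites, samples):
    """Count differing bases per sample pair.

    Assumption: a site missing in either sample of the pair is skipped.
    """
    dists = {}
    for i, j in combinations(range(len(samples)), 2):
        dists[(samples[i], samples[j])] = sum(
            1
            for site in sites
            if site[i] is not None and site[j] is not None and site[i] != site[j]
        )
    return dists


# Compare the two defaults discussed in this thread.
for min_freq in (0.0, 0.9):
    kept = filter_sites(SITES, min_freq)
    print(min_freq, len(kept), pairwise_distances(kept, SAMPLES))
```

In this toy data the s1/s2 distance is 3 with a threshold of 0 and 1 with a threshold of 0.9, because the two sites found in fewer than 90% of samples are dropped before counting. Real differences will depend on how many split k-mers are shared across your samples.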
See #97 and #98 -- this should now be fixed in v0.4.1