Is there an issue with ska distance?

Open kristyhoran opened this issue 8 months ago • 1 comments

Thanks for this great tool!! I am migrating my pipelines to ska2 and have noticed some discordance between the SNP distances reported by ska v1 and ska v2. When run on the same inputs, the SNP distances reported by v2 are much greater (at least 10x).

While reading the documentation at https://docs.rs/ska/latest/ska/#ska-distance I noticed that it gives the default --min-freq value as 0.9. However, when I checked the help on the command line, the default value appears to be 0, NOT 0.9.

Calculate SNP distances and k-mer mismatches

Usage: ska distance [OPTIONS] <SKF_FILE>

Arguments:
  <SKF_FILE>  Split-kmer (.skf) file to operate on

Options:
  -o <OUTPUT>                Output filename (omit to output to stdout)
  -m, --min-freq <MIN_FREQ>  Minimum fraction of samples a k-mer has to appear in [default: 0] <---**
      --allow-ambiguous      Filter out ambiguous bases ('N' still a mismatch)
      --threads <THREADS>    Number of CPU threads [default: 1]
  -v, --verbose              Show progress messages
  -h, --help                 Print help
  -V, --version              Print version

Looking at the code, it looks like DEFAULT_MINFREQ is perhaps not set for the Distance command in cli.rs? Could this be the cause of what I am seeing?

I am going to set the --min-freq value explicitly from now on to keep results consistent, but any further help would be great!!

kristyhoran, May 01 '25 06:05

Thanks for noting this. We have also had reports on distance changes in #92, #81 and #69 which may be useful to look at.

Differences are due to different defaults from ska1. I thought we'd made this more compatible, but it does seem like there's some inconsistency left around the min-freq parameter. Have you tried setting it to both 0.9 and 0 to see how your results change?

Can I also check what version you are running, what command you ran, and what the expected (ska1) and returned (ska2) number of SNPs was?
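
For anyone comparing the two settings, here is a minimal sketch of what a frequency filter of this kind does, based only on the help text ("Minimum fraction of samples a k-mer has to appear in"). It is not taken from the ska code: the function names, the toy presence/absence data and the use of `>=` at the threshold are all assumptions made for illustration.

```rust
/// Hypothetical helper (not from ska): build a presence/absence row in which
/// the first `n_present` of `n_samples` samples contain the k-mer.
fn row(n_present: usize, n_samples: usize) -> Vec<bool> {
    (0..n_samples).map(|i| i < n_present).collect()
}

/// Assumed rule: keep a k-mer if the fraction of samples containing it
/// is at least `min_freq` (a value between 0 and 1).
fn passes_min_freq(present: &[bool], min_freq: f64) -> bool {
    let n_present = present.iter().filter(|&&p| p).count();
    n_present as f64 >= min_freq * present.len() as f64
}

/// Count how many k-mers (rows) survive the filter.
fn count_kept(presence: &[Vec<bool>], min_freq: f64) -> usize {
    presence
        .iter()
        .filter(|r| passes_min_freq(r.as_slice(), min_freq))
        .count()
}

/// Toy data: 4 k-mers across 20 samples, present in 20, 19, 10 and 1 samples.
fn example_presence() -> Vec<Vec<bool>> {
    vec![row(20, 20), row(19, 20), row(10, 20), row(1, 20)]
}

fn main() {
    let presence = example_presence();
    for min_freq in [0.0, 0.9] {
        println!(
            "min-freq {:.1}: {} of {} k-mers kept",
            min_freq,
            count_kept(&presence, min_freq),
            presence.len()
        );
    }
}
```

In this toy data every k-mer passes at a threshold of 0, while only the k-mers present in nearly all samples pass at 0.9, so the set of k-mers that any downstream comparison is based on differs between the two defaults.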

johnlees, May 01 '25 08:05

See #97 and #98 -- this should now be fixed in v0.4.1

johnlees, Jul 24 '25 08:07