ska.rust Possible column label mix-up and unexpected self-comparison distance in SKA2

Hi there,

Thank you for your work on SKA2. I'm using version ska 0.4.0 and running into a couple of issues I could use some help understanding.

1. Column labels – Distance vs Mismatches

In the output from ska distance, it looks like the column headers for Distance and Mismatches might be flipped. For example: Sample1 Sample2 Distance Mismatches NYC112574_urmc_cdiff_1_1 NYC112574_urmc_cdiff_1_2 14160 0.01143

Based on previous SKA behavior and expectations, I would assume:

Mismatches refers to the raw integer count (e.g., 14160)
Distance is the normalized SNP distance as a float (e.g., 0.01143)

Is this a labeling issue, or does is it something else?

2. Unexpected distance in self-sample comparison

I’m also seeing what looks like a comparison of a sample to itself, but with a non-zero distance and mismatch count.

Originally, the sample was named NYC112574_urmc_cdiff_1 After processing, I now see NYC112574_urmc_cdiff_1_1 and NYC112574_urmc_cdiff_1_2 with the distance result above.

I expected the mismatch count and distance to be 0 if this is truly the same sample.

Am I misunderstanding how SKA2 handles paired reads or how sample IDs are generated in .skf files?

Here are the commands I'm using: ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

ska merge -o merged.skf NYC112574_urmc_cdiff_1.skf NYC112575_urmc_cdiff_1.skf NYC112577_v2_urmc_cdiff_1.skf NYC112579_urmc_cdiff_1.skf NYC112580_urmc_cdiff_1.skf

ska distance --threads 16 -o distance.tsv merged.skf

Any insight would be greatly appreciated! Please let me know if you need example input files or if this could be user error on my end.

Thanks again Samantha

May 22 '25 19:05 taffners

Hi Samantha,

While I'm not an expert/user of ska distance, here are some preliminary answers to your questions (John, please correct me if I'm wrong):

'Distance' refers to the SNP distance (ie, number of SNPs) so should be an integer, or float depending on the presence of ambiguous nucleotides. 'Mismatches' corresponds to the number of mismatches in the middle-base normalised by the number of split k-mers shared by the two samples, so should be a float ranging from 0 to 1. The labelling is correct. You would usually consider 'Distance' to measure the distance between two closely related isolates. But when the two isolates are very far apart phylogenetically (eg, different species), the number of SNPs might be low simply because they only share a few split k-mers. In such cases, the value 'Mismatches' is more relevant.
I noticed an issue with the way you extract split k-mers:

ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

Using this command, you are actually indicating to SKA that NYC112574_xx_1.fastq.gz and NYC112574_xx_2.fastq.gz correspond to two isolates. If you wish to combine forward and reverse reads in one isolate, which I assume is the case here, you should use the command:

ska build -f list_ska.txt -o NYC112574_urmc_cdiff

with 'list_ska.txt' being a tab-delimited file indicating which files correspond to which isolates. Please see the doc (section 'read files'): https://docs.rs/ska/latest/ska/#ska-build

In your example, your skf file is composed of 1 isolate for the forward reads and another isolate for the reverse reads. In theory Distance should be equal to 0 since forward and reverse reads of a WGS run should share the same split k-mers. So I do not understand this distance of 14160.

Two additional comments: _ you may want to use a higher k-mer size (eg, ska build -k 31) _ there is perhaps no need to use 16 CPUs for ska distance as it is already quite fast

May 23 '25 08:05 rderelle

Yep @rderelle is right here, mismatches is the proportion of split k-mers mismatching. Distance is number of SNPs distant (which could be normalised by genome or alignment length.

If that's different from SKA1 (do you have any output you could share?) we could document that more clearly.

@rderelle is correct here too, you'll need to use the -f option for read files

May 27 '25 16:05 johnlees

Hello @johnlees and @rderelle,

The reason why I choose to try ska2 was because I was getting the same results with and without -S tag running ska1. Below is a subset of results for the ska1 with and without -S and ska2. The results are dramatically different. Could this be caused only by how the .skf files were made.

This is an example of how .skf files were made using ska1 ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

ska1: ska distance -s 15 -f skf_file_list.txt -o distance

Sample 1	Sample 2	Matches	Mismatches	Jaccard Index	Mash-like distance	SNPs	SNP distance
NYC112574_urmc_cdiff_1	NYC112575_urmc_cdiff_1	1583489	5108028	0.236641	0.0309826	33547	0.0211855
NYC112574_urmc_cdiff_1	NYC112577_v2_urmc_cdiff_1	3445084	1566810	0.687382	0.00660956	9284	0.00269485
NYC112574_urmc_cdiff_1	NYC112579_urmc_cdiff_1	3478990	1488012	0.70042	0.0062517	9634	0.00276919
NYC112574_urmc_cdiff_1	NYC112572_urmc_cdiff_1	3483753	1351874	0.720435	0.00572034	9179	0.0026348
NYC112574_urmc_cdiff_1	NYC112570_urmc_cdiff_1	3403951	1423957	0.705057	0.00612671	8950	0.0026293

ska1 with singletons: ska distance -s 15 -S -f skf_file_list.txt -o distance_w_singletons

Sample 1	Sample 2	Matches	Mismatches	Jaccard Index	Mash-like distance	SNPs	SNP distance
NYC112574_urmc_cdiff_1	NYC112575_urmc_cdiff_1	1583489	5108028	0.236641	0.0309826	33547	0.0211855
NYC112574_urmc_cdiff_1	NYC112577_v2_urmc_cdiff_1	3445084	1566810	0.687382	0.00660956	9284	0.00269485
NYC112574_urmc_cdiff_1	NYC112579_urmc_cdiff_1	3478990	1488012	0.70042	0.0062517	9634	0.00276919
NYC112574_urmc_cdiff_1	NYC112572_urmc_cdiff_1	3483753	1351874	0.720435	0.00572034	9179	0.0026348
NYC112574_urmc_cdiff_1	NYC112570_urmc_cdiff_1	3403951	1423957	0.705057	0.00612671	8950	0.0026293

ska2 results

Sample1	Sample2	Distance	Mismatches
NYC112574_urmc_cdiff_1_1	NYC112574_urmc_cdiff_1_2	60450	0.01133
NYC112574_urmc_cdiff_1_1	NYC112575_urmc_cdiff_1_1	139407	0.61765
NYC112574_urmc_cdiff_1_1	NYC112575_urmc_cdiff_1_2	139966	0.61754
NYC112574_urmc_cdiff_1_1	NYC112577_v2_urmc_cdiff_1_1	82871	0.24589
NYC112574_urmc_cdiff_1_1	NYC112577_v2_urmc_cdiff_1_2	82918	0.24837
NYC112574_urmc_cdiff_1_1	NYC112579_urmc_cdiff_1_1	81743	0.22569
NYC112574_urmc_cdiff_1_1	NYC112579_urmc_cdiff_1_2	81428	0.22666
NYC112574_urmc_cdiff_1_1	NYC112572_urmc_cdiff_1_1	79651	0.20944
NYC112574_urmc_cdiff_1_1	NYC112572_urmc_cdiff_1_2	80207	0.21024
NYC112574_urmc_cdiff_1_1	NYC112570_urmc_cdiff_1_1	75877	0.23344
NYC112574_urmc_cdiff_1_1	NYC112570_urmc_cdiff_1_2	78385	0.22058
NYC112574_urmc_cdiff_1_2	NYC112575_urmc_cdiff_1_1	139308	0.61775
NYC112574_urmc_cdiff_1_2	NYC112575_urmc_cdiff_1_2	139983	0.61733
NYC112574_urmc_cdiff_1_2	NYC112577_v2_urmc_cdiff_1_1	82868	0.24616
NYC112574_urmc_cdiff_1_2	NYC112577_v2_urmc_cdiff_1_2	82958	0.24843
NYC112574_urmc_cdiff_1_2	NYC112579_urmc_cdiff_1_1	81756	0.2259
NYC112574_urmc_cdiff_1_2	NYC112579_urmc_cdiff_1_2	81472	0.22636
NYC112574_urmc_cdiff_1_2	NYC112572_urmc_cdiff_1_1	79598	0.20979
NYC112574_urmc_cdiff_1_2	NYC112572_urmc_cdiff_1_2	80213	0.21011
NYC112574_urmc_cdiff_1_2	NYC112570_urmc_cdiff_1_1	75765	0.23365
NYC112574_urmc_cdiff_1_2	NYC112570_urmc_cdiff_1_2	78277	0.22052

May 27 '25 18:05 taffners

You'll definitely need to fix the reads issue before a comparison can be made.

In each of your ska fastq commands:

ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

This basically becomes a row of the file list for the -f option:

NYC112574_urmc_cdiff_1 NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz

May 28 '25 09:05 johnlees