ska.rust icon indicating copy to clipboard operation
ska.rust copied to clipboard

Possible column label mix-up and unexpected self-comparison distance in SKA2

Open taffners opened this issue 7 months ago • 4 comments

Hi there,

Thank you for your work on SKA2. I'm using version ska 0.4.0 and running into a couple of issues I could use some help understanding.

1. Column labels – Distance vs Mismatches

In the output from ska distance, it looks like the column headers for Distance and Mismatches might be flipped. For example: Sample1 Sample2 Distance Mismatches NYC112574_urmc_cdiff_1_1 NYC112574_urmc_cdiff_1_2 14160 0.01143

Based on previous SKA behavior and expectations, I would assume:

  • Mismatches refers to the raw integer count (e.g., 14160)
  • Distance is the normalized SNP distance as a float (e.g., 0.01143)

Is this a labeling issue, or does is it something else?

2. Unexpected distance in self-sample comparison

I’m also seeing what looks like a comparison of a sample to itself, but with a non-zero distance and mismatch count.

Originally, the sample was named NYC112574_urmc_cdiff_1 After processing, I now see NYC112574_urmc_cdiff_1_1 and NYC112574_urmc_cdiff_1_2 with the distance result above.

I expected the mismatch count and distance to be 0 if this is truly the same sample.

Am I misunderstanding how SKA2 handles paired reads or how sample IDs are generated in .skf files?

Here are the commands I'm using: ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

ska merge -o merged.skf NYC112574_urmc_cdiff_1.skf NYC112575_urmc_cdiff_1.skf NYC112577_v2_urmc_cdiff_1.skf NYC112579_urmc_cdiff_1.skf NYC112580_urmc_cdiff_1.skf

ska distance --threads 16 -o distance.tsv merged.skf

Any insight would be greatly appreciated! Please let me know if you need example input files or if this could be user error on my end.

Thanks again Samantha

taffners avatar May 22 '25 19:05 taffners

Hi Samantha,

While I'm not an expert/user of ska distance, here are some preliminary answers to your questions (John, please correct me if I'm wrong):

  1. 'Distance' refers to the SNP distance (ie, number of SNPs) so should be an integer, or float depending on the presence of ambiguous nucleotides. 'Mismatches' corresponds to the number of mismatches in the middle-base normalised by the number of split k-mers shared by the two samples, so should be a float ranging from 0 to 1. The labelling is correct. You would usually consider 'Distance' to measure the distance between two closely related isolates. But when the two isolates are very far apart phylogenetically (eg, different species), the number of SNPs might be low simply because they only share a few split k-mers. In such cases, the value 'Mismatches' is more relevant.

  2. I noticed an issue with the way you extract split k-mers:

ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

Using this command, you are actually indicating to SKA that NYC112574_xx_1.fastq.gz and NYC112574_xx_2.fastq.gz correspond to two isolates. If you wish to combine forward and reverse reads in one isolate, which I assume is the case here, you should use the command:

ska build -f list_ska.txt -o NYC112574_urmc_cdiff

with 'list_ska.txt' being a tab-delimited file indicating which files correspond to which isolates. Please see the doc (section 'read files'): https://docs.rs/ska/latest/ska/#ska-build

In your example, your skf file is composed of 1 isolate for the forward reads and another isolate for the reverse reads. In theory Distance should be equal to 0 since forward and reverse reads of a WGS run should share the same split k-mers. So I do not understand this distance of 14160.

Two additional comments: _ you may want to use a higher k-mer size (eg, ska build -k 31) _ there is perhaps no need to use 16 CPUs for ska distance as it is already quite fast

rderelle avatar May 23 '25 08:05 rderelle

  1. Yep @rderelle is right here, mismatches is the proportion of split k-mers mismatching. Distance is number of SNPs distant (which could be normalised by genome or alignment length.

If that's different from SKA1 (do you have any output you could share?) we could document that more clearly.

  1. @rderelle is correct here too, you'll need to use the -f option for read files

johnlees avatar May 27 '25 16:05 johnlees

Hello @johnlees and @rderelle,

The reason why I choose to try ska2 was because I was getting the same results with and without -S tag running ska1. Below is a subset of results for the ska1 with and without -S and ska2. The results are dramatically different. Could this be caused only by how the .skf files were made.

This is an example of how .skf files were made using ska1 ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

ska1: ska distance -s 15 -f skf_file_list.txt -o distance

Sample 1 Sample 2 Matches Mismatches Jaccard Index Mash-like distance SNPs SNP distance
NYC112574_urmc_cdiff_1 NYC112575_urmc_cdiff_1 1583489 5108028 0.236641 0.0309826 33547 0.0211855
NYC112574_urmc_cdiff_1 NYC112577_v2_urmc_cdiff_1 3445084 1566810 0.687382 0.00660956 9284 0.00269485
NYC112574_urmc_cdiff_1 NYC112579_urmc_cdiff_1 3478990 1488012 0.70042 0.0062517 9634 0.00276919
NYC112574_urmc_cdiff_1 NYC112572_urmc_cdiff_1 3483753 1351874 0.720435 0.00572034 9179 0.0026348
NYC112574_urmc_cdiff_1 NYC112570_urmc_cdiff_1 3403951 1423957 0.705057 0.00612671 8950 0.0026293

ska1 with singletons: ska distance -s 15 -S -f skf_file_list.txt -o distance_w_singletons

Sample 1 Sample 2 Matches Mismatches Jaccard Index Mash-like distance SNPs SNP distance
NYC112574_urmc_cdiff_1 NYC112575_urmc_cdiff_1 1583489 5108028 0.236641 0.0309826 33547 0.0211855
NYC112574_urmc_cdiff_1 NYC112577_v2_urmc_cdiff_1 3445084 1566810 0.687382 0.00660956 9284 0.00269485
NYC112574_urmc_cdiff_1 NYC112579_urmc_cdiff_1 3478990 1488012 0.70042 0.0062517 9634 0.00276919
NYC112574_urmc_cdiff_1 NYC112572_urmc_cdiff_1 3483753 1351874 0.720435 0.00572034 9179 0.0026348
NYC112574_urmc_cdiff_1 NYC112570_urmc_cdiff_1 3403951 1423957 0.705057 0.00612671 8950 0.0026293

ska2 results

Sample1 Sample2 Distance Mismatches
NYC112574_urmc_cdiff_1_1 NYC112574_urmc_cdiff_1_2 60450 0.01133
NYC112574_urmc_cdiff_1_1 NYC112575_urmc_cdiff_1_1 139407 0.61765
NYC112574_urmc_cdiff_1_1 NYC112575_urmc_cdiff_1_2 139966 0.61754
NYC112574_urmc_cdiff_1_1 NYC112577_v2_urmc_cdiff_1_1 82871 0.24589
NYC112574_urmc_cdiff_1_1 NYC112577_v2_urmc_cdiff_1_2 82918 0.24837
NYC112574_urmc_cdiff_1_1 NYC112579_urmc_cdiff_1_1 81743 0.22569
NYC112574_urmc_cdiff_1_1 NYC112579_urmc_cdiff_1_2 81428 0.22666
NYC112574_urmc_cdiff_1_1 NYC112572_urmc_cdiff_1_1 79651 0.20944
NYC112574_urmc_cdiff_1_1 NYC112572_urmc_cdiff_1_2 80207 0.21024
NYC112574_urmc_cdiff_1_1 NYC112570_urmc_cdiff_1_1 75877 0.23344
NYC112574_urmc_cdiff_1_1 NYC112570_urmc_cdiff_1_2 78385 0.22058
NYC112574_urmc_cdiff_1_2 NYC112575_urmc_cdiff_1_1 139308 0.61775
NYC112574_urmc_cdiff_1_2 NYC112575_urmc_cdiff_1_2 139983 0.61733
NYC112574_urmc_cdiff_1_2 NYC112577_v2_urmc_cdiff_1_1 82868 0.24616
NYC112574_urmc_cdiff_1_2 NYC112577_v2_urmc_cdiff_1_2 82958 0.24843
NYC112574_urmc_cdiff_1_2 NYC112579_urmc_cdiff_1_1 81756 0.2259
NYC112574_urmc_cdiff_1_2 NYC112579_urmc_cdiff_1_2 81472 0.22636
NYC112574_urmc_cdiff_1_2 NYC112572_urmc_cdiff_1_1 79598 0.20979
NYC112574_urmc_cdiff_1_2 NYC112572_urmc_cdiff_1_2 80213 0.21011
NYC112574_urmc_cdiff_1_2 NYC112570_urmc_cdiff_1_1 75765 0.23365
NYC112574_urmc_cdiff_1_2 NYC112570_urmc_cdiff_1_2 78277 0.22052

taffners avatar May 27 '25 18:05 taffners

You'll definitely need to fix the reads issue before a comparison can be made.

In each of your ska fastq commands:

ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1

This basically becomes a row of the file list for the -f option:

NYC112574_urmc_cdiff_1 NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz

johnlees avatar May 28 '25 09:05 johnlees