Possible column label mix-up and unexpected self-comparison distance in SKA2
Hi there,
Thank you for your work on SKA2. I'm using version ska 0.4.0 and running into a couple of issues I could use some help understanding.
1. Column labels – Distance vs Mismatches
In the output from ska distance, it looks like the column headers for Distance and Mismatches might be flipped. For example: Sample1 Sample2 Distance Mismatches NYC112574_urmc_cdiff_1_1 NYC112574_urmc_cdiff_1_2 14160 0.01143
Based on previous SKA behavior and expectations, I would assume:
- Mismatches refers to the raw integer count (e.g., 14160)
- Distance is the normalized SNP distance as a float (e.g., 0.01143)
Is this a labeling issue, or does is it something else?
2. Unexpected distance in self-sample comparison
I’m also seeing what looks like a comparison of a sample to itself, but with a non-zero distance and mismatch count.
Originally, the sample was named NYC112574_urmc_cdiff_1 After processing, I now see NYC112574_urmc_cdiff_1_1 and NYC112574_urmc_cdiff_1_2 with the distance result above.
I expected the mismatch count and distance to be 0 if this is truly the same sample.
Am I misunderstanding how SKA2 handles paired reads or how sample IDs are generated in .skf files?
Here are the commands I'm using: ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1
ska merge -o merged.skf NYC112574_urmc_cdiff_1.skf NYC112575_urmc_cdiff_1.skf NYC112577_v2_urmc_cdiff_1.skf NYC112579_urmc_cdiff_1.skf NYC112580_urmc_cdiff_1.skf
ska distance --threads 16 -o distance.tsv merged.skf
Any insight would be greatly appreciated! Please let me know if you need example input files or if this could be user error on my end.
Thanks again Samantha
Hi Samantha,
While I'm not an expert/user of ska distance, here are some preliminary answers to your questions (John, please correct me if I'm wrong):
-
'Distance' refers to the SNP distance (ie, number of SNPs) so should be an integer, or float depending on the presence of ambiguous nucleotides. 'Mismatches' corresponds to the number of mismatches in the middle-base normalised by the number of split k-mers shared by the two samples, so should be a float ranging from 0 to 1. The labelling is correct. You would usually consider 'Distance' to measure the distance between two closely related isolates. But when the two isolates are very far apart phylogenetically (eg, different species), the number of SNPs might be low simply because they only share a few split k-mers. In such cases, the value 'Mismatches' is more relevant.
-
I noticed an issue with the way you extract split k-mers:
ska build NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1
Using this command, you are actually indicating to SKA that NYC112574_xx_1.fastq.gz and NYC112574_xx_2.fastq.gz correspond to two isolates. If you wish to combine forward and reverse reads in one isolate, which I assume is the case here, you should use the command:
ska build -f list_ska.txt -o NYC112574_urmc_cdiff
with 'list_ska.txt' being a tab-delimited file indicating which files correspond to which isolates. Please see the doc (section 'read files'): https://docs.rs/ska/latest/ska/#ska-build
In your example, your skf file is composed of 1 isolate for the forward reads and another isolate for the reverse reads. In theory Distance should be equal to 0 since forward and reverse reads of a WGS run should share the same split k-mers. So I do not understand this distance of 14160.
Two additional comments: _ you may want to use a higher k-mer size (eg, ska build -k 31) _ there is perhaps no need to use 16 CPUs for ska distance as it is already quite fast
- Yep @rderelle is right here, mismatches is the proportion of split k-mers mismatching. Distance is number of SNPs distant (which could be normalised by genome or alignment length.
If that's different from SKA1 (do you have any output you could share?) we could document that more clearly.
- @rderelle is correct here too, you'll need to use the
-foption for read files
Hello @johnlees and @rderelle,
The reason why I choose to try ska2 was because I was getting the same results with and without -S tag running ska1. Below is a subset of results for the ska1 with and without -S and ska2. The results are dramatically different. Could this be caused only by how the .skf files were made.
This is an example of how .skf files were made using ska1 ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1
ska1: ska distance -s 15 -f skf_file_list.txt -o distance
| Sample 1 | Sample 2 | Matches | Mismatches | Jaccard Index | Mash-like distance | SNPs | SNP distance |
|---|---|---|---|---|---|---|---|
| NYC112574_urmc_cdiff_1 | NYC112575_urmc_cdiff_1 | 1583489 | 5108028 | 0.236641 | 0.0309826 | 33547 | 0.0211855 |
| NYC112574_urmc_cdiff_1 | NYC112577_v2_urmc_cdiff_1 | 3445084 | 1566810 | 0.687382 | 0.00660956 | 9284 | 0.00269485 |
| NYC112574_urmc_cdiff_1 | NYC112579_urmc_cdiff_1 | 3478990 | 1488012 | 0.70042 | 0.0062517 | 9634 | 0.00276919 |
| NYC112574_urmc_cdiff_1 | NYC112572_urmc_cdiff_1 | 3483753 | 1351874 | 0.720435 | 0.00572034 | 9179 | 0.0026348 |
| NYC112574_urmc_cdiff_1 | NYC112570_urmc_cdiff_1 | 3403951 | 1423957 | 0.705057 | 0.00612671 | 8950 | 0.0026293 |
ska1 with singletons: ska distance -s 15 -S -f skf_file_list.txt -o distance_w_singletons
| Sample 1 | Sample 2 | Matches | Mismatches | Jaccard Index | Mash-like distance | SNPs | SNP distance |
|---|---|---|---|---|---|---|---|
| NYC112574_urmc_cdiff_1 | NYC112575_urmc_cdiff_1 | 1583489 | 5108028 | 0.236641 | 0.0309826 | 33547 | 0.0211855 |
| NYC112574_urmc_cdiff_1 | NYC112577_v2_urmc_cdiff_1 | 3445084 | 1566810 | 0.687382 | 0.00660956 | 9284 | 0.00269485 |
| NYC112574_urmc_cdiff_1 | NYC112579_urmc_cdiff_1 | 3478990 | 1488012 | 0.70042 | 0.0062517 | 9634 | 0.00276919 |
| NYC112574_urmc_cdiff_1 | NYC112572_urmc_cdiff_1 | 3483753 | 1351874 | 0.720435 | 0.00572034 | 9179 | 0.0026348 |
| NYC112574_urmc_cdiff_1 | NYC112570_urmc_cdiff_1 | 3403951 | 1423957 | 0.705057 | 0.00612671 | 8950 | 0.0026293 |
ska2 results
| Sample1 | Sample2 | Distance | Mismatches |
|---|---|---|---|
| NYC112574_urmc_cdiff_1_1 | NYC112574_urmc_cdiff_1_2 | 60450 | 0.01133 |
| NYC112574_urmc_cdiff_1_1 | NYC112575_urmc_cdiff_1_1 | 139407 | 0.61765 |
| NYC112574_urmc_cdiff_1_1 | NYC112575_urmc_cdiff_1_2 | 139966 | 0.61754 |
| NYC112574_urmc_cdiff_1_1 | NYC112577_v2_urmc_cdiff_1_1 | 82871 | 0.24589 |
| NYC112574_urmc_cdiff_1_1 | NYC112577_v2_urmc_cdiff_1_2 | 82918 | 0.24837 |
| NYC112574_urmc_cdiff_1_1 | NYC112579_urmc_cdiff_1_1 | 81743 | 0.22569 |
| NYC112574_urmc_cdiff_1_1 | NYC112579_urmc_cdiff_1_2 | 81428 | 0.22666 |
| NYC112574_urmc_cdiff_1_1 | NYC112572_urmc_cdiff_1_1 | 79651 | 0.20944 |
| NYC112574_urmc_cdiff_1_1 | NYC112572_urmc_cdiff_1_2 | 80207 | 0.21024 |
| NYC112574_urmc_cdiff_1_1 | NYC112570_urmc_cdiff_1_1 | 75877 | 0.23344 |
| NYC112574_urmc_cdiff_1_1 | NYC112570_urmc_cdiff_1_2 | 78385 | 0.22058 |
| NYC112574_urmc_cdiff_1_2 | NYC112575_urmc_cdiff_1_1 | 139308 | 0.61775 |
| NYC112574_urmc_cdiff_1_2 | NYC112575_urmc_cdiff_1_2 | 139983 | 0.61733 |
| NYC112574_urmc_cdiff_1_2 | NYC112577_v2_urmc_cdiff_1_1 | 82868 | 0.24616 |
| NYC112574_urmc_cdiff_1_2 | NYC112577_v2_urmc_cdiff_1_2 | 82958 | 0.24843 |
| NYC112574_urmc_cdiff_1_2 | NYC112579_urmc_cdiff_1_1 | 81756 | 0.2259 |
| NYC112574_urmc_cdiff_1_2 | NYC112579_urmc_cdiff_1_2 | 81472 | 0.22636 |
| NYC112574_urmc_cdiff_1_2 | NYC112572_urmc_cdiff_1_1 | 79598 | 0.20979 |
| NYC112574_urmc_cdiff_1_2 | NYC112572_urmc_cdiff_1_2 | 80213 | 0.21011 |
| NYC112574_urmc_cdiff_1_2 | NYC112570_urmc_cdiff_1_1 | 75765 | 0.23365 |
| NYC112574_urmc_cdiff_1_2 | NYC112570_urmc_cdiff_1_2 | 78277 | 0.22052 |
You'll definitely need to fix the reads issue before a comparison can be made.
In each of your ska fastq commands:
ska fastq NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz -o NYC112574_urmc_cdiff_1
This basically becomes a row of the file list for the -f option:
NYC112574_urmc_cdiff_1 NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_1.fastq.gz NYC112574_urmc_cdiff_1/NYC112574_urmc_cdiff_1_2.fastq.gz