Automatic selection of core theshold
I wonder if it would be possible to automatically choose an "optimal" core threshold cutoff?
I am not 100% sure what metric we want to optimize, but i feel the distribution of %aligned for each sequence could be informative. ie. a freequency vs %aligned graph. Say most sequences have 10% missing data then the distribution will be "peaky" around 90% and have some samples higher, and others lower tailing off.
My feeling is the threshold should be "somewhere" to the left of the mode of the distribution?
Alternatively, a mode to assess and remove samples below a target threshold could also be useful (and maybe exists as a tool already?)
Thanks - this has given me lots to think about 🤔
Automatically choosing a threshold
For finding an 'optimal' threshold, I suspect that a simple rule (e.g. 95%) might work just as well as tailoring it to the dataset, and it would certainly be simpler. However, if you want to experiment, here's a big one-liner to calculate the mean, median and mode of per-sample data completeness:
seqtk comp core.full.aln | awk '{print ($3+$4+$5+$6)/$2;}' | python3 -c "import sys, numpy, scipy; d = numpy.array([float(x) for x in sys.stdin]); print(f'Mean: {d.mean():.6f}\nMedian: {numpy.median(d):.6f}\nMode: {scipy.stats.mode(d, keepdims=False)[0]:.6f}')"
It uses seqtk to count ACGT characters, awk for completeness calculations and then Python/NumPy/SciPy for averages. Running it on the big 10k Typhi alignment I used in this benchmark, I got:
Mean: 0.967481
Median: 0.968821
Mode: 0.969270
How those values guide the 'best' core threshold is another question entirely - but it's a starting point for testing!
Removing samples with too much missing data
For removing samples with too much missing data, I couldn't find a tool that does this in one step. goalign clean seqs comes close but doesn't group all non-ACGT characters (a feature only implemented for goalign clean sites in this commit).
Here's a workaround that @mtaouk helped me with:
seqtk comp core.full.aln | awk '{if (($3+$4+$5+$6)/$2 >= 0.9) print $1;}' | seqkit grep -f - core.full.aln > filtered.aln
It uses seqtk for ACGT counts, awk to filter sequences above a threshold (90% in this example) and seqkit to extract passing sequences. I haven't thoroughly tested this, so proceed with caution. And if anyone knows of a simpler way to handle per-sample filtering, please share!
I'm hesitant to integrate this into Core-SNP-filter since filtering samples and sites can have interdependent effects (e.g. filtering samples can change which sites pass the core threshold and vice versa). So I'd prefer to keep this as a separate step, like the command above.