fastv
fastv copied to clipboard
Strange results
Hello,
Thanks for this amazing tool. I am using fastv in perhaps an unusual way. I'm looking to detect the presence or absence of homologous gene clusters in metagenomic data. I started with ~17 million ORFs from our contigs, and clustered (mmseqs2, 95% identity) them into almost 4 million clusters. I want to detecte the clusters by detecting one or more unique kmers from their representative sequences.
I used unique_kmer
(initially) to identify unique 24mers, but this took a long time, generated millions of files and unique 24-mers could not be found for a majority of sequences. I fell back on jellyfish, 32mers, and a convoluted pipeline of aligning the kmers against the cluster representatives and then filtering the sorted SAM file to include only non-overlapping kmers. This way the vast majority of cluster representatives had one or more unique kmers (mean of 3 and up to hundreds).
I applied fastv
with minimal filtering and lowest thresholds (-A -G -Q -L -p 0.001 -d 0.001
), but only ~100k of ~3.4 million cluster representatives are ever identified across all >200 samples.
I tested further by extracting only unique kmers from one sample and testing them against the sample reads: no hits!
Yet, when I search with seqkit
, I find that the sequence file does indeed contain this kmer three times: "ATGAAATTCCATGGAATGGAATGGAATGGAAA"
e.g.
@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGA*ATGAAATTCCATGGAATGGAATGGAATGGAAA*AGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF
Can you advise why fastv
seems not to be detecting it?
Thanks!
Just to add as a test case, I have provided a fastq sequence file with three reads
@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF
@K00150:414:HK5NLBBXY:1:1121:16457:6976/1
AATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJJFJJJJJAFJJJFFF7<AJJJJFJFJJFJJJJJJJJJJJJJFJJFJJJJJFFFFAJFJJJJJJFJFFJJJF7---77AF<JFJJ-7-
@K00150:414:HK5NLBBXY:2:2212:24393:37396/1
AATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAAAAGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCCACGGAATGGAATGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJ
And a kmer database of:
>test
ATGAAATTCCATGGAATGGAATGGAATGGAAA
or
>test
ATGAAATTCCATGGAATGGAA
Neither the 21-mer or 32-mer was found in the sequences despite their presence.
21-mer example:
{
"kmer_collection_scan_result": {
},
"summary": {
"before_filtering": {
"total_reads":3,
"total_bases":450,
"q20_bases":443,
"q30_bases":432,
"q20_rate":0.984444,
"q30_rate":0.96,
"read1_mean_length":150,
"gc_content":0.373333
},
"after_filtering": {
"total_reads":3,
"total_bases":450,
"q20_bases":443,
"q30_bases":432,
"q20_rate":0.984444,
"q30_rate":0.96,
"read1_mean_length":150,
"gc_content":0.373333
}
},
"filtering_result": {
"passed_filter_reads": 3,
"low_quality_reads": 0,
"too_many_N_reads": 0,
"too_short_reads": 0,
"too_long_reads": 0
},
"duplication": {
"rate": 0,
"histogram": [2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
"mean_gc": [0.264706,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
},
"read1_before_filtering": {
"total_reads": 3,
"total_bases": 450,
"q20_bases": 443,
"q30_bases": 432,
"total_cycles": 150,
"quality_curves": {
... },
"content_curves": {
...
}
},
"read1_after_filtering": {
"total_reads": 3,
"total_bases": 450,
"q20_bases": 443,
"q30_bases": 432,
"total_cycles": 150,
"quality_curves": {
...
},
"content_curves": {
....
}
},
"command": "fastv -i tmp_test_targets.fastq -c tmp_test_kmers.fasta -A -G -V -Q -L -j tmp_test_fastv.json -h tmp_test_fastv.html -p 0.001 -d 0.001 "
I must misunderstand something important about how fastv is operating, or there is an error. I will be grateful for any advice.