
custom databases - manual intervention required to complete code for sequence reads

Open pjbiggs opened this issue 1 year ago • 0 comments

Hi there,

I am using SRST2 with a custom database to search for a small variable gene region (~320 bp with flanking sequence) within a set of Campylobacter sp. genomes. I have made a small database of unique sequences from a much larger sequence dataset using the provided instructions. This small dataset has 100 sequences, and clusters to 5 sequences at c = 0.9 within cdhit-est. I have made the sequence names as simple as possible in case that was the issue. My problem is that the run cannot complete without manual intervention (pressing Ctrl-C) after the line <mpileup> Set max per-file depth to 8000, as shown below (I have changed the input file names, but everything else is unchanged):

testOfFlankingBla$ time python2 ~/software/srst2/scripts/srst2.py --input_pe ../flaA_singleTest/SRRxxxxx_1.fastq.gz ../flaA_singleTest/SRRxxxxx_2.fastq.gz --output SRRxxxxx --gene_db ../flankingBlaBit_cdhit.fasta --log
1968887 reads; of these:
  1968887 (100.00%) were paired; of these:
    1968800 (100.00%) aligned concordantly 0 times
    9 (0.00%) aligned concordantly exactly 1 time
    78 (0.00%) aligned concordantly >1 times
    ----
    1968800 pairs aligned concordantly 0 times; of these:
      0 (0.00%) aligned discordantly 1 time
    ----
    1968800 pairs aligned 0 times concordantly or discordantly; of these:
      3937600 mates make up the pairs; of these:
        3937577 (100.00%) aligned 0 times
        4 (0.00%) aligned exactly 1 time
        19 (0.00%) aligned >1 times
0.01% overall alignment rate
[samopen] SAM header is present: 100 sequences.
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
sh: 1: OXC8243__27943: not found
sh: 1: OXC8243__00001: not found
^Csh: 1: NCTC11168__48: not found
sh: 1: NCTC11168__00008: not found
^Csh: 1: ARI2590__39380: not found
sh: 1: ARI2590__00095: not found
^Csh: 1: 8096__00098: not found
sh: 1: 8096__24271: not found
^C
real 14m28.381s
user 1m5.051s
sys 0m3.154s

I let this run continue (~14 minutes) to see if it was a timing issue (it wasn't). However, I reach the <mpileup> Set max per-file depth to 8000 line after about 90 seconds. Automating this on a folder of Illumina PE sequences is therefore currently not possible. I do get output, including a table of hits. Do you have any idea why this is happening, and how to solve it?
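
As a rough illustration of the batch run that the Ctrl-C requirement blocks, the sketch below pairs up read files and builds one srst2 command per sample, using only the flags from the command above. The `_1.fastq.gz` / `_2.fastq.gz` suffixes follow the file names shown; the helper names `pair_reads` and `build_command` and the default script path are hypothetical, and the sketch only builds the commands without running them.

```python
from pathlib import Path


def pair_reads(names, fwd_suffix="_1.fastq.gz", rev_suffix="_2.fastq.gz"):
    """Map sample name -> (forward, reverse) for forward files with a matching mate.

    The suffixes are an assumption based on the file names in the command above.
    """
    present = set(names)
    pairs = {}
    for name in sorted(present):
        if name.endswith(fwd_suffix):
            stem = name[: -len(fwd_suffix)]
            mate = stem + rev_suffix
            if mate in present:
                # Use the file name without directory or suffix as the sample label.
                pairs[Path(stem).name] = (name, mate)
    return pairs


def build_command(sample, fwd, rev, gene_db, script="srst2.py"):
    """Build the argument list for one sample (hypothetical helper).

    Only the flags that appear in the original command are used.
    """
    return [
        "python2", script,
        "--input_pe", fwd, rev,
        "--output", sample,
        "--gene_db", gene_db,
        "--log",
    ]
```

Each resulting list could be passed to `subprocess.run`, but as long as every sample stalls after the mpileup line, such a loop would stop at the first sample.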

Thanks in advance,

Patrick

pjbiggs, Jan 30 '24 00:01