slamdunk icon indicating copy to clipboard operation
slamdunk copied to clipboard

Invalid quality values in bam file when run with fasta.

Open fabou-uobaf opened this issue 6 years ago • 4 comments

I mapped a fasta (instead of fastq), so holding no quality score in there. Sam specs demand the QUAL field to have the same length as the SEQ field, or to be a ''. In slamdunk dunk it seams that a bam file is produced where the QUAL score if not available consists of an '' and additional seq_length-1 non printable characters.

please try samtools view slamdunk_mapped.bam | perl -F"\t" -lane 'BEGIN{print join("\t", qw/totalLength >QUALstring< lengthNonprintableChars lengthPrintableChars/)}$cnt_n_print = 0; $cnt_print = 0;while ($F[10] =~ m/[^[:print:]]/g) {$cnt_n_print++};while ($F[10] =~ m/[[:print:]]/g) {$cnt_print++};print join("\t", length($F[10]), ">$F[10]<", $cnt_n_print, $cnt_print);' | head

fabou-uobaf avatar Jun 07 '18 20:06 fabou-uobaf

I'm actually surprised slamdunk does not crash immediately. Why would you want to map a fasta file and then what exactly are you expecting the qual-score field in the resulting bam file to be?

t-neumann avatar Jun 07 '18 20:06 t-neumann

Sorry, I lost a few characters in my previous post, in particular Asteriks *.

Bottom line what I wanted to make you aware of is that the bam file produced by slamdunk does not adhere to the sam file specificiation if there is no sequence quality information available. This prohibits e.g. to run samtools view -bS on the before decompressed sam-file. I though this might be of interest to you.

From the sam file specifications: "QUAL: ... This field can be a * when quality is not stored. If not a *, SEQ must not be a * and the length of the quality string ought to equal the length of SEQ."

fabou-uobaf avatar Jun 08 '18 09:06 fabou-uobaf

Ok I didn't know that - will forward it to the NextGenMap developer, that's not really something Slamdunk itself introduces. Thanks.

t-neumann avatar Jun 08 '18 09:06 t-neumann

Hi and thank you for reporting this.

Unfortunately, I could not reproduce it. I ran slamdunk map and slamdunk filter on a small FASTA file and got '*' as quality string for all reads. Your script gave the expected result as well:

totalLength	>QUALstring<	lengthNonprintableChars	lengthPrintableChars
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1
1	>*<	0	1

Could you send me a part of the FASTA file that caused the problem? And could you please tell me the version of slamdunk and pysam you are using and how you installed SlamDunk?

pip freeze | grep pysam

Thanks, Philipp

philres avatar Jun 12 '18 21:06 philres