
significant unexpected behavior in `split_sequence_file_on_sample_ids.py`

Open: wdwvt1 opened this issue 7 years ago • 1 comment

I know QIIME1 support ends soon, but I wanted to record this information somewhere in case people still using it run into this problem. It also seems serious enough to document, because the unexpected behavior can cause significant downstream errors.

Using the split_sequence_file_on_sample_ids.py script, if you supply an input fasta file but set the option --file_type fastq, the script will write out per-sample fastq files in which every second sequence in the fasta file is used as the quality string for the sequence before it.

For example, if your input.fna file was

>test_sample_0 R0235092:155:000000000-A9A34:1:1101:18633:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
ACTAAA
>test_sample_1 R0235092:155:000000000-A9A34:1:1101:15249:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
CCCCCC

and you ran

split_sequence_file_on_sample_ids.py -i input.fna --file_type 'fastq' -o out_test

you'd get out_test/test_sample_0.fastq looking like

@test_sample_0 R0235092:155:000000000-A9A34:1:1101:18633:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
ACTAAA
+
CCCCCC

Notice that the second input sequence has become the quality string for the first input sequence.
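One way this output could arise is a reader that consumes four lines per record without checking the line prefixes. The sketch below only illustrates that idea: `naive_fastq_records` is a hypothetical function, not the QIIME1 or skbio code, and the headers are shortened stand-ins for the ones above.

```python
def naive_fastq_records(lines):
    """Yield (header, sequence, quality) by reading four lines at a time.

    Hypothetical reader for illustration only: it never checks that the
    header starts with '@' or that the third line starts with '+'.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Drop the first character of the header, whatever it is.
        yield header[1:], seq, qual


# Two fasta records (shortened, made-up headers) fed to the fastq reader.
fasta_lines = [
    ">test_sample_0 read_a",
    "ACTAAA",
    ">test_sample_1 read_b",
    "CCCCCC",
]

records = list(naive_fastq_records(fasta_lines))
for header, seq, qual in records:
    print("@%s\n%s\n+\n%s" % (header, seq, qual))
```

With this input the sketch emits a single record for test_sample_0 whose quality line is CCCCCC, which matches the shape of the output reported above. Whether the real script fails for the same reason is an assumption.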

This is made worse by the fact that the uppercase letters A, C, G, and T are all valid quality characters in Phred+33 encoding, so rather than getting an error at a downstream step, you will have silently halved the number of sequences and written totally misleading quality scores.
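To make the point concrete, here is a small sketch that decodes those letters as Phred+33 scores and adds a pre-flight check on the input file. Both helper names (`phred33_scores`, `sniff_sequence_format`) are hypothetical and not part of QIIME1; the check only looks at the first non-blank line, so treat it as a rough guard rather than a full validator.

```python
def phred33_scores(quality):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def sniff_sequence_format(lines):
    """Guess 'fasta' or 'fastq' from the first non-blank line.

    Hypothetical helper: '>' suggests fasta, '@' suggests fastq,
    anything else (or an empty input) is reported as 'unknown'.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        return "unknown"
    return "unknown"


# Nucleotide letters decode to plausible-looking quality scores.
print(phred33_scores("ACGT"))  # [32, 34, 38, 51]

# Check the declared file type against the file before splitting it.
declared = "fastq"
detected = sniff_sequence_format([">test_sample_0 read_a", "ACTAAA"])
if detected != declared:
    print("input looks like %s but --file_type is %s" % (detected, declared))
```

In practice you could pass an open file handle for `input.fna` as `lines` and stop before calling the script if the detected and declared formats disagree.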

wdwvt1, Dec 05 '17 06:12

Thanks for reporting. Just out of curiosity, are you sure the problem is in QIIME1 and not in another library, like skbio? I'm a bit concerned that the bug still exists somewhere else ...

antgonza, Dec 05 '17 13:12