qiime
significant unexpected behavior in `split_sequence_file_on_sample_ids.py`
I know QIIME1 support ends soon, but I wanted to record this information somewhere in case people still using it run into this problem. It also seems like a reasonably serious unexpected behavior, because it can cause significant downstream errors.
Using the `split_sequence_file_on_sample_ids.py` script, if you supply an input `fasta` file but set the option `--file_type fastq`, the script will write out per-sample `fastq` files using alternating sequences in the `fasta` file as quality scores.
For example, if your `input.fna` file was
>test_sample_0 R0235092:155:000000000-A9A34:1:1101:18633:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
ACTAAA
>test_sample_1 R0235092:155:000000000-A9A34:1:1101:15249:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
CCCCC
and you ran
split_sequence_file_on_sample_ids.py -i input.fna --file_type 'fastq' -o out_test
you'd get `out_test/test_sample_0.fastq` looking like
@test_sample_0 R0235092:155:000000000-A9A34:1:1101:18633:1000 3:N:0: orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
ACTAAA
+
CCCCCC
Notice that the second input sequence has become the quality string for the first input sequence.
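One way to picture the symptom (not necessarily what the script does internally) is a reader that groups lines four at a time without checking the format. The sketch below is a hypothetical illustration: `naive_fastq_records` and the shortened headers are made up for this example and are not QIIME code.

```python
from itertools import zip_longest


def naive_fastq_records(lines):
    """Group non-empty lines four at a time, with no format check.

    Hypothetical helper for illustration only; it is not taken from QIIME.
    """
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    chunks = zip_longest(*[iter(stripped)] * 4, fillvalue="")
    for header, seq, _plus, qual in chunks:
        # The third line (really the next FASTA header) is discarded,
        # and the fourth line (really the next sequence) becomes "quality".
        yield header.lstrip(">@"), seq, qual


# Two FASTA records with shortened, made-up headers.
fasta_lines = [
    ">test_sample_0 read_a",
    "ACTAAA",
    ">test_sample_1 read_b",
    "CCCCCC",
]

records = list(naive_fastq_records(fasta_lines))
for header, seq, qual in records:
    print("@" + header, seq, "+", qual, sep="\n")
```

Two input records collapse into a single output record, which matches the halving described here, whatever the actual cause inside the script turns out to be.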
This is made worse by the fact that the uppercase letters {ACTG} are all valid quality scores in Phred+33, so rather than getting an error with a downstream step, you will just have silently halved the number of sequences and put in totally misleading quality scores.
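To make the Phred+33 point concrete, here is a minimal sketch that decodes nucleotide letters as if they were quality characters, plus a simple pre-flight check you could run before choosing `--file_type`. Both `phred33_scores` and `first_record_marker` are hypothetical helper names for this example, and the check assumes the usual convention that FASTA headers start with `>` and FASTQ headers start with `@`.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def first_record_marker(lines):
    """Guess the format from the first non-empty line.

    Returns "fasta", "fastq", or "unknown". This is a rough sniff,
    not a validator.
    """
    for line in lines:
        if line.strip():
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"


# Nucleotide letters decode to plausible-looking, fairly high scores.
print(phred33_scores("ACGT"))  # [32, 34, 38, 51]

# A FASTA header is flagged before it can be misread as FASTQ.
print(first_record_marker([">test_sample_0 read_a", "ACTAAA"]))  # fasta
```

Running a check like this on the input before calling the script would catch the mismatch up front, since nothing downstream is likely to complain about scores in the 32 to 51 range.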
Thanks for reporting. Just out of curiosity, are you sure the problem is in QIIME1 and not in another library, like skbio? I'm a bit concerned that the bug still exists somewhere else ...