fastp
Changing ID during merging leads to 'corrupt' fastq-files
Hello there, thank you for fastp. I have used it for roughly a year and am very happy with it.
However, there is one thing that bugs me. When fastp merges paired-end files, it also changes the description on the ID line, and I couldn't find an option to turn this behaviour off. When I then ran cutadapt on the merged file, it refused to process it, because the description on the + line no longer matches the ID line. To be honest, I think cutadapt should print a warning and be done with it. They are kind of right though: according to most sources, the + should be followed either by the same ID or by nothing at all (just \n).
Please let me know what you think, all the best!
@opengene and @sfchen,
Hoping you can (re)visit this, since it is actually pretty critical: it breaks popular downstream tools such as Biopython. The merged FASTQ looks like this, which is technically invalid FASTQ:
@read1 5 length=151 merged_151_88
ACGTAGGCTCGGCGAAGAAGAACACGACCAGCCGCCGAACCCAGGCGGACGCAGGAGGAAATTGTGGCTGGTGACACCACCATCACCATCGTCGGAAATCTGACCGCTGACCCCGAGCTGCGGTTCACCCCGACCGGTGCGGCCGTGGCGAATTTCACCGTGGCGTCAACGCCCCGGATCTATGACCGTCAGACCGGCGAATGGAAAGACGGCGAAGCGCTGTTCCTCCGGTGCAATAT
+read1 5 length=151
<AAA)FFFAFFFFFFAAAFFF.FFFFFFF<FAFFF<F7FFFFFFF<FFFFFFFF.FFFFFFF.FFFFFFFFF.)F7FFAFFF)F<F)F.FFFFAFFFFAFF<FFAF<F<F)FF7FA)F.FFF<FFF<F7<<..FF.FFF7<F<<.A)FF<<FFFFFF<AFFFFFF)FFFFFFAFF.FFFFFA)FFAFFFFFAFFFFAFFF)7FFFA)FFFFFFFFFFAFFFFAAAFFFFFAAFFAAAAA
This has been raised on the Biopython side, but it doesn't seem that Biopython will allow this, given their reading of the FASTQ spec. Full discussion: https://github.com/biopython/biopython/issues/1898.
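For anyone who wants to check whether their own merged files are affected, here is a minimal sketch in Python. It assumes plain 4-line FASTQ records (no wrapped sequence lines), and the function name `find_caption_mismatches` is just something made up for this example, not part of fastp or Biopython.

```python
import io

def find_caption_mismatches(handle):
    """Return (record_number, header, caption) for every record whose
    '+' line carries a caption that differs from the '@' header.
    Assumes strict 4-line records; an empty caption ('+' alone) is accepted."""
    bad = []
    header = ""
    for i, line in enumerate(handle):
        line = line.rstrip("\r\n")
        if i % 4 == 0:
            header = line[1:]          # text after '@'
        elif i % 4 == 2:
            caption = line[1:]         # text after '+'
            if caption and caption != header:
                bad.append((i // 4 + 1, header, caption))
    return bad

# Shortened version of the record above (sequence and qualities truncated).
example = io.StringIO(
    "@read1 5 length=151 merged_151_88\n"
    "ACGT\n"
    "+read1 5 length=151\n"
    "<AAA\n"
)
print(find_caption_mismatches(example))
```

On the shortened record this reports record 1, since the header has the extra `merged_151_88` field that the + line lacks.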
Thanks for your consideration and great tool!
Agreed, this issue needs revisiting.
I'm using fastp to merge paired-end reads and it's working great, but I have to post-process every output file to fix the incorrect quality captions. It would be great to have the captions match the sequence headers, or to have no caption at all (+\n).
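In case it helps others in the meantime, the post-processing can be roughly sketched like this in Python. It is only a workaround under the assumption that the file has strict 4-line records; `strip_plus_captions` is a hypothetical helper name, and it blanks the caption rather than copying the header.

```python
import io

def strip_plus_captions(src, dst):
    """Copy 4-line FASTQ records from src to dst, replacing every
    '+caption' separator line with a bare '+'.
    Assumes strict 4-line records (no wrapped sequence or quality lines)."""
    for i, line in enumerate(src):
        if i % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {i + 1}: expected '+' separator, got {line!r}")
            line = "+\n"               # drop whatever followed the '+'
        dst.write(line)

# Small in-memory demonstration; real use would pass open file handles.
merged = io.StringIO(
    "@read1 5 length=151 merged_151_88\n"
    "ACGT\n"
    "+read1 5 length=151\n"
    "<AAA\n"
)
fixed = io.StringIO()
strip_plus_captions(merged, fixed)
print(fixed.getvalue(), end="")
```

Only the third line of each record is touched, so a quality string that happens to start with + is left alone.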
Ok, I will revise it soon.