
Changing ID during merging leads to 'corrupt' fastq-files

Open hovercat opened this issue 4 years ago • 3 comments

Hello there, thank you for fastp. I have used it for roughly a year and am very happy with it.

However, there is one thing that bugs me. When fastp merges paired-end files, it also changes the description on the ID line, while the + line keeps the old one. I couldn't find an option to turn this behaviour off. When I tried to run cutadapt on the merged output, it refused, because the description on the + line is not the same as the ID line. To be honest, I think cutadapt should print a warning and be done with it. They are kind of right though: according to most sources, + should be followed either by the same ID or by nothing but a \n.

Please let me know what you think, all the best!

hovercat avatar Apr 22 '21 13:04 hovercat

@opengene and @sfchen,

Hoping you can (re)visit this, as it is actually pretty critical: it breaks popular downstream tools such as biopython. The merged output looks like this, which is technically invalid FASTQ:

@read1 5 length=151 merged_151_88
ACGTAGGCTCGGCGAAGAAGAACACGACCAGCCGCCGAACCCAGGCGGACGCAGGAGGAAATTGTGGCTGGTGACACCACCATCACCATCGTCGGAAATCTGACCGCTGACCCCGAGCTGCGGTTCACCCCGACCGGTGCGGCCGTGGCGAATTTCACCGTGGCGTCAACGCCCCGGATCTATGACCGTCAGACCGGCGAATGGAAAGACGGCGAAGCGCTGTTCCTCCGGTGCAATAT
+read1 5 length=151
<AAA)FFFAFFFFFFAAAFFF.FFFFFFF<FAFFF<F7FFFFFFF<FFFFFFFF.FFFFFFF.FFFFFFFFF.)F7FFAFFF)F<F)F.FFFFAFFFFAFF<FFAF<F<F)FF7FA)F.FFF<FFF<F7<<..FF.FFF7<F<<.A)FF<<FFFFFF<AFFFFFF)FFFFFFAFF.FFFFFA)FFAFFFFFAFFFFAFFF)7FFFA)FFFFFFFFFFAFFFFAAAFFFFFAAFFAAAAA

This has been visited on the biopython side, but it doesn't seem like biopython will allow this given their spec of FASTQ. Full discussion: https://github.com/biopython/biopython/issues/1898.
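For anyone who wants to check whether their own merged output is affected, here is a minimal Python sketch. It assumes plain four-line records (no wrapped sequence or quality lines); the function name `mismatched_records` is made up for illustration and is not part of fastp or biopython.

```python
from itertools import islice


def mismatched_records(lines):
    """Yield (record_index, header, plus_line) for every record whose '+'
    line carries a caption that differs from the '@' header.

    Assumes strict four-line FASTQ records. A bare '+' line is treated
    as valid and is not reported.
    """
    it = iter(lines)
    index = 0
    while True:
        # Read one record: header, sequence, '+' line, quality.
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            break
        if len(record) < 4:
            raise ValueError(f"truncated record at index {index}")
        header, _seq, plus, _qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record {index} is not four-line FASTQ")
        caption = plus[1:]
        # Only a non-empty caption can disagree with the header.
        if caption and caption != header[1:]:
            yield index, header, plus
        index += 1
```

Run on the record above, this would report record 0, because `read1 5 length=151` on the + line differs from `read1 5 length=151 merged_151_88` on the ID line.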

Thanks for your consideration and great tool!

schorlton avatar Apr 07 '22 03:04 schorlton

Agreed, this issue needs revisiting. I'm using fastp to merge paired-end reads and it's working great but I'm having to parse every output to fix the incorrect quality captions. Would be great to have the captions match the sequence ones or just have no caption (+\n).
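In case it helps others in the meantime, the kind of post-processing I mean can be sketched roughly like this in Python. It is only a workaround sketch, not anything fastp provides: it assumes strict four-line records, and `strip_plus_captions` is a hypothetical helper name. It replaces every + line with a bare +, which sidesteps the mismatch entirely.

```python
def strip_plus_captions(lines):
    """Yield the lines of a four-line FASTQ stream with every '+' line
    reduced to a bare '+'.

    The '+' line is located by position (third line of each record),
    not by its first character, because quality strings may themselves
    start with '+' or '@'.
    """
    for i, line in enumerate(lines):
        if i % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {i + 1} is not a '+' separator line")
            # Keep a trailing newline only if the input line had one.
            yield "+" + ("\n" if line.endswith("\n") else "")
        else:
            yield line
```

For example, `sys.stdout.writelines(strip_plus_captions(sys.stdin))` (after `import sys`) would filter a decompressed merged file passed through a pipe.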

jvfe avatar Apr 26 '23 00:04 jvfe

Ok, I will revise it soon.

sfchen avatar Apr 26 '23 00:04 sfchen