Fastq files with MS-DOS end-line break

Open Erlor opened this issue 5 years ago • 4 comments

Hello. I have been running fastp and it has worked so far, but recently I ran a sample where roughly 100K out of 1M reads were trimmed before fastp stopped with the message:

ERROR: sequence and quality have different length:
@read name
Read Sequence

+

I dug a bit into the fastqreader code, and there seems to be an off-by-one shift here: the separator line is now empty, and the separator instead occupies the line meant for the quality string.

I found that the file has ^M$ at the end of each line, indicating that it was saved on a Windows machine. Judging by the getline function in fastqreader, you tried to deal with this, but the problem seems to persist.
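
To illustrate the kind of line-ending handling discussed here, the following is a minimal Python sketch, not fastp's actual C++ code. The helper name `read_fastq_records` is hypothetical. It strips a trailing `\r` from every line before comparing sequence and quality lengths, so LF and CRLF files are treated the same way:

```python
import io

def read_fastq_records(handle):
    """Yield (name, sequence, quality) tuples from a binary FASTQ handle.

    Hypothetical helper for illustration; tolerates both LF and CRLF.
    """
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # clean end of file
        # Strip "\n" and, if present, the "\r" left by Windows line endings.
        name, seq, sep, qual = (line.rstrip(b"\r\n") for line in lines)
        if not sep.startswith(b"+"):
            raise ValueError(f"bad separator line for {name!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence and quality have different length: {name!r}")
        yield name, seq, qual

# The same record with Windows (CRLF) and Unix (LF) line endings.
crlf_records = list(read_fastq_records(io.BytesIO(b"@r1\r\nACGT\r\n+\r\nIIII\r\n")))
lf_records = list(read_fastq_records(io.BytesIO(b"@r1\nACGT\n+\nIIII\n")))
```

Both inputs parse to the same record because the carriage return is removed before any length check.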

Erlor avatar Jan 21 '20 16:01 Erlor

Can you paste the output of od -c < your.fastq | head -n 100 (or zcat your.fastq.gz | od -c | head -n 100 in case your FASTQ file is compressed)?
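
If a full od dump is too long to paste, a rough summary may be enough. The sketch below counts CRLF-terminated versus LF-only lines; `count_line_endings` is a hypothetical helper, and the in-memory sample stands in for a real file:

```python
import io

def count_line_endings(handle):
    """Return (crlf_lines, lf_only_lines) for a binary file-like object."""
    crlf = lf_only = 0
    for line in handle:  # binary handles split lines on b"\n"
        if line.endswith(b"\r\n"):
            crlf += 1
        elif line.endswith(b"\n"):
            lf_only += 1
    return crlf, lf_only

# Three CRLF lines followed by one LF-only line.
sample = io.BytesIO(b"@r1\r\nACGT\r\n+\r\nIIII\n")
counts = count_line_endings(sample)
```

For a real file, pass `open(path, "rb")`, or `gzip.open(path, "rb")` from the standard library when the FASTQ is compressed.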

mschilli87 avatar Jan 21 '20 17:01 mschilli87

The output was quite lengthy. I also checked whether the error can be reproduced another way, and it can: run any FASTQ file through unix2dos (which basically adds a \r to each line ending).
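
For anyone without unix2dos at hand, a test input can be produced with a few lines of Python. This is a sketch that only mimics the line-ending conversion described above (the function name `to_crlf` is made up for illustration), not a full reimplementation of unix2dos:

```python
def to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF, leaving existing CRLF endings unchanged."""
    # Normalise to LF first so already-converted lines do not become "\r\r\n".
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

lf_fastq = b"@r1\nACGT\n+\nIIII\n"
crlf_fastq = to_crlf(lf_fastq)
```

Writing `crlf_fastq` to a file in binary mode gives a small FASTQ with the same `\r \n` pattern as in the dump below.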

0000000   @   S   R   R   8   5   6   1   4   1   3   .   1       1   /
0000020   1  \r  \n   A   T   G   G   C   G   G   C   G   G   C   G   G
0000040   G   C   C   T   G   G   C   G   G   A   A   C   T   G   C   T
0000060   G   G   G   C   G   G   A   A   G   C   C   C   G   A   C   G
0000100   C   A   G   G   T   G   T   G   C   A   T   C   G   C   G   G
0000120   C   T   G   A   A   A   T   C   G   G   C   A   T   G   G   A
0000140   A   C   A   T   A   A   C   C   T   T   G   G   T   T   A   A
0000160   C   C   T  \r  \n   +  \r  \n   C   C   D   A   C   B   6   ;
0000200   ;   7   ;   ;   ;   1   ;   6   ;   ;   6   ;   ;   6   ;   6
0000220   ;   2   :   :   2   2   2   *   2   2   .   2   .   2   ;   C
0000240   9   ;   ;   ;   ;   0   0   /   +   .   .   -   -   4   4   B
0000260   C   C   C   C   B   ?   >   >   C   D   >   C   C   C   A   C
0000300   D   F   F   >   >   ;   :   :   :   8   2   8   2   8   2   1
0000320   )   1   )   1   3   3   7   .  \r  \n   @   S   R   R   8   5
0000340   6   1   4   1   3   .   2       2   /   1  \r  \n   G   A   C
0000360   T   G   A   A   G   C   A   G   G   G   C   A   G   C   T   C
0000400   T   A   C   T   T   T   G   A   G   C   G   G   T   G   C   A
0000420   G   G   C   G   T   A   T   T   G   T   G   G   A   T   G   A
0000440   A   G   C   G   A   G   G   C   T   G   G   C   A   C   A   T
0000460   G   A   A   C   A   G   T   T   A   C   C   G   G   G   G   T
0000500   G   A   T   A   G   T   G   A   T   A   G   A   C   G   A   C
0000520   C   T   C   G   G   G   T   A   T   G   T   C   A   A   A   C
0000540   G   C   G   A   C   A   A   C   G   C   G   G   A   A   A   C
0000560   G   G   G   G   T   G   T   G   T   T   G   T   T   C   G   A
0000600   G   G   C   G   T   G   A   T   C   G   C   A   C   A   T   C
0000620   G   T   T   A   T   G   A   G   C   G   C   G   G   C   A   G
0000640   T   C   T   G   G   T   C   G   A   T   C   A   C   C   A   G
0000660   C   A   A   T   C   A   G   T   C   C   G   T   T   C   A   G
0000700   C   A   C   G   T   G   G   G   G   G   T   A   G   C   A   T
0000720   C   T   T   C   G   T   G   G   A   T   G   A   A   A   C   G
0000740   G   A   T   G   G   C   G   G   T   G   G   C   A   G   C   A
0000760   G   C   C   G   A   T   C   G   T   C   T   G   A   T   C   C
0001000   A   G   T   A   C   G   G   G   G   T   A   T   A   T   G   T
0001020   T   C   G   A   G   C   T   A   A   A   A   A   G   G   A   G
0001040   A   A   A   G   T   T   A   C   C   G   G   A   A   A   A   A
0001060   G   A   C   T   G   G   C   A   A   G   G   C   A   G   T   A
0001100   A   C   C   A   G   C   G   T   G   G  \r  \n   +  \r  \n   >
0001120   @   @   D   C   C   =   @   @   ;   ;   ;   1   5   4   ;   ;
0001140   ;   ;   @   ;   ;   :   9   *   :   B   B   ;   ;   7   ;   ;
0001160   <   /   /   *   -   .   .   ;   ;   8   ;   ;   ;   6   ;   @
0001200   @   A   >   C   >   ?   ;   A   =   @   :   ;   7   ;   ;   6
0001220   6   ;   ;   ;   >   C   ;   6   ;   7   ;   D   0   ;   5   ;
0001240   ,   5   4   C   C   C   D   C   C   C   C   ?   ;   ;   A   @
0001260   ;   ;   6   ;   :   :   :   /   9   B   @   :   2   2   2   :
0001300   /   9   A   :   :   :   :   :   <   :   :   :   B   4   :   9
0001320   /   2   :   :   :   *   2   :   9   :   @   >   :   8   5   /
0001340   /   -   /   '   -   -   -   -   -   3   9   9   9   ?   :   :
0001360   :   :   3   :   1   .   /   /   -   2   7   7   <   ?   9   >
0001400   ?   ?   ?   @   @   @   ?   ;   :   :   :   :   :   :   :   4
0001420   9   ?   8   8   2   :   @   >   ?   ?   B   ;   ?   ?   =   A
0001440   A   @   A   :   8   :   :   8   8   8   8   *   8   8   8   8
0001460   7   :   =   C   3   3   :   1   1   2   -   -   -   -   4   -
0001500   -   -   '   -   8   7   2   8   B   <   >   ?   ;   ?   ;   ;
0001520   ;   ;   ;   C   ?   C   9   9   0   0   0   >   8   <   7   0
0001540   0   *   0   /   /   /   /   /   /   /   (   8   7   1   1   1
0001560   1   1   )   1   1   1   1   1   1   1   3   3   3   0   ;   ;
0001600   @   @   @   D   1   :   :   =   B   B   =   A   =   A   C   C
0001620   C   3   :   0   0   0   0   ,   /   0   5   :   *   0   ;   :
0001640   <   =   7   <   9   ?   ?   <   7   7   7   )  \r  \n   @   S
0001660   R   R   8   5   6   1   4   1   3   .   3       3   /   1  \r
0001700  \n   T   C   C   C   T   T   C   A   T   A   C   T   G   C   A
0001720   C   G   T   A   G   A   G   C   T   G   C   C   G   C   A   G
0001740   T   T   C   A   T   C   G   G   C   A   T   A   A   G   C   C
0001760   T   G   C   A   G   G   A   A   T   T   C   C   G   G   G   G
0002000   T   A   A   A   G   A   C   G   T   C   A   C   G   A   C   G
0002020   G   T   G   C   T   G   C   A   A   C   C   A   G   C   G   C
0002040   T   G   C   T   G   C   T   C   G   G   C   T   T   C   G   T
0002060   C   C   A   G   C   G   T   G   C   C   G   G   G   G   A   A
0002100   A   T   T   A   C   G   C   G   C   C   C   G   A   T   A   G
0002120   T   T   G   A   A   C   A   G   C   A   G   T   T   T   C   T
0002140   C   G   A   T   G   C   G   T   T   T   A   T   C   G   G   C
0002160   A   A   A   C   G   T   G   A   G   G   T   C   C   A   G   C
0002200   G   C   G   G   G   C   A   G   A   T   T   G   C   G   T   G
0002220   C   T   C   G   G   T   C   T   C   C   A   G   C   A   G   G
0002240   A   T   T   T   T   C   A   T   C   G   C   C   G   C   G   C
0002260   G   G   T   C   C   G   C   A   T   C   G   C   T   G   A   A
0002300   G   A   A   A   C   C   A   T   T   G   T   A   G   A   G   C
0002320   T   G   G   G   T   G   T   C   G   A   C   G   T   T   T   T
0002340   C   C   G   A   C   G   G   C   A   C   A   A   A   A   G   G
0002360   C   T   C   C   G   C   C   T   C   G   G   C   A   A   A   A
0002400   G   T   C   G   C   C   A   C   G   A   C   T   T   G   C   G
0002420   C   G   T   A   C   C   G   T   G   C   G   G   T   A   T   C
0002440   G   C   G   C   A  \r  \n   +  \r  \n   0   6   6   ,   6   0
0002460   6   ;   ;   ;   ;   >   D   C   B   C   C   B   ;   ;   ;   :
0002500   :   :   :   :   4   :   :   :   :   :   4   :   @   =   =   ?
0002520   @   ?   C   D   G   A   D   C   D   @   >   <   A   A   7   ;
0002540   7   <   7   <   7   ;   ;   ;   /   ;   B   B   ;   B   B   ;
0002560   >   ?   ?   ?   >   C   D   B   @   ;   ;   ;   ;   ;   >   D
0002600   D   C   F   A   D   D   C   C   C   @   A   @   >   <   A   A
0002620   ;   ;   7   ;   @   =   ;   ;   >   B   ?   C   C   C   C   C
0002640   C   E   @   B   B   B   .   :   :   /   :   5   :   :   :   =
0002660   :   :   :   /   =   B   @   @   @   C   5   :   :   5   :   :
0002700   :   :   :   :   B   B   7   B   B   <   8   :   :   @   @   :
0002720   :   :   /   :   8   8   8   2   8   C   C   >   C   C   C   B
0002740   @   @   7   ;   ;   7   ;   ;   ;   ;   @   @   @   :   ?   >
0002760   >   >   D   A   C   C   C   D   C   C   C   C   C   @   C   C
0003000   B   C   <   @   :   :   :   :   5   :   8   8   8   *   8   =
0003020   <   <   <   B   5   ;   ;   ;   :   8   =   8   8   4   ;   :
0003040   :   B   B   >   ?   ?   ?   ?   ;   ?   9   9   3   9   3   7
0003060   7   *   0   0   0   0   0   0   8   8   :   :   -   0   :   :

Erlor avatar Jan 22 '20 16:01 Erlor

My best guess is that one of the following lines should match the other, as they seem to be supposed to do the same job but differ:

https://github.com/OpenGene/fastp/blob/e01e9402c3d5afded49b21c8303be51d7cbb2d27/src/fastqreader.cpp#L116-L118

https://github.com/OpenGene/fastp/blob/e01e9402c3d5afded49b21c8303be51d7cbb2d27/src/fastqreader.cpp#L145-L147

Right now I'm not in the state of mind to dig deeper but it looks like the latter is the older version and the former was changed fixing https://github.com/OpenGene/fastp/issues/133 in https://github.com/OpenGene/fastp/commit/e01e9402c3d5afded49b21c8303be51d7cbb2d27.

Maybe this gives @sfchen an idea what's happening or maybe it helps someone else that has time to tackle this.

mschilli87 avatar Jan 22 '20 16:01 mschilli87

Hi, I have a similar error.

My FASTQ file is also in Windows format, with \r\n line endings. I removed every mBuf[end-1]=='\r' and mBuf[end]=='\r' check in getLine(), and it works well.
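
A workaround that avoids patching fastp would be to normalise the line endings before running it, which is what dos2unix does. The sketch below shows the idea in Python; `strip_carriage_returns` is a hypothetical helper and the in-memory buffers stand in for real input and output files:

```python
import io

def strip_carriage_returns(src, dst):
    """Copy src to dst, rewriting CRLF line endings as LF.

    Returns the number of lines that were changed.
    """
    changed = 0
    for line in src:
        if line.endswith(b"\r\n"):
            line = line[:-2] + b"\n"
            changed += 1
        dst.write(line)
    return changed

src = io.BytesIO(b"@r1\r\nACGT\r\n+\r\nIIII\r\n")
dst = io.BytesIO()
n_changed = strip_carriage_returns(src, dst)
```

The copy is done line by line, so it does not need to hold the whole FASTQ in memory.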

rocpengliu avatar Aug 21 '21 00:08 rocpengliu