seqtk mergpe: Pair up uneven paired FASTQ files [feature request]

I have a pair of FASTQ files so that the paired-end reads no longer line up. I pair them up and drop the orphaned reads like so:

cat reads_1.fq reads_2.fq | paste - - - - | sort | tr '\t' '\n' | seqtk dropse >reads.fq

It would be nice to have a command like seqtk mergpe that pairs up uneven paired FASTQ files.

Aug 19 '16 07:08 sjackman

Since your input reads are likely in the same order, with the exception of reads which have been orphaned, you can avoid the costly (time and memory) sort by using process substitution on each input file to make the reads occupy a single line each and then paste the lines alternating from each input file:

paste -d'\n' <(paste - - - - < reads_1.fq) <(paste - - - - < reads_2.fq) \
  | tr '\t' '\n' \
  | seqtk dropse \
> reads.fq

You would only need to use sort in the event the ordering of the input files have been tampered, with other than just dropping of reads to create orphans.

Aug 26 '16 00:08 nathanhaigh

Your command doesn't work as is, but it gave me an idea. It would work if the two FASTQ files were already sorted by read ID, and the first paste command was replaced with a merge sort.

Here's an example where the paste of pastes doesn't work:

❯❯❯ cat read1.fa
>1/1
ACGT
>2/1
ACGT
>3/1
ACGT
>4/1
ACGT
>5/1
ACGT
❯❯❯ cat read2.fa
>1/2
ACGT
>3/2
ACGT
>5/2
ACGT
❯❯❯ paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/2
ACGT
>3/1
ACGT

Note that read 5 is missing from the output, and read 3 is in an atypical order (read 2 then read 1). This works:

❯❯❯ sort -m <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/1
ACGT
>3/2
ACGT
>5/1
ACGT
>5/2
ACGT

Aug 26 '16 01:08 sjackman

I did imply the input reads need to be in the same order :)

The out-of-order and missing read might be because I didn't account for unequal lengths of the two input files which results in empty lines being piped to seqtk dropse. These empty lines might be confusing seqtk dropse and giving the odd behaviour:

paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n'
>1/1
ACGT
>1/2
ACGT
>2/1
ACGT
>3/2
ACGT
>3/1
ACGT
>5/2
ACGT
>4/1
ACGT

>5/1
ACGT

Not sure how much CPU time the merge sort takes (minimal I expect) but it shouldn't be needed if the blank lines are removed. (Maybe you can test this as I haven't got access to seqtk at the moment:

paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | sed '/^$/d' | tr '\t' '\n' | seqtk dropse

Aug 26 '16 02:08 nathanhaigh

I did say in the original post I have a pair of FASTQ files so that the paired-end reads no longer line up. :wink: If the two files have the same number of reads in the same order, the solution is to use seqtk mergepe rather than seqtk dropse.

Blank lines have no effect on seqtk dropse, and removing them likewise has no effect. The merge sort is required for seqtk dropse. The challenge is that Illumina reads are not (I believe) sorted lexicographically, and so sort -m does not work without first sorting the reads lexicographically, which is slow.

Aug 28 '16 19:08 sjackman

I should have paid more attention to your example input read names. You're right. To avoid an explicit sort, a bit more programming logic is required to resolve this.

Aug 28 '16 23:08 nathanhaigh

seqtk mergpe could be made smarter so that it does a merge sort of the two files. It would need an option to specify either a lexicographical sort (like sort -m) or an Illumina read name sort. Then the pipe seqtk mergpe | seqtk dropse would work as desired.

Aug 29 '16 04:08 sjackman

https://github.com/linsalrob/fastq-pair

Jul 25 '19 08:07 alienzj

Hello, how to get PE reads and SE reads from uneven paired FASTQ files just using seqtk?

Jul 25 '19 08:07 alienzj

https://github.com/jvivian/fastq_pair

Jul 25 '19 08:07 alienzj

seqtk seqtk copied to clipboard

mergpe: Pair up uneven paired FASTQ files [feature request]

seqtk
seqtk copied to clipboard