seqtk
seqtk copied to clipboard
mergpe: Pair up uneven paired FASTQ files [feature request]
I have a pair of FASTQ files so that the paired-end reads no longer line up. I pair them up and drop the orphaned reads like so:
cat reads_1.fq reads_2.fq | paste - - - - | sort | tr '\t' '\n' | seqtk dropse >reads.fq
It would be nice to have a command like seqtk mergpe
that pairs up uneven paired FASTQ files.
Since your input reads are likely in the same order, with the exception of reads which have been orphaned, you can avoid the costly (time and memory) sort
by using process substitution on each input file to make the reads occupy a single line each and then paste the lines alternating from each input file:
paste -d'\n' <(paste - - - - < reads_1.fq) <(paste - - - - < reads_2.fq) \
| tr '\t' '\n' \
| seqtk dropse \
> reads.fq
You would only need to use sort
in the event the ordering of the input files have been tampered, with other than just dropping of reads to create orphans.
Your command doesn't work as is, but it gave me an idea. It would work if the two FASTQ files were already sorted by read ID, and the first paste command was replaced with a merge sort.
Here's an example where the paste of pastes doesn't work:
❯❯❯ cat read1.fa
>1/1
ACGT
>2/1
ACGT
>3/1
ACGT
>4/1
ACGT
>5/1
ACGT
❯❯❯ cat read2.fa
>1/2
ACGT
>3/2
ACGT
>5/2
ACGT
❯❯❯ paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/2
ACGT
>3/1
ACGT
Note that read 5 is missing from the output, and read 3 is in an atypical order (read 2 then read 1). This works:
❯❯❯ sort -m <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n' | seqtk dropse
>1/1
ACGT
>1/2
ACGT
>3/1
ACGT
>3/2
ACGT
>5/1
ACGT
>5/2
ACGT
I did imply the input reads need to be in the same order :)
The out-of-order and missing read might be because I didn't account for unequal lengths of the two input files which results in empty lines being piped to seqtk dropse
. These empty lines might be confusing seqtk dropse
and giving the odd behaviour:
paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | tr '\t' '\n'
>1/1
ACGT
>1/2
ACGT
>2/1
ACGT
>3/2
ACGT
>3/1
ACGT
>5/2
ACGT
>4/1
ACGT
>5/1
ACGT
Not sure how much CPU time the merge sort takes (minimal I expect) but it shouldn't be needed if the blank lines are removed. (Maybe you can test this as I haven't got access to seqtk
at the moment:
paste -d '\n' <(paste - - <read1.fa) <(paste - - <read2.fa) | sed '/^$/d' | tr '\t' '\n' | seqtk dropse
I did say in the original post I have a pair of FASTQ files so that the paired-end reads no longer line up.
:wink: If the two files have the same number of reads in the same order, the solution is to use seqtk mergepe
rather than seqtk dropse
.
Blank lines have no effect on seqtk dropse
, and removing them likewise has no effect. The merge sort is required for seqtk dropse
. The challenge is that Illumina reads are not (I believe) sorted lexicographically, and so sort -m
does not work without first sorting the reads lexicographically, which is slow.
I should have paid more attention to your example input read names. You're right. To avoid an explicit sort, a bit more programming logic is required to resolve this.
seqtk mergpe
could be made smarter so that it does a merge sort of the two files. It would need an option to specify either a lexicographical sort (like sort -m
) or an Illumina read name sort. Then the pipe seqtk mergpe | seqtk dropse
would work as desired.
https://github.com/linsalrob/fastq-pair
Hello, how to get PE reads and SE reads from uneven paired FASTQ files just using seqtk?
https://github.com/jvivian/fastq_pair