Filter only one read of paired-end experiment but remove corresponding reads in other file [feature request]

Open romanhaa opened this issue 6 years ago • 11 comments

This tool is really fantastic, and I would love to integrate it into my workflows. However, there is one thing that would be very useful. In our single-cell experiments we have paired-end reads, the first containing only barcodes and the second the transcripts. Would it be possible to apply filtering only to the second reads and then remove the corresponding first reads? Otherwise the reads will be out of sync and cause problems downstream.

Thanks!

romanhaa avatar Oct 26 '18 08:10 romanhaa

Could you please paste some reads of your data here?

sfchen avatar Nov 01 '18 06:11 sfchen

No problem, here you go:

R1 (26 bp):

@A00302:38:HG35HDMXX:1:1101:2410:1000 1:N:0:CCACTACA
NTGTAGCTCGATCCCTAGCTTTAGCC
+
#FFFFF:FFFFF:FFFFFFFFFFFFF
@A00302:38:HG35HDMXX:1:1101:2826:1000 1:N:0:CCACTACA
NCTTTCTGTGAGTGACTCTTCTCTTA
+
#FFFFFFFFFFFFFFFFFFFFFFFFF
@A00302:38:HG35HDMXX:1:1101:3531:1000 1:N:0:CCAATACA
NCGTACTAGACCACGAGGGTTATCCT
+
#,,FFFFF:F::F,::FFF,FF:F:,
@A00302:38:HG35HDMXX:1:1101:4833:1000 1:N:0:CCACTACA
NGTGAAGTCTTAGCCCCTGTTTCAGC
+
#FFFFFFFFF,FFFF:FFF:FFFFFF

R2 (91 bp):

@A00302:38:HG35HDMXX:1:1101:2410:1000 2:N:0:CCACTACA
GGGGGAATAAAAAAGTTAAAAAAATAAAAAAAAAAATCTCCCCCAAAAAAACCAAAAAAAAAAACAAAGAAAAAAAGCAAAAAAAATCTTT
+
,F,,,,:,,,::::,:F,:F,F,,,,,:FFFFFF:,:F,,,:,F::,,,,::,:,:FFFFFFF,,,F,:,,,FF,,,,,,F,F,:,,,,,F
@A00302:38:HG35HDMXX:1:1101:2826:1000 2:N:0:CCACTACA
CTCTGTCCTTAAGAAGAGATTGTTACCAAGACTCCAGGCTAGGAGAGATTGCAGTTATCCACCAATCATACAGTGTGCTATGCTTCTGTGC
+
FFFFF::FFFFFF:FF:FF,FFFFFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFF::FFFF
@A00302:38:HG35HDMXX:1:1101:3531:1000 2:N:0:CCAATACA
GTGGAAGATGAGAAACTTCAAGGCAAGATCAATGATTAGGACAAACAGAAGATTCTTGACAAGTGCAATGAAATCATCAGCTGGCTGGAAA
+
FF:FFF,F,FF:FFFFF::F,FF:FF,FFFFFF,:F,F,FF:FF,:FFFFFFFFFFFFFFFFFFFFFFF,FFF:FFFF:,FFF:,:,FF,F
@A00302:38:HG35HDMXX:1:1101:4833:1000 2:N:0:CCACTACA
GATACCTTGGCTGTGGCCACGGACACAAAGGCCACCCGGGCCGTCCACACTGGTCTTGCTGTGGGAAGTTCATTGAGAAGTCCGAGTGCTC
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF:FFFF:FFFFFFFFF:F,FFFFFFFFF,FF,F:FFF,FFFFFFFFFF

I hope 4 reads are enough. The read name up until the space is identical between R1 and R2, the part after that differs only in the first character (1 for R1, 2 for R2).

romanhaa avatar Nov 01 '18 14:11 romanhaa

Hmm, this is not a common feature request and requires some customization.

sfchen avatar Nov 01 '18 14:11 sfchen

Sorry, I think my explanation was a bit poor. Let me try again, if you still think it's too specific I understand.

In single cell RNA-seq experiments it is very common to have paired-end sequencing. The first read contains only barcodes; in the case of 10x Genomics kits (which are very common as well) it is 1 barcode for the cell (16 bp) and 1 barcode for each transcript (10 bp), resulting in a total of 26 bp. It would be silly to apply filtering and trimming to this read because these sequences are artificial and always have to be the same length. The second read, in contrast, comes from the actual transcript, so it makes sense to apply filtering and trimming to it for quality control.

At our institute, we use an Illumina NovaSeq sequencer, which I would say is a fairly common machine. I receive two FASTQ files, one containing all read 1 and the other containing all read 2; of course, the corresponding pairs inside are in the same order.

As described above, if I run fastp on both FASTQ files I run into problems because barcodes are filtered or trimmed. If I instead submit only the FASTQ file of read 2 (from the transcript), the output FASTQ will have fewer reads than the read 1 file (because read 1 wasn't filtered). This will cause problems later on because the paired FASTQ files are expected to be in sync.

When I say 'in sync', I mean that the read pairs are in the same order in their respective FASTQ files, and that only complete pairs are present. That means, if a read was removed from the R2 FASTQ file, the corresponding read in the R1 FASTQ must be removed as well. Otherwise, you'll end up with a different number of reads in the two FASTQ files.
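To make the 'in sync' condition concrete, here is a minimal Python sketch that checks it for two FASTQ streams. It is not part of fastp; the function names `read_names` and `in_sync` are made up for illustration, and it assumes plain 4-line FASTQ records where the read name is the header up to the first space (as in the example reads above).

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name of every record, assuming 4 lines per record.

    The name is the header line without the leading '@', cut at the first
    whitespace, so the '1:N:0:...' / '2:N:0:...' suffix is ignored.
    """
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            yield line[1:].split()[0]


def in_sync(r1_handle, r2_handle):
    """Return True if both streams hold the same read names in the same order."""
    # zip_longest pads the shorter stream with None, so a missing read
    # shows up as a mismatch instead of being silently ignored.
    for name1, name2 in zip_longest(read_names(r1_handle), read_names(r2_handle)):
        if name1 != name2:
            return False
    return True


# Hypothetical toy data: two pairs, then R2 with its first read filtered out.
r1 = "@a 1:N:0:X\nACGT\n+\nFFFF\n@b 1:N:0:X\nACGT\n+\nFFFF\n"
r2 = "@a 2:N:0:X\nTTTT\n+\nFFFF\n@b 2:N:0:X\nTTTT\n+\nFFFF\n"
r2_filtered = "@b 2:N:0:X\nTTTT\n+\nFFFF\n"

print(in_sync(io.StringIO(r1), io.StringIO(r2)))           # True
print(in_sync(io.StringIO(r1), io.StringIO(r2_filtered)))  # False
```

For gzipped files, the same functions should work on handles opened with `gzip.open(path, "rt")`.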

So I was wondering, since your tool is already able to process paired-end FASTQ files and keep them in sync, if it was possible to use both FASTQ files as input for fastp, but then apply the filtering only to the reads of one of the two files (in this case read 2) and just keep the other one in sync. Technically, I assume this shouldn't be a big challenge. I would like to look at this myself but I'm not very familiar with C++ and don't have spare time at the moment.

After running fastp only on read 2 (in single-end mode), I tried to 'sync' the files manually myself but wasn't able to make it work.

Anyway, sorry for the wall of text, but I hope this better explains the situation. As I said, I think it shouldn't be too challenging, and in my opinion a lot of people could benefit from this (also because single cell RNA-seq experiments are becoming more and more standard and almost always require paired-end sequencing, often with one of the reads containing just barcodes).

Thanks for your time 👍

romanhaa avatar Nov 01 '18 16:11 romanhaa

Good, thanks for your clear explanation. I will make it happen in a future release.

sfchen avatar Nov 02 '18 00:11 sfchen

@sfchen Thanks a lot 🙏

romanhaa avatar Nov 02 '18 14:11 romanhaa

I too would like to see this for single cell data.

murphycj2 avatar Jul 11 '19 19:07 murphycj2

I am not sure if this is implemented. There are some flags for UMI processing, but I am not sure if they apply.

Maarten-vd-Sande avatar Nov 27 '20 16:11 Maarten-vd-Sande

Hi there,

I've had the same use case spring up, where I only wanted to trim + filter based on a single biological read, and not the associated barcode, UMI, etc. reads. For anyone interested, I found a pretty easy workaround using fastq-pair.

Assume we have r1.fq and r2.fq. We want to filter both r1 and r2 based on reads in r1:

  • Run fastp on r1.fq to filter as you desire. Output to fastp_r1.fastq
  • Match passing reads in fastp_r1.fastq with reads in r2.fq using fastq-pair.
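If installing fastq-pair is not an option, the matching step can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not how fastq-pair or fastp work internally: `fastq_records` and `sync_to_filtered` are hypothetical names, records are assumed to be plain 4-line FASTQ, and the names of all surviving reads are held in memory as a set.

```python
import io


def fastq_records(handle):
    """Yield (read_name, record_text) for each 4-line FASTQ record."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            return  # end of file (or trailing blank line)
        # Name = header without '@', cut at the first whitespace.
        yield lines[0][1:].split()[0], "".join(lines)


def sync_to_filtered(filtered_handle, other_handle, out_handle):
    """Copy records from other_handle whose names survive in filtered_handle.

    Returns the number of records written. The original order of
    other_handle is preserved, so the output stays in sync with the
    filtered file as long as the filter did not reorder reads.
    """
    kept = {name for name, _ in fastq_records(filtered_handle)}
    written = 0
    for name, record in fastq_records(other_handle):
        if name in kept:
            out_handle.write(record)
            written += 1
    return written


# Hypothetical toy data: the filter kept only read 'b'.
filtered = io.StringIO("@b 1:N:0:X\nACGT\n+\nFFFF\n")
other = io.StringIO("@a 2:N:0:X\nTTTT\n+\nFFFF\n@b 2:N:0:X\nGGGG\n+\nFFFF\n")
synced = io.StringIO()
n_written = sync_to_filtered(filtered, other, synced)
```

With real files, the three handles would be the filtered FASTQ written by fastp, the untouched mate file, and a new output file. For very large runs the in-memory set of names may become a concern, in which case a dedicated tool is probably the better choice.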

Hope this helps!

dakota-hawkins avatar Mar 29 '21 18:03 dakota-hawkins

Yes that's also how I ended up fixing it! Thanks for actually putting the answer here :smile:

Maarten-vd-Sande avatar Mar 29 '21 18:03 Maarten-vd-Sande

I hope this gets integrated into fastp.

Yunuuuu avatar Mar 07 '24 07:03 Yunuuuu