
Feature request: filter by pattern match

Open perinom opened this issue 3 years ago • 3 comments

Thanks for the tool, it's great!

Would it be possible to implement a filtering strategy based on pattern matching on read sequence, to only include/exclude reads based on e.g. regex? Ideally, you could filter by either/both R1 and R2 and choose whether to keep/discard the mate of a matching read.

At the moment I do this with two independent runs of seqkit grep followed by a mate-fixing step, but it would be great to integrate everything into a single fastp call.
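To make the request concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration, not fastp's interface: the function name `filter_pairs` and its parameters (`r1_pattern`, `r2_pattern`, `require`, `invert`) are hypothetical, and reads are assumed to be already parsed into `(name, sequence, quality)` tuples. Pairs are kept or dropped together, which is what makes a separate mate-fixing step unnecessary.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality); assumed layout


def filter_pairs(
    pairs: Iterable[Tuple[Read, Read]],
    r1_pattern: Optional[str] = None,
    r2_pattern: Optional[str] = None,
    require: str = "either",  # "either": any given pattern hits; "both": all must hit
    invert: bool = False,     # False keeps matching pairs, True discards them
) -> Iterator[Tuple[Read, Read]]:
    """Yield (R1, R2) pairs selected by regex matches on the read sequences."""
    rx1 = re.compile(r1_pattern) if r1_pattern else None
    rx2 = re.compile(r2_pattern) if r2_pattern else None
    for r1, r2 in pairs:
        hits = []
        if rx1 is not None:
            hits.append(rx1.search(r1[1]) is not None)
        if rx2 is not None:
            hits.append(rx2.search(r2[1]) is not None)
        if not hits:
            # No pattern given: pass the pair through unchanged.
            yield r1, r2
            continue
        matched = all(hits) if require == "both" else any(hits)
        if matched != invert:
            yield r1, r2
```

Because both mates travel through the filter as one unit, R1 and R2 stay in sync in the output regardless of which read the pattern was applied to.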

perinom avatar Feb 09 '22 16:02 perinom

Could you please give me some examples of the patterns?

sfchen avatar Feb 10 '22 01:02 sfchen

Also, is regex required, or would it be enough to support just N bases?

sfchen avatar Feb 10 '22 01:02 sfchen

Sure!

For the application I have at hand now, for example, I fish for reads that start with an 8 bp UMI (so random bases) followed by a spacer of fixed length (8-20 bp depending on the setup).
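As a rough illustration of such a pattern, the Python sketch below builds a regex for a UMI of a given length followed by a fixed spacer at the start of the read. The spacer sequence `GTACGTAC` and the helper names `umi_spacer_regex` and `extract_umi` are made up for the example, and the UMI is assumed to contain only A, C, G, or T (no N).

```python
import re
from typing import Optional


def umi_spacer_regex(umi_len: int, spacer: str) -> "re.Pattern[str]":
    """Regex for reads starting with `umi_len` random bases, then `spacer`."""
    # The UMI is captured in a named group so it can be pulled out after matching.
    return re.compile(rf"^(?P<umi>[ACGT]{{{umi_len}}}){re.escape(spacer)}")


def extract_umi(seq: str, pattern: "re.Pattern[str]") -> Optional[str]:
    """Return the UMI if the read matches the pattern, otherwise None."""
    m = pattern.match(seq)
    return m.group("umi") if m else None


# Hypothetical 8 bp spacer; real setups use their own 8-20 bp sequence.
pattern = umi_spacer_regex(8, "GTACGTAC")

good = "AACCGGTT" + "GTACGTAC" + "TTTTGGGG"  # UMI + spacer + insert
bad = "AACCGGTT" + "CCCCCCCC" + "TTTTGGGG"   # UMI present, spacer missing

print(extract_umi(good, pattern))
print(extract_umi(bad, pattern))
```

Reads for which `extract_umi` returns `None` would be the ones discarded before the UMI processing step.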

I mentioned regex as an example, but anything would do.
I actually use fastp to process the selected reads with --umi_len and --umi_skip, so for my specific application, reusing part of the code behind those options to filter reads before processing would be enough. I suggested regex to make the filtering option more flexible, but I can imagine it being more work to implement.

perinom avatar Feb 10 '22 07:02 perinom