fastp
Feature request: filter by pattern match
Thanks for the tool, it's great!
Would it be possible to implement a filtering strategy based on pattern matching on read sequence, to only include/exclude reads based on e.g. regex? Ideally, you could filter by either/both R1 and R2 and choose whether to keep/discard the mate of a matching read.
At the moment I do this with two independent runs of seqkit grep followed by a mate-fixing step, but it would be great to integrate everything into a single fastp call.
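To make the request concrete, here is a minimal Python sketch of the filtering logic described above. It is not fastp code and does not correspond to any existing fastp option; the function name `filter_pairs` and its parameters are hypothetical. It assumes reads are already parsed into `(name, sequence, quality)` tuples and that mates are always kept or dropped together, so R1 and R2 stay in sync and no mate-fixing step is needed.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality); hypothetical record layout


def filter_pairs(
    pairs: Iterable[Tuple[Read, Read]],
    r1_pattern: Optional[str] = None,
    r2_pattern: Optional[str] = None,
    require_both: bool = False,
    invert: bool = False,
) -> Iterator[Tuple[Read, Read]]:
    """Yield read pairs whose sequences match the given regex pattern(s).

    require_both: when both patterns are given, require R1 AND R2 to match
                  (default: a match on either mate is enough).
    invert:       discard matching pairs instead of keeping them.
    """
    r1_re = re.compile(r1_pattern) if r1_pattern else None
    r2_re = re.compile(r2_pattern) if r2_pattern else None

    for r1, r2 in pairs:
        hits = []
        if r1_re is not None:
            hits.append(r1_re.search(r1[1]) is not None)
        if r2_re is not None:
            hits.append(r2_re.search(r2[1]) is not None)

        if not hits:
            # No pattern supplied: nothing to filter on, pass the pair through.
            yield r1, r2
            continue

        matched = all(hits) if require_both else any(hits)
        if matched != invert:
            yield r1, r2


# Toy input: three read pairs with made-up sequences.
example_pairs = [
    (("a", "ACGTACGT", "IIIIIIII"), ("a", "TTTT", "IIII")),
    (("b", "GGGG", "IIII"), ("b", "ACGTCC", "IIIIII")),
    (("c", "CCCC", "IIII"), ("c", "CCCC", "IIII")),
]
kept_names = [r1[0] for r1, _ in filter_pairs(example_pairs, r1_pattern="^ACGT")]
```

Keeping mates together is a design choice of this sketch; an option that keeps a matching read but drops its mate would have to write singletons to a separate output.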
Could you please give me some examples of the patterns?
And is regex required, or would it be enough to support just the N base as a wildcard?
Sure!
For the application I have at hand now, for example, I fish for reads that start with an 8 bp UMI (i.e. random bases) followed by a spacer of fixed length (8-20 bp, depending on the setup).
I mentioned regexp as an example but anything would do.
I actually use fastp to process the selected reads with --umi_len and --umi_skip, so for my specific application it would be enough to reuse part of the code behind those options to filter reads before processing.
I thought of regex to make the filtering option more flexible, but I can imagine it would be more elaborate to implement.
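As an illustration of the "N base only" alternative, the UMI-plus-spacer pattern above could be written as a plain string where N stands for any base, and translated into a regex internally. The sketch below is an assumption about how that might look, not an existing fastp feature; the helper name `n_wildcard_to_regex` and the spacer sequence `GTCAGTCA` are made up for the example.

```python
import re


def n_wildcard_to_regex(pattern: str, anchored: bool = True) -> re.Pattern:
    """Translate a pattern of A/C/G/T/N into a compiled regex.

    N is treated as a wildcard matching any base (including an N call
    in the read itself). With anchored=True the pattern must match at
    the start of the read.
    """
    parts = []
    for base in pattern.upper():
        if base == "N":
            parts.append("[ACGTN]")
        elif base in "ACGT":
            parts.append(base)
        else:
            raise ValueError(f"unsupported character in pattern: {base!r}")
    return re.compile(("^" if anchored else "") + "".join(parts))


UMI_LEN = 8
SPACER = "GTCAGTCA"  # hypothetical spacer; the real one depends on the setup

# Eight wildcard bases (the UMI) followed by the fixed spacer.
read_filter = n_wildcard_to_regex("N" * UMI_LEN + SPACER)

has_umi_and_spacer = read_filter.search("ACGTACGT" + SPACER + "TTTT") is not None
lacks_spacer = read_filter.search("ACGTACGT" + "AAAAAAAA" + "TTTT") is None
```

For this use case the N-only form seems sufficient; full regex would only be needed for things like variable-length spacers or alternative sequences.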