fastp
Low complexity filter and contaminant removal
Thanks for a super nice tool. I wonder if the following two functionalities could be included in fastp.
- A low-complexity filter using methods such as entropy or DUST. The current filter does not work well on tandem repeats and similar types of low-complexity sequence.
- Removal of contaminants such as PhiX. I always see some PhiX remaining in the files that come from the sequencing provider, and having this in fastp would remove it in the same pass as the other steps the tool performs. Separate mapping, e.g. with Bowtie2, is quite slow.
Thanks for considering these options. Kind regards, Martin
Yep, I agree, this would be nice. I just came across this issue while wondering whether fastp could remove low-complexity tandem repeats. And yes, fastp is a great tool! I use it for most of my QC and trimming needs.
Same problem here as in point 1, with tandem repeats being kept. It would be great to have other methods (entropy or DUST) to remove low-complexity reads. I'm not sure about DUST, but I think entropy should not be too difficult. As far as I can see, it would be something like:
Sequence:
AGAGAGAGAGAGAGAGAGAG
Entropy
p(A) = 10/20 = 0.5
p(C) = 0
p(G) = 10/20 = 0.5
p(T) = 0
Entropy = - 0.5*log10(0.5) - 0.5*log10(0.5) ≈ 0.30 (log base 10; the maximum for four equally frequent bases is log10(4) ≈ 0.60)
Complexity
Complexity = 19/(20-1) = 1 (all 19 adjacent base pairs differ, so the read looks maximally complex)
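The two numbers above can be reproduced with a few lines of Python. This is only a minimal sketch to illustrate the arithmetic, not fastp code: the function names `shannon_entropy` and `neighbour_complexity` are made up for this example, and the entropy uses log base 10 because that is what gives the 0.3 figure.

```python
import math
from collections import Counter

def shannon_entropy(seq, base=10):
    """Shannon entropy of the base composition of seq (hypothetical helper)."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq.upper())
    # Sum -p * log(p) over the bases that actually occur in the read.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def neighbour_complexity(seq):
    """Fraction of bases that differ from the next base (hypothetical helper)."""
    if len(seq) < 2:
        return 0.0
    diffs = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return diffs / (len(seq) - 1)

read = "AGAGAGAGAGAGAGAGAGAG"
print(round(shannon_entropy(read), 2))   # entropy of the tandem repeat
print(neighbour_complexity(read))        # neighbour-based complexity
```

For the tandem repeat, the neighbour-based value is at its maximum while the entropy is only about half of the maximum possible for four bases, which is why an entropy threshold could flag reads that the neighbour-based measure lets through.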
Both could also be calculated on sliding windows (as in BBDuk) to account for partial low complexity.
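A sliding-window variant could look roughly like the sketch below. It is an assumption-laden illustration, not BBDuk's or fastp's actual algorithm: the window size, the threshold, and the names `window_entropies` and `has_low_complexity_window` are all chosen only for this example.

```python
import math
from collections import Counter

def entropy(seq, base=10):
    """Shannon entropy of the base composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(seq).values())

def window_entropies(seq, window=10, base=10):
    """Entropy of every window of the given size (step 1) along the read."""
    seq = seq.upper()
    if len(seq) <= window:
        # Read shorter than one window: score the whole read once.
        return [entropy(seq, base)] if seq else []
    return [entropy(seq[i:i + window], base)
            for i in range(len(seq) - window + 1)]

def has_low_complexity_window(seq, window=10, threshold=0.2, base=10):
    """True if any window falls below the (illustrative) entropy threshold."""
    return any(h < threshold for h in window_entropies(seq, window, base))

# A read whose second half is a homopolymer run.
read = "ACGT" * 5 + "A" * 20
print(has_low_complexity_window(read))        # True: the poly-A windows score 0
print(has_low_complexity_window("ACGT" * 5))  # False: every window is mixed
```

Scoring windows rather than the whole read means a read that is only partly low complexity can still be caught, or trimmed at the offending window instead of being discarded entirely.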
I just wanted to echo the suggestions above. I think being able to filter low-complexity reads on the fly, using DUST and/or entropy, would make fastp even more valuable.