Low complexity filter and contaminant removal

Open martin-hartmann opened this issue 4 years ago • 3 comments

Thanks for a super nice tool. I wonder if the following two functionalities could be included in fastp.

  1. A low-complexity filter using methods such as entropy or DUST. The current filter does not work well on tandem repeats and similar types of low-complexity sequence.

  2. Removal of contaminants like PhiX. I always see some level of PhiX remaining in the files that come from the sequencing provider, and having this in fastp would remove it in the same pass as the other steps the tool performs. Separate mapping, e.g. via Bowtie2, is quite slow.

Thanks for considering these options. Kind regards, Martin

martin-hartmann avatar Apr 15 '21 16:04 martin-hartmann

Yep, I agree, this would be nice. I just came across this issue while wondering whether fastp could remove low-complexity tandem repeats. And yes, fastp is a great tool! I use it for most of my QC and trimming needs.

hermidalc avatar Aug 20 '22 12:08 hermidalc

Same problem here as in point 1, with tandem repeats being kept. It would be great to have other methods (entropy or DUST) to remove low-complexity reads. Not sure about DUST, but I think entropy should not be too difficult. As far as I can see, it would be something like:

Sequence:

AGAGAGAGAGAGAGAGAGAG

Entropy

p(A) = 10/20 = 0.5
p(C) = 0
p(G) = 10/20 = 0.5
p(T) = 0

Entropy = - 0.5*log10(0.5) - 0.5*log10(0.5) ≈ 0.3

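Complexity (fastp's current measure: the fraction of bases that differ from the next base)

Complexity = 19/(20-1) = 1

So this tandem repeat gets the maximum complexity score and passes the current filter.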
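To make the two numbers above easy to reproduce, here is a minimal Python sketch. It assumes log base 10 for the entropy (which is what gives 0.3) and takes complexity to be the fraction of adjacent base pairs that differ. The function names `shannon_entropy` and `adjacent_complexity` are made up for illustration and are not part of fastp.

```python
import math
from collections import Counter


def shannon_entropy(seq, base=10):
    """Shannon entropy of the single-base composition of seq.

    base=10 is an assumption chosen to match the worked example;
    base=2 would give the result in bits instead.
    """
    n = len(seq)
    counts = Counter(seq)
    # Bases that never occur contribute nothing, so only observed bases are summed.
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def adjacent_complexity(seq):
    """Fraction of positions whose base differs from the next base."""
    if len(seq) < 2:
        return 0.0
    differing = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    return differing / (len(seq) - 1)


read = "AG" * 10  # AGAGAGAGAGAGAGAGAGAG
print(round(shannon_entropy(read), 3))  # 0.301
print(adjacent_complexity(read))        # 1.0
```

For comparison, a read with all four bases in equal proportion would have an entropy of log10(4) ≈ 0.602, so the repeat scores half of the maximum on entropy while scoring the maximum on the adjacent-base measure.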

Both could also be calculated on sliding windows (like in BBduk), to account for partial low complexity.
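A sliding-window version could look roughly like the sketch below. It is only an illustration: the window size, the k-mer length `k`, the use of log base 2, and the names `kmer_entropy` and `min_window_entropy` are all assumptions, not the actual options or implementation of BBDuk or fastp. The idea is to score each window separately and keep the lowest score, so that a low-complexity stretch is not hidden by a complex part of the same read.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer composition of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    n = len(kmers)
    if n == 0:
        return 0.0
    counts = Counter(kmers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def min_window_entropy(seq, window=20, k=1):
    """Lowest k-mer entropy over all windows of the given size.

    Reads no longer than one window are scored as a whole.
    """
    if len(seq) <= window:
        return kmer_entropy(seq, k)
    return min(
        kmer_entropy(seq[i:i + window], k)
        for i in range(len(seq) - window + 1)
    )


# Hypothetical read: a mixed 20-base prefix followed by a 20-base AG repeat.
read = "ACGTTGCAAGCTTAGCCATG" + "AG" * 10

whole = kmer_entropy(read)                     # entropy of the full read
lowest = min_window_entropy(read, window=20)   # entropy of the worst window
print(whole > lowest)  # True
```

A filter would then compare the lowest window score against a user-chosen threshold. Recomputing every window from scratch is quadratic in the worst case; a real implementation would more likely update the counts incrementally as the window slides.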

fgvieira avatar Oct 30 '23 09:10 fgvieira

I just wanted to echo the suggestions above. I think being able to filter low-complexity reads on the fly, using DUST and/or entropy, would make fastp even more valuable.

miwipe avatar Feb 20 '24 13:02 miwipe