vsearch Improve chimera detection

Improve chimera detection

Open torognes opened this issue 8 years ago • 11 comments

Sep 22 '15 15:09 torognes

The wiki page on chimera detection mentions that chimera detection is monothreaded, and is a bottleneck in amplicon analysis.

To make chimera detection faster, we can modify the uchime algorithm in two ways:

1 - Uchime assumes that parent sequences are at least twice as abundant as their chimera (parent abundance >= chimera abundance x 2). Under that rule, all sequences with the same abundance can be processed in parallel. As low-abundance sequences (abundance of 1, 2 or 3) usually make up the bulk of datasets, processing them in parallel should yield a substantial speed-up.

2 - If we change that rule to parent abundance >= chimera abundance + 1, then the number of sequences we can deal with in parallel is even higher, and sensitivity should also increase. On the other hand, the number of potential parents increases too.
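To make the two rules concrete, here is a minimal Python sketch (not vsearch code). It groups sequences into abundance tiers and lists the candidate parents of a query under each rule. The function names and the toy abundance table are made up for illustration.

```python
from collections import defaultdict


def abundance_tiers(abundances):
    """Group sequence ids by abundance, most abundant tier first.

    Under either rule, sequences in the same tier cannot be parents of
    each other, so a tier is a natural unit of parallel work.
    """
    tiers = defaultdict(list)
    for seq_id, size in abundances.items():
        tiers[size].append(seq_id)
    return sorted(tiers.items(), reverse=True)


def candidate_parents(abundances, query_id, rule="skew"):
    """Return ids abundant enough to be parents of query_id.

    rule="skew":     parent abundance >= 2 * query abundance (rule 1)
    rule="plus_one": parent abundance >= query abundance + 1 (rule 2)
    """
    q = abundances[query_id]
    threshold = 2 * q if rule == "skew" else q + 1
    return sorted(s for s, size in abundances.items() if size >= threshold)


# Hypothetical abundance table (sequence id -> abundance).
toy = {"a": 10, "b": 4, "f": 3, "c": 2, "d": 2, "e": 1}
```

With this toy table, "c" and "d" share a tier, and "f" becomes a candidate parent of "c" only under the second rule, which shows both the sensitivity gain and the larger parent pool.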

To increase sensitivity, we can also change the way uchime treats sequences tagged as chimeras. As of now, the algorithm removes them from downstream comparisons (i.e. chimeras are never considered as potential parents). The objective was to speed up analyses by decreasing the number of potential parents, but a side effect is to make multithreading difficult (parent searches depend on the result of other parent searches). If we remove that rule, we can detect chimeras of chimeras (sensitivity increase) and parallelize parent searches (speed-up).
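A toy Python sketch of why dropping that rule makes the searches independent: if chimeras stay in the parent pool, each query only reads the fixed input, so the queries can be mapped over a worker pool in any order. The check used here (`is_toy_chimera`) is a deliberately naive stand-in, not the UCHIME scoring, and all names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def is_toy_chimera(query_id, seqs, sizes):
    """Naive stand-in for a parent search (not UCHIME scoring).

    The query is flagged if it equals a prefix of one more abundant
    sequence joined to the suffix of another one of the same length.
    """
    q = seqs[query_id]
    parents = [s for s in seqs
               if sizes[s] > sizes[query_id] and len(seqs[s]) == len(q)]
    for a in parents:
        for b in parents:
            if a == b or q == seqs[a] or q == seqs[b]:
                continue
            for cut in range(1, len(q)):
                if q == seqs[a][:cut] + seqs[b][cut:]:
                    return True
    return False


def classify_parallel(seqs, sizes, workers=4):
    """Classify every sequence independently; no result feeds another."""
    ids = list(seqs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(lambda i: is_toy_chimera(i, seqs, sizes), ids))
    return dict(zip(ids, flags))


# "y" is a chimera of the chimera "x" and the parent "p1".
toy_seqs = {"p1": "AAAAAAAA", "p2": "CCCCCCCC", "x": "AAAACCCC", "y": "AAAACCAA"}
toy_sizes = {"p1": 10, "p2": 8, "x": 3, "y": 1}
```

In this example "y" is only flagged because "x" is still allowed as a parent. Threads are used here just to show the independence of the searches; in CPython, CPU-bound work would need processes (or native threads, as in a C++ implementation) to actually run faster.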

Mar 07 '16 09:03 frederic-mahe

Parallel --uchime_denovo would be excellent!

parent abundance >= chimera abundance + 1

Uh, I think that's what the UPARSE clustering algorithm already does!

The command -cluster_otus makes use of -uparse_ref internally and it could be running exactly like this, with each existing OTU centroid considered as a potential parent (instead of using a 2x abundance skew like --uchime_denovo). This increased sensitivity could explain why uparse is so good at preventing OTU inflation. It's also in line with Edgar's relentless suppression of false positives.

Thanks for discussing this! Colin

Mar 07 '16 18:03 colinbrislawn

Hi, I would like to know the current state of the chimera detection tool and to help improve its implementation. Is this process still single-threaded? What do you use to align the sequences? Do you have a set of sequences to test it?

Dec 28 '16 14:12 fortizc

Here is the repo for vsearch test data.

Looks like the original UCHIME paper describes using the SIM2, MOCK, and SIMM data sets as inputs, and SIMM is already in the vsearch-data repo!

Edgar has recently updated this algorithm to UCHIME2. He created the CHSIMA database to test it but did not publish this database on his site.

Dec 28 '16 18:12 colinbrislawn

Thanks @colinbrislawn for the answer. The file chimera.cc still refers to UCHIME and to this link: http://dx.doi.org/10.1093/bioinformatics/btr381 (you can check this here). Maybe this needs to be updated.

Dec 29 '16 19:12 fortizc

Ah, thanks for posting the link to the vsearch implementation of the uchime algorithm.

If an 'improved' chimera checking algorithm were to be added to vsearch, I'm not sure if it would be best to reimplement uchime2 as closely as possible, implement something else, or just focus on improving the threading of the current implementation. I trust the discretion of Mahe and Rognes.

Dec 30 '16 17:12 colinbrislawn

Hi @fortizc, thanks for your interest in vsearch. Chimera detection is still performed following the original UCHIME algorithm. As mentioned above, earlier this year I suggested modifications that would allow for parallel processing, but there were more urgent things to implement first.

As a general note, chimera detection is very important and the fact that I have so little time to work on it makes me sad. If you want to experiment with vsearch's source code, please feel free to go ahead.

Jan 04 '17 11:01 frederic-mahe

Hi @frederic-mahe, thanks for your answer. I will try to implement threading in uchime, mainly in the "de novo" mode. When I have some improvements, I will send a PR ;-)

Jan 04 '17 15:01 fortizc

Chimera detection in very long (PacBio) sequences could be improved with the ability to detect multiple breakpoints. See the discussion on the VSEARCH Forum:

https://groups.google.com/d/msg/vsearch-forum/uwfYbFpOeJ4/sQeYEOg8AQAJ

Sep 29 '17 10:09 torognes

I have now added a new experimental algorithm for de novo chimera detection with the command --chimeras_denovo. It is designed to work with PacBio HiFi reads and similar long reads.

It expects "exact" sequences and tries to find 2 or 3 (may be adjusted with --chimeras_parents_max) potential parental sequences that together can cover the entire sequence perfectly. Each region covered by a parental sequence must be at least 10 bp long (by default, may be adjusted with --chimeras_length_min).

It will partition each query sequence into many sequence parts that are initially each searched against all the more abundant sequences. In contrast to the UCHIME algorithm, which always used 4 parts, the new algorithm will divide the sequence into up to 100 parts. By default it will divide it into (length / 100) parts, but it may be defined with the --chimeras_parts option. For each part, it will keep the 4 most similar candidate parents.
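The partitioning rule can be sketched in Python as follows. This is an illustration only, not the vsearch source. Three things are assumptions: length / 100 is taken as integer division, at least one part is always produced, and the parts are contiguous and of near-equal size.

```python
def number_of_parts(length, parts=None, max_parts=100):
    """Number of parts for a sequence of the given length.

    parts=None mimics the default (length / 100); an explicit value
    mimics --chimeras_parts. The floor of 1 part is an assumption.
    """
    if parts is None:
        parts = length // 100
    return max(1, min(max_parts, parts))


def split_into_parts(seq, parts=None):
    """Split seq into contiguous, near-equal parts as (offset, subsequence)."""
    n = number_of_parts(len(seq), parts)
    base, extra = divmod(len(seq), n)
    out, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        out.append((start, seq[start:end]))
        start = end
    return out
```

Under these assumptions, a 1450 bp read is cut into 14 parts of 103 or 104 bp, while any read of 10,000 bp or more hits the ceiling of 100 parts.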

It will then try to cover as much of the query sequence as possible with the parental sequences, starting with the one giving the longest coverage, and then adding parents with shorter coverage. There must be a perfect match in the covered regions. The entire sequence must be covered by 2 or 3 parents.
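The acceptance criterion (a perfect cover by 2 or 3 parents, each region at least 10 bp long) can be illustrated with a small Python sketch. Note that this is a brute-force check over small parent sets, not the greedy longest-coverage-first procedure described above. It also assumes that the query and the candidate parents have the same length and are compared position by position without gaps. All names are hypothetical.

```python
from itertools import combinations


def match_runs(query, parent):
    """run[i] = number of consecutive identical positions starting at i."""
    n = len(query)
    run = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        if query[i] == parent[i]:
            run[i] = run[i + 1] + 1
    return run


def can_cover(query, runs, min_len):
    """True if query splits into regions >= min_len, each matching one parent."""
    n = len(query)
    reach = [False] * (n + 1)  # reach[i]: the first i positions are covered
    reach[0] = True
    for i in range(n):
        if not reach[i]:
            continue
        for run in runs:
            for length in range(min_len, run[i] + 1):
                reach[i + length] = True
    return reach[n]


def find_parent_set(query, parents, parents_max=3, min_len=10):
    """Return the smallest set of parent names covering query, or None."""
    runs = {name: match_runs(query, seq)
            for name, seq in parents.items() if len(seq) == len(query)}
    if any(run[0] == len(query) for run in runs.values()):
        return None  # identical to a single candidate: not a chimera
    for k in range(2, parents_max + 1):
        for names in combinations(sorted(runs), k):
            if can_cover(query, [runs[name] for name in names], min_len):
                return names
    return None
```

For example, with three hypothetical 30 bp parents made of a single repeated base each, a query of 12 A followed by 18 C is covered by two parents, whereas 25 A followed by 5 C is rejected because the second region is shorter than the 10 bp minimum.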

The option names and the output files have been changed slightly from the corresponding UCHIME files.

Below are the relevant options.

Chimera detection with new algorithm
  --chimeras_denovo FILENAME  detect chimeras de novo in long exact sequences
 Parameters
  --abskew REAL               minimum abundance ratio (1.0)
  --chimeras_length_min       minimum length of each chimeric region (10)
  --chimeras_parents_max      maximum number of parent sequences (3)
  --chimeras_parts            number of parts to divide sequences (length/100)
  --sizein                    propagate abundance annotation from input
 Output
  --alignwidth INT            width of alignments in alignment output file (60)
  --alnout FILENAME           output chimera alignments to file
  --chimeras FILENAME         output chimeric sequences to file
  --nonchimeras FILENAME      output non-chimeric sequences to file
  --relabel STRING            relabel nonchimeras with this prefix string
  --relabel_keep              keep the old label after the new when relabelling
  --relabel_md5               relabel with md5 digest of normalized sequence
  --relabel_self              relabel with the sequence itself as label
  --relabel_sha1              relabel with sha1 digest of normalized sequence
  --sizeout                   include abundance information when relabelling
  --tabbedout FILENAME        output chimera info to tab-separated file
  --xsize                     strip abundance information in output
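As a usage illustration, the following Python helper assembles a command line from the options listed above. The file names are hypothetical, the option combination has not been checked against a vsearch release, and the helper only builds the argument list (it does not run vsearch).

```python
import shlex


def chimeras_denovo_command(fasta, prefix, parents_max=3, length_min=10):
    """Build an argument list for --chimeras_denovo (hypothetical file names)."""
    return [
        "vsearch",
        "--chimeras_denovo", fasta,
        "--sizein",
        "--chimeras_parents_max", str(parents_max),
        "--chimeras_length_min", str(length_min),
        "--chimeras", prefix + ".chimeras.fasta",
        "--nonchimeras", prefix + ".nonchimeras.fasta",
        "--tabbedout", prefix + ".chimeras.tsv",
    ]


def as_shell_line(args):
    """Render the argument list as a single, safely quoted shell line."""
    return shlex.join(args)
```

The list returned by `chimeras_denovo_command("reads.fasta", "out")` could be passed to `subprocess.run`, or printed with `as_shell_line` to copy into a terminal.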

The new algorithm has only been tested briefly, but so far it seems to be able to detect more relevant chimeras than the older algorithms.

Feb 24 '23 16:02 torognes

This is great!

I think I'll finish writing additional black-box tests for --derep_fulllength, then I'll try to write more tests for this new chimera detection command. Some tests could later be used to set up a formal benchmark comparing the different uchime variants available today.

Feb 27 '23 19:02 frederic-mahe
