woltka Benchmarks of LCA functions

Following @wasade 's suggestions in PR #50 as well as other thoughts, I tested multiple options of the find_lca function. Benchmarks were performed on a Bowtie2 alignment file of 100,000 lines against the WoL database. Summary:

f0: original version: 184 ms ± 2.52 ms
f1: alternative separation: 183 ms ± 483 µs
f2: index range instead of slice: 193 ms ± 743 µs
f3: dict for fast lookup: 243 ms ± 616 µs
f4: loop to avoid error: 199 ms ± 498 µs
f5: comprehension instead of loop: 224 ms ± 1.43 ms

It appears that the original version is almost the best. No option I could think about notably improved performance. This includes the dict solution (f3), which appears to be the slowest, likely due to the overhead of converting the lineage list into a dict. Therefore, eventually I didn't make any change to the function.

Original version:

def f0(taxa, tree):
    itaxa = iter(taxa)
    lineage = get_lineage(next(itaxa), tree)
    if lineage is None:
        return
    for taxon in itaxa:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            try:
                idx = lineage.index(this)
            except ValueError:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                if idx + 1 < len(lineage):
                    lineage = lineage[slice(idx + 1)]
                break
    return lineage[-1]

Alternative way of separating a random element from the remaining elements in a frozenset:

def f1(taxa, tree):
    first, *rest = taxa
    lineage = get_lineage(first, tree)
    if lineage is None:
        return
    for taxon in rest:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            try:
                idx = lineage.index(this)
            except ValueError:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                if idx + 1 < len(lineage):
                    lineage = lineage[slice(idx + 1)]
                break
    return lineage[-1]

Use index range instead of slice for list:

def f2(taxa, tree):
    itaxa = iter(taxa)
    lineage = get_lineage(next(itaxa), tree)
    if lineage is None:
        return
    cidx = len(lineage)
    get_idx = lineage.index
    for taxon in itaxa:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            try:
                idx = get_idx(this, 0, cidx)
            except ValueError:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                cidx = min(cidx, idx + 1)
                break
    return lineage[cidx - 1]

Covert list to a dict of taxon to index to accelerate lookup:

def f3(taxa, tree):
    itaxa = iter(taxa)
    lineage = get_lineage(next(itaxa), tree)
    if lineage is None:
        return
    cidx = len(lineage)
    idxdic = dict(zip(lineage, range(cidx)))
    getidx = idxdic.get
    for taxon in itaxa:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            idx = getidx(this)
            if idx is None or idx >= cidx:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                cidx = min(cidx, idx + 1)
                break
    return lineage[cidx - 1]

Note: I benchmarked several options for converting a list to a dict:

v1: {taxon: i for i, taxon in enumerate(lst)}: 78.6 ms ± 480 µs v2: dict(zip(subs, range(len(subs)))): 50 ms ± 179 µs v3: dict(map(reversed, enumerate(lst))): 250 ms ± 2.51 ms

So I chose v2.

Use a loop instead of list.index to avoid error raising:

def f4(taxa, tree):
    itaxa = iter(taxa)
    lineage = get_lineage(next(itaxa), tree)
    if lineage is None:
        return
    cidx = len(lineage)
    for taxon in itaxa:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            idx = None
            for i, t in enumerate(lineage[:cidx]):
                if this == t:
                    idx = i
                    break
            if idx is None:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                cidx = min(cidx, idx + 1)
                break
    return lineage[cidx - 1]

Use list comprehension to replace loop:

def f5(taxa, tree):
    itaxa = iter(taxa)
    lineage = get_lineage(next(itaxa), tree)
    if lineage is None:
        return
    cidx = len(lineage)
    for taxon in itaxa:
        try:
            parent = tree[taxon]
        except KeyError:
            return
        this = taxon
        while True:
            idx = next((i for i, t in enumerate(lineage[:cidx])
                        if t == this), None)
            if idx is None:
                if parent == this:
                    break
                this = parent
                parent = tree[this]
            else:
                cidx = min(cidx, idx + 1)
                break
    return lineage[cidx - 1]

Jun 28 '20 08:06 qiyunzhu

Exceptions are expensive. Have you tried not using then?

I'm having a really hard time understanding what these functions are doing. Why is a tree necessary? I don't fully understand the data types here, but assuming taxa look something like:

taxa = [['a', 'b', 'c', 'd'],
        ['a', 'b', 'c', 'e'],
        ['a', 'b', 'x']]

then why not do something like:

# determine the shortest lineage
max_length = min([len(lin) for lin in taxa])
last = None
for position in range(max_length):
    observed = {lin[position] for lin in taxa}
    if len(observed) > 1:
        return last
    else:
        last = list(observed)[0]
return last

It's also not obvious why a list.index type operation is necessary since it should be possible to walk through the lineages in step.

Jun 28 '20 20:06 wasade

@wasade : Thank you for examining my code and proposing your suggestion! Very helpful!

The variable taxa is a frozenset of strings, each of which is a taxon ID and the mission of this function is to find the LCA of them in the classification hierarchy. Unlike SHOGUN and other programs, taxa in Woltka does not contain the lineage information in itself, but the lineage information needs to be looked up from the hierarchy. The function get_lineage does that.

Now assume we did map(partial(get_lineage, tree=tree), taxa) to obtain lineages for all taxa. The outcome will resemble the data structure you typed. However, there is another important difference from a Woltka lineage to a SHOGUN/QIIME/GTDB/MetaPhlAn2 lineage: That the Woltka lineage does not require fixed number of levels. For example, the following lineage is valid:

cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo; Homo sapiens

Therefore, the length of a lineage cannot be used to determine LCA.

We once discussed this question in your office. I was suggesting that flexible levels is preferrable for network-like classification systems as well as some non-taxonomy systems.

That being said, Woltka's design is compatible with fixed lineages. When using the same GG-style taxonomy file as input, SHOGUN and Woltka free-rank classification produce exactly the same result.

Finally, the cost of exceptions: I also have the same impression that exceptions are expensive. Therefore wherever it is possible to replace try...except.. with something else, I gave it a try, and benchmarked on one or several real datasets. Results are mixed. Here is another example where try..except.. is cheaper than alternatives in my benchmarks.

To my knowledge, Python's list does not have a exception-free index method (unlike str.find), and there isn't a better way to achieve this. But perhaps it's just that I am not aware yet...

After some intense efforts of optimization, the runtime of Woltka is significantly shrinked in the upgrade branch, and the runtime of no classification (gOTU) vs. free-rank classification are now very close. Therefore I think that it's probably not most urgent to further optimize this part.

No classification: none

Free-rank classification: free

Jun 29 '20 06:06 qiyunzhu

Okay thanks @qiyunzhu. These functions are very difficult to follow... It may be advantageous to improve the readability through judicious comments.

It looks like what may help is reversing lineage up front on the assumption that most LCAs will be species/genus etc, than phylum/class. If accurate it should in effect reduce the number of comparisons made during the linear searches.

try/excepts are faster than a conditional if the exception block is very infrequently encountered. It seems surprising that these are infrequent here given the code. They also have the effect of complicating the interpretation of the code.

Jun 30 '20 17:06 wasade

@wasade Thank you! Your suggestion of reversing the lineage sounds interesting. Let me give it a try.

I benchmarked the program on a set of SHOGUN / Bowtie2 alignment files generated from CAMISIM metagenomes against WoL database. I think this is close to the most realistic scenario.

I can add more details to docstrings & comments to make the logics clear for those important functions.

Jun 30 '20 18:06 qiyunzhu

woltka woltka copied to clipboard

Benchmarks of LCA functions

woltka
woltka copied to clipboard