woltka
woltka copied to clipboard
Benchmarks of LCA functions
Following @wasade 's suggestions in PR #50 as well as other thoughts, I tested multiple options of the find_lca
function. Benchmarks were performed on a Bowtie2 alignment file of 100,000 lines against the WoL database. Summary:
- f0: original version: 184 ms ± 2.52 ms
- f1: alternative separation: 183 ms ± 483 µs
- f2: index range instead of slice: 193 ms ± 743 µs
- f3: dict for fast lookup: 243 ms ± 616 µs
- f4: loop to avoid error: 199 ms ± 498 µs
- f5: comprehension instead of loop: 224 ms ± 1.43 ms
It appears that the original version is almost the best. No option I could think about notably improved performance. This includes the dict solution (f3), which appears to be the slowest, likely due to the overhead of converting the lineage list into a dict. Therefore, eventually I didn't make any change to the function.
Original version:
def f0(taxa, tree):
itaxa = iter(taxa)
lineage = get_lineage(next(itaxa), tree)
if lineage is None:
return
for taxon in itaxa:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
try:
idx = lineage.index(this)
except ValueError:
if parent == this:
break
this = parent
parent = tree[this]
else:
if idx + 1 < len(lineage):
lineage = lineage[slice(idx + 1)]
break
return lineage[-1]
Alternative way of separating a random element from the remaining elements in a frozenset:
def f1(taxa, tree):
first, *rest = taxa
lineage = get_lineage(first, tree)
if lineage is None:
return
for taxon in rest:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
try:
idx = lineage.index(this)
except ValueError:
if parent == this:
break
this = parent
parent = tree[this]
else:
if idx + 1 < len(lineage):
lineage = lineage[slice(idx + 1)]
break
return lineage[-1]
Use index range instead of slice for list:
def f2(taxa, tree):
itaxa = iter(taxa)
lineage = get_lineage(next(itaxa), tree)
if lineage is None:
return
cidx = len(lineage)
get_idx = lineage.index
for taxon in itaxa:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
try:
idx = get_idx(this, 0, cidx)
except ValueError:
if parent == this:
break
this = parent
parent = tree[this]
else:
cidx = min(cidx, idx + 1)
break
return lineage[cidx - 1]
Covert list to a dict of taxon to index to accelerate lookup:
def f3(taxa, tree):
itaxa = iter(taxa)
lineage = get_lineage(next(itaxa), tree)
if lineage is None:
return
cidx = len(lineage)
idxdic = dict(zip(lineage, range(cidx)))
getidx = idxdic.get
for taxon in itaxa:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
idx = getidx(this)
if idx is None or idx >= cidx:
if parent == this:
break
this = parent
parent = tree[this]
else:
cidx = min(cidx, idx + 1)
break
return lineage[cidx - 1]
Note: I benchmarked several options for converting a list to a dict:
v1: {taxon: i for i, taxon in enumerate(lst)}
: 78.6 ms ± 480 µs
v2: dict(zip(subs, range(len(subs))))
: 50 ms ± 179 µs
v3: dict(map(reversed, enumerate(lst)))
: 250 ms ± 2.51 ms
So I chose v2.
Use a loop instead of list.index
to avoid error raising:
def f4(taxa, tree):
itaxa = iter(taxa)
lineage = get_lineage(next(itaxa), tree)
if lineage is None:
return
cidx = len(lineage)
for taxon in itaxa:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
idx = None
for i, t in enumerate(lineage[:cidx]):
if this == t:
idx = i
break
if idx is None:
if parent == this:
break
this = parent
parent = tree[this]
else:
cidx = min(cidx, idx + 1)
break
return lineage[cidx - 1]
Use list comprehension to replace loop:
def f5(taxa, tree):
itaxa = iter(taxa)
lineage = get_lineage(next(itaxa), tree)
if lineage is None:
return
cidx = len(lineage)
for taxon in itaxa:
try:
parent = tree[taxon]
except KeyError:
return
this = taxon
while True:
idx = next((i for i, t in enumerate(lineage[:cidx])
if t == this), None)
if idx is None:
if parent == this:
break
this = parent
parent = tree[this]
else:
cidx = min(cidx, idx + 1)
break
return lineage[cidx - 1]
Exceptions are expensive. Have you tried not using then?
I'm having a really hard time understanding what these functions are doing. Why is a tree necessary? I don't fully understand the data types here, but assuming taxa
look something like:
taxa = [['a', 'b', 'c', 'd'],
['a', 'b', 'c', 'e'],
['a', 'b', 'x']]
then why not do something like:
# determine the shortest lineage
max_length = min([len(lin) for lin in taxa])
last = None
for position in range(max_length):
observed = {lin[position] for lin in taxa}
if len(observed) > 1:
return last
else:
last = list(observed)[0]
return last
It's also not obvious why a list.index
type operation is necessary since it should be possible to walk through the lineages in step.
@wasade : Thank you for examining my code and proposing your suggestion! Very helpful!
The variable taxa
is a frozenset of strings, each of which is a taxon ID and the mission of this function is to find the LCA of them in the classification hierarchy. Unlike SHOGUN and other programs, taxa
in Woltka does not contain the lineage information in itself, but the lineage information needs to be looked up from the hierarchy. The function get_lineage
does that.
Now assume we did map(partial(get_lineage, tree=tree), taxa)
to obtain lineages for all taxa. The outcome will resemble the data structure you typed. However, there is another important difference from a Woltka lineage to a SHOGUN/QIIME/GTDB/MetaPhlAn2 lineage: That the Woltka lineage does not require fixed number of levels. For example, the following lineage is valid:
cellular organisms; Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo; Homo sapiens
Therefore, the length of a lineage cannot be used to determine LCA.
We once discussed this question in your office. I was suggesting that flexible levels is preferrable for network-like classification systems as well as some non-taxonomy systems.
That being said, Woltka's design is compatible with fixed lineages. When using the same GG-style taxonomy file as input, SHOGUN and Woltka free-rank classification produce exactly the same result.
Finally, the cost of exceptions: I also have the same impression that exceptions are expensive. Therefore wherever it is possible to replace try...except..
with something else, I gave it a try, and benchmarked on one or several real datasets. Results are mixed. Here is another example where try..except..
is cheaper than alternatives in my benchmarks.
To my knowledge, Python's list
does not have a exception-free index
method (unlike str.find
), and there isn't a better way to achieve this. But perhaps it's just that I am not aware yet...
After some intense efforts of optimization, the runtime of Woltka is significantly shrinked in the upgrade branch, and the runtime of no classification (gOTU) vs. free-rank classification are now very close. Therefore I think that it's probably not most urgent to further optimize this part.
No classification:
Free-rank classification:
Okay thanks @qiyunzhu. These functions are very difficult to follow... It may be advantageous to improve the readability through judicious comments.
It looks like what may help is reversing lineage
up front on the assumption that most LCAs will be species/genus etc, than phylum/class. If accurate it should in effect reduce the number of comparisons made during the linear searches.
try/excepts are faster than a conditional if the exception block is very infrequently encountered. It seems surprising that these are infrequent here given the code. They also have the effect of complicating the interpretation of the code.
@wasade Thank you! Your suggestion of reversing the lineage sounds interesting. Let me give it a try.
I benchmarked the program on a set of SHOGUN / Bowtie2 alignment files generated from CAMISIM metagenomes against WoL database. I think this is close to the most realistic scenario.
I can add more details to docstrings & comments to make the logics clear for those important functions.