Replace instances of `count_list` with `collections.Counter`
https://github.com/qiyunzhu/woltka/blob/862cab09e3da201cf8db20b4968849950fcc0fd5/woltka/util.py#L190-L206
How large do you expect the lists passed to `count_list` to be here? It seems to me it could be replaced by `Counter` from Python's `collections` module. If the lists will be large, `Counter` appears to offer a speedup:
```
In [34]: from collections import Counter

In [35]: def count_list(lst):
    ...:     """Count occurrences of elements of a list and return a dictionary.
    ...:
    ...:     Parameters
    ...:     ----------
    ...:     lst : list
    ...:         Input list.
    ...:
    ...:     Returns
    ...:     -------
    ...:     dict
    ...:         Element-to-count map.
    ...:     """
    ...:     res = {}
    ...:     for x in lst:
    ...:         res[x] = res.get(x, 0) + 1
    ...:     return res
    ...:

In [36]: import random, string

In [37]: SYMBOLS = string.ascii_uppercase + string.digits

In [38]: SIZE = 100

In [39]: def create_test_set(list_length):
    ...:     for _ in range(SIZE):
    ...:         random_list = random.choices(SYMBOLS, k=list_length)
    ...:         yield random_list
    ...:

In [40]: for list_length in (2**4, 2**8, 2**16):
    ...:     print("\nList length:", list_length)
    ...:     test_set = list(create_test_set(list_length))
    ...:     print("Python Counter:", end=" ")
    ...:     %timeit [Counter(item) for item in test_set]
    ...:     print("count_list:", end=" ")
    ...:     %timeit [count_list(item) for item in test_set]
    ...:

List length: 16
Python Counter: 575 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
count_list: 356 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

List length: 256
Python Counter: 1.77 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
count_list: 4.51 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List length: 65536
Python Counter: 369 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
count_list: 1.13 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Thanks for doing this rigorous benchmarking! Very helpful information!
The function `count_list` groups multiple subjects mapped to one query. In most situations there won't be many of them. For example, in SHOGUN, BURST returns only one subject, and Bowtie2 returns up to 16 subjects; default Bowtie2 returns only one subject. Moreover, the chance that the same subject is mapped to one read multiple times is even lower; it happens only in BLAST-type algorithms, where multiple copies of the same gene are found in one genome. Considering the small number of elements, how about we keep the original `count_list`?
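For illustration, the lists passed to `count_list` in this scenario are typically tiny. A minimal sketch (the subject identifiers below are made up):

```python
# Hypothetical subjects hit by a single query read. With BURST or default
# Bowtie2 this list usually holds one element; duplicates arise mainly with
# BLAST-type tools when a genome carries multiple copies of the same gene.
subjects = ['G000006785', 'G000012345', 'G000006785']

def count_list(lst):
    """Original pure-Python helper from woltka/util.py."""
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

print(count_list(subjects))  # {'G000006785': 2, 'G000012345': 1}
```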
Hmmm. I might still propose to replace `count_list` with `Counter` to cut down on extra code, as this functionality already ships with Python, and any code we write is ultimately code we have to maintain. But this can remain an open minor issue for the future.
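For reference, a minimal sketch of what the replacement could look like, assuming callers only need a dict-like element-to-count mapping (which `Counter`, being a `dict` subclass, provides):

```python
from collections import Counter

def count_list(lst):
    # Counter is a dict subclass, so callers that index or iterate over the
    # returned mapping should keep working unchanged.
    return Counter(lst)

# Equality with a plain dict holds, since Counter compares like a dict.
assert count_list(['a', 'b', 'a']) == {'a': 2, 'b': 1}
```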
I understand. `Counter` could potentially be very helpful in optimizing other parts of this program. We can track down these other cases. If we have to use `Counter` in at least one place, there is more reason to use `Counter` in all cases.
@gwarmstrong I am coming back to this question. In issue #37, I created a realistic test dataset, based on which I performed a survey and some benchmarks.
Now there are three solutions:
- Solution 1: the original, pure-Python solution:

  ```python
  def count_list(lst):
      res = {}
      for x in lst:
          res[x] = res.get(x, 0) + 1
      return res
  ```

- Solution 2: `Counter` from Python's `collections` module.

- Solution 3: based on the original solution, but with the plain `dict` replaced by `defaultdict`:

  ```python
  from collections import defaultdict

  def count_list(lst):
      res = defaultdict(int)
      for x in lst:
          res[x] += 1
      return res
  ```
Theoretically, `defaultdict` has a higher initial cost than `dict`, but as the number of identical keys grows, `defaultdict` will outperform `dict`.
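A quick way to probe that claim on synthetic data (the list lengths and key pool below are arbitrary, not the real Woltka inputs):

```python
import random
import string
import timeit
from collections import defaultdict

def count_dict(lst):
    # Original approach: plain dict with .get()
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

def count_defaultdict(lst):
    # Hybrid approach: defaultdict(int) skips the explicit .get() lookup
    res = defaultdict(int)
    for x in lst:
        res[x] += 1
    return res

for length in (4, 256, 65536):
    lst = random.choices(string.ascii_uppercase, k=length)
    t_dict = timeit.timeit(lambda: count_dict(lst), number=1000)
    t_dd = timeit.timeit(lambda: count_defaultdict(lst), number=1000)
    print(f"length {length:>6}: dict {t_dict:.4f} s, defaultdict {t_dd:.4f} s")
```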
I first tested the gOTU protocol on the five input alignment files. Runtimes (min:sec):

- Solution 1: 1:14.29
- Solution 2: 1:17.31
- Solution 3: 1:16.42

Therefore, the choice of `count_list` implementation has a modest but measurable impact on the overall runtime of Woltka. The original solution appears to be the fastest.
I then summarized the ambiguous alignments in sample S01. Of all 1,342,487 alignments, 424,284 (31.6%) are ambiguous. They belong to 134,066 query sequences (~3.16 hits per sequence).

Note: unique alignments don't go through the `count_list` function; therefore they are omitted in the following benchmark.
I ran the three solutions against each of the 134,066 lists. Result:
- Solution 1: 84.1 ms ± 250 µs
- Solution 2: 268 ms ± 330 µs
- Solution 3: 116 ms ± 286 µs
Again, the original pure-Python solution is significantly faster than `Counter` and the `defaultdict` hybrid. Therefore, I propose to keep the original solution.
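For anyone who wants to approximate this per-list comparison without the S01 alignments, a rough sketch over synthetic short lists (~3 hits per query, mirroring the average above; the real benchmark used the actual 134,066 subject lists):

```python
import random
import timeit
from collections import Counter, defaultdict

# Synthetic stand-in for the per-query subject lists of sample S01.
random.seed(42)
lists = [random.choices(range(1000), k=random.randint(2, 5))
         for _ in range(134066)]

def count_dict(lst):           # Solution 1: original pure-Python dict
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

def count_counter(lst):        # Solution 2: collections.Counter
    return Counter(lst)

def count_defaultdict(lst):    # Solution 3: defaultdict-based hybrid
    res = defaultdict(int)
    for x in lst:
        res[x] += 1
    return res

for name, func in (('dict', count_dict), ('Counter', count_counter),
                   ('defaultdict', count_defaultdict)):
    elapsed = timeit.timeit(lambda: [func(lst) for lst in lists], number=1)
    print(f"{name:>11}: {elapsed:.3f} s")
```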