
Replace instances of `count_list` with `collections.Counter`

Open gwarmstrong opened this issue 4 years ago • 4 comments

https://github.com/qiyunzhu/woltka/blob/862cab09e3da201cf8db20b4968849950fcc0fd5/woltka/util.py#L190-L206

How large do you expect the lists passed to count_list to be? It seems to me it could be replaced by Counter from Python's collections. For large lists, Counter appears to offer a speedup.

In [33]: import random

In [34]: from collections import Counter

In [35]: def count_list(lst):
    ...:     """Count occurrences of elements of a list and return a dictionary.
    ...:
    ...:     Parameters
    ...:     ----------
    ...:     lst : list
    ...:         Input list.
    ...:
    ...:     Returns
    ...:     -------
    ...:     dict
    ...:         Element-to-count map.
    ...:     """
    ...:     res = {}
    ...:     for x in lst:
    ...:         res[x] = res.get(x, 0) + 1
    ...:     return res
    ...:

In [36]: import string

In [37]: SYMBOLS = string.ascii_uppercase + string.digits

In [38]: SIZE = 100

In [39]: def create_test_set(list_length):
    ...:     for _ in range(SIZE):
    ...:         random_list = random.choices(SYMBOLS, k=list_length)
    ...:         yield random_list
    ...:

In [40]: for list_length in (2**4, 2**8, 2**16):
    ...:     print("\nList length:", list_length)
    ...:     test_set = list(create_test_set(list_length))
    ...:     print("Python Counter:", end=" ")
    ...:     %timeit [Counter(item) for item in test_set]
    ...:     print("count_list:", end=" ")
    ...:     %timeit [count_list(item) for item in test_set]
    ...:

List length: 16
Python Counter: 575 µs ± 10 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
count_list: 356 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

List length: 256
Python Counter: 1.77 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
count_list: 4.51 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

List length: 65536
Python Counter: 369 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
count_list: 1.13 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

gwarmstrong avatar Mar 17 '20 23:03 gwarmstrong

Thanks for doing this rigorous benchmarking! Very helpful information!

The function count_list serves to group multiple subjects mapped to one query. In most situations there won't be many of them. For example, in SHOGUN, BURST returns only one subject, and Bowtie2 returns up to 16 subjects. With default settings, Bowtie2 returns only one subject. Moreover, the chance that the same subject is mapped to one read multiple times is even lower; it happens only in BLAST-type algorithms, where multiple copies of the same gene are found in one genome. Considering the small numbers involved, how about we keep the original count_list?
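For concreteness, a minimal sketch of the scenario described above (the subject IDs here are hypothetical, invented for illustration):

from collections import Counter

# Hypothetical subjects that a single query read mapped to; with default
# Bowtie2 there would typically be just one.
subjects = ['G000123', 'G000456', 'G000123']

# What count_list produces: a plain dict of subject -> occurrence count.
counts = {}
for x in subjects:
    counts[x] = counts.get(x, 0) + 1

# Counter yields the same mapping.
assert counts == dict(Counter(subjects))  # {'G000123': 2, 'G000456': 1}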

qiyunzhu avatar Mar 19 '20 17:03 qiyunzhu

Hmmm. I might still propose replacing count_list with Counter to cut down on extra code, since this functionality already ships with Python, and any code we write is ultimately code we have to maintain. But this can remain an open minor issue for the future.

gwarmstrong avatar Mar 19 '20 18:03 gwarmstrong

I understand. Counter could potentially be very helpful in optimizing other parts of this program. We can track down these other cases. If we have to use Counter in at least one place, that's more reason to use it everywhere.

qiyunzhu avatar Mar 19 '20 19:03 qiyunzhu

@gwarmstrong I am coming back to this question. In issue #37, I created a realistic test dataset, based on which I performed a survey and some benchmarks.

Now there are three solutions:

  1. Original, pure-Python solution:

def count_list(lst):
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

  2. Counter.

  3. Based on the original solution, but with the plain dict replaced by defaultdict:

from collections import defaultdict

def count_list(lst):
    res = defaultdict(int)
    for x in lst:
        res[x] += 1
    return res

Theoretically, defaultdict has a higher initial cost than dict, but as the number of identical keys grows, defaultdict will outperform dict.
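A quick way to probe that trade-off (a sketch, independent of the Woltka benchmarks below; the list contents are synthetic):

import timeit
from collections import defaultdict

def count_dict(lst):
    # plain dict with .get(): one extra method call per element
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

def count_default(lst):
    # defaultdict pays an instantiation cost but skips .get()
    res = defaultdict(int)
    for x in lst:
        res[x] += 1
    return res

few_repeats = list(range(1000))      # 1,000 distinct keys
many_repeats = list('ACGT') * 250    # 4 keys repeated 250 times each
for data in (few_repeats, many_repeats):
    print(timeit.timeit(lambda: count_dict(data), number=2000),
          timeit.timeit(lambda: count_default(data), number=2000))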

I first tested the gOTU protocol on the five input alignment files. Runtimes:

  • Solution 1: 1:14.29
  • Solution 2: 1:17.31
  • Solution 3: 1:16.42

Therefore, the choice of count_list implementation has a modest but measurable impact on the overall performance of Woltka. The original solution appears to be the fastest.

I then summarized the ambiguous alignments in sample S01. Of all 1,342,487 alignments, 424,284 (31.6%) are ambiguous. They belong to 134,066 query sequences (~3.16 hits per sequence). Their distribution is:

[figure: distribution of hits per ambiguous query sequence]
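(For reference, the ~3.16 hits-per-sequence figure is simply the ratio of ambiguous alignments to the query sequences they belong to:)

>>> 424284 / 134066   # ambiguous alignments per ambiguous query
3.1647...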

Note: unique alignments don't go through the count_list function, so they are omitted from the following benchmark.

I ran the three solutions against each of the 134,066 lists. Result:

  • Solution 1: 84.1 ms ± 250 µs
  • Solution 2: 268 ms ± 330 µs
  • Solution 3: 116 ms ± 286 µs

Again, the plain-dict solution is significantly faster than both Counter and the defaultdict variant.
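For anyone who wants to reproduce a comparison of this shape, a rough sketch (the hit lists below are synthetic stand-ins of ~3 hits each, matching the average above, not the actual S01 data):

import random
import string
import timeit
from collections import Counter, defaultdict

def count_list(lst):                 # solution 1: plain dict
    res = {}
    for x in lst:
        res[x] = res.get(x, 0) + 1
    return res

def count_list_dd(lst):              # solution 3: defaultdict variant
    res = defaultdict(int)
    for x in lst:
        res[x] += 1
    return res

# Synthetic stand-in for the 134,066 ambiguous hit lists (~3 hits each).
lists = [random.choices(string.ascii_uppercase, k=3) for _ in range(134066)]

for name, fn in (('dict', count_list), ('Counter', Counter),
                 ('defaultdict', count_list_dd)):
    print(name, timeit.timeit(lambda: [fn(lst) for lst in lists], number=1))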

Therefore, I propose to keep the original solution.

qiyunzhu avatar Apr 18 '20 04:04 qiyunzhu