woltka icon indicating copy to clipboard operation
woltka copied to clipboard

Optimize cigar_to_lens function

Open qiyunzhu opened this issue 4 years ago • 1 comments

What is CIGAR and why optimize

The function cigar_to_lens is to calculate the length of an alignment from the CIGAR string in a SAM file. The specification of SAM and CIGAR can be found here. This function needs to be executed for numerous times in an analysis (once per line), and is among the bottlenecking steps.

In a test on the medium test dataset S01:

  • Read SAM file but don't parse: 0m9.875s.
  • Read and parse SAM file, but skip CIGAR: 0m12.172s.
  • Read and parse SAM file, including CIGAR: 0m15.372s.

Therefore it is worth to further optimize this function.

Frequencies of strings and operators

In test dataset S01 (generated using SHOGUN/Bowtie2), there are 1342487 lines, of which the top 10 most frequent CIGAR strings are:

1325416 150M
2473 146M1D4M
1378 145M1D5M
684 144M1D6M
636 4M1I145M
552 145M1I4M
364 144M1I5M
353 143M1D7M
268 5M1I144M
211 143M1I6M

The frequencies of the operator codes are:

M 1364229
D 11833
I 9909

Original function

def cigar_to_lens(cigar):
    """Extract lengths from a CIGAR string.

    Parameters
    ----------
    cigar : str
        CIGAR string.

    Returns
    -------
    int
        Alignment length.
    int
        Offset in subject sequence.

    Raises
    ------
    ValueError
        CIGAR string is missing.
    """
    if cigar in ('', '*'):
        raise ValueError('Missing CIGAR string.')
    align, offset = 0, 0
    ops = 'DHIMNPSX='
    n = ''  # current step size
    for c in cigar:
        if c in ops:
            if c in 'M=X':
                align += int(n)
            if c in 'MDN=X':
                offset += int(n)
            n = ''
        else:
            n += c
    return align, offset

Step 1: Remove validity check.

There is no need to test for a invalid CIGAR string as long as the SAM file is valid.

%timeit before('150M')
1.43 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit after('150M')
1.3 µs ± 6.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Step 2: Suppress variable / constant.

Instead of:

ops = 'DHIMNPSX='
for c in cigar:
    if c in ops:

Do:

for c in cigar:
    if c in 'DHIMNPSX=':

Also test global constant outside function.

%timeit before('150M')
1.26 µs ± 5.17 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit after('150M')
1.21 µs ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit global('150M')
1.35 µs ± 5.03 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Step 3: Sort operator codes by frequency.

So that M. D and I come first:

%timeit 'M' in 'DHIMNPSX='
40.2 ns ± 0.0414 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit 'M' in 'MDIHNPSX='
38.4 ns ± 0.0271 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Step 4: (DON'T) Replace string with tuple.

This questions depends on actual scenarios. Sometimes it gains performance and sometimes it loses. In our scenario, replacing string with tuple results in a gain of performance if query is in the string, however there is a severe loss of performance if not. Overall, since a big proportion of CIGAR characters are NOT operators, this move is not favorable.

'M' in 'MDIHNPSX='
42.5 ns ± 0.0288 ns
'M' in ('M', 'D', 'I', 'H', 'N', 'P', 'S', 'X', '=')
29.8 ns ± 0.0186 ns
'1' in 'MDIHNPSX='
45.3 ns ± 0.317 ns
'1' in ('M', 'D', 'I', 'H', 'N', 'P', 'S', 'X', '=')
232 ns ± 8.12 ns

Step 5: (DON'T) Use str.find().

'M' in 'MX='
38.1 ns ± 0.19 ns
'D' in 'MX='
39.6 ns ± 0.165 ns
'I' in 'MX='
39.4 ns ± 0.0296 ns

'MX='.find('M'): 178 ns ± 0.315 ns 'MX='.find('D'): 184 ns ± 0.0941 ns 'MX='.find('I'): 183 ns ± 0.224 ns

Step 6: Optimize switch cases.

def before(cigar):
   ...
    if c in 'M=X':
        align += int(n)
    if c in 'MDN=X':
        offset += int(n)
   ...
def after(cigar):
   ...
    if c in 'M=X':
        i = int(n)
        align += i
        offset += i
    elif c in 'DN':
        offset += int(n)
   ...

before('150M'): 1.26 µs ± 6 ns after('150M'): 1 µs ± 10.8 ns before('137M1I3M1D9M'): 3.53 µs ± 33.5 ns after('137M1I3M1D9M'): 2.85 µs ± 12.6 ns

Step 7: (DON'T) Use str.isdigit().

def after(cigar):
    ...
    if c.isdigit():
        n += c
    else:
        ...
        n = ''
    ...

before('150M'): 1 µs ± 9.81 ns after('150M'): 1.27 µs ± 4.95 ns before('137M1I3M1D9M'): 2.87 µs ± 45.9 ns after('137M1I3M1D9M'): 3.48 µs ± 48.9 ns

Step 8: (DON'T) Use regular expression.

It is a pity that with regular expression there is a very elegant solution which is also more consistent with the official documentation. However its performance is worse.

p = re.compile(r'([0-9]+)([MIDNSHPX=])')
def after(cigar):
    align, offset = 0, 0
    for n, op in p.findall(cigar):
        if op in 'M=X':
            i = int(n)
            align += i
            offset += i
        elif op in 'DN':
            offset += int(n)
    return align, offset

before('150M'): 1 µs ± 9.81 ns after('150M'): 1.71 µs ± 17.5 ns before('137M1I3M1D9M'): 2.87 µs ± 45.9 ns after('137M1I3M1D9M'): 4.18 µs ± 8.8 ns

(Also tested finditer and performance is even worse.)

Step 9: Tweak arithmetic.

def after(cigar):
    align, offset = 0, 0
    n = ''
    for c in cigar:
        if c in 'MDIHNPSX=':
            if c in 'M=X':
                align += int(n)
            elif c in 'DN':
                offset += int(n)
            n = ''
        else:
            n += c
    return align, align + offset

before('150M'): 1 µs ± 9.81 ns after('150M'): 997 ns ± 7.88 ns before('137M1I3M1D9M'): 2.87 µs ± 45.9 ns after('137M1I3M1D9M'): 2.77 µs ± 18.4 ns

Final result

Total runtime:

  • Before optimization: 0m15.341s.
  • After optimization: 0m14.946s.
  • Don't parse: 0m12.253s.

qiyunzhu avatar Apr 12 '20 23:04 qiyunzhu

Update: Because some CIGAR strings show up frequently (e.g., 150M), one can use functools.lru_cache to significantly accelerate the calculation. Simply add a decorator: @lru_cache(maxsize=128) before the function.

Benchmarks on a real SHOGUN / Bowtie2 alignment file, which has 1342487 alignments:

CIGAR parsing alone:

  • Before: 714 ms ± 18.4 ms per loop
  • After: 85.3 ms ± 1.51 ms per loop

Benchmark on 100000 SAM alignments:

entire plain_mapper function:

  • Before: 262 ms ± 2.85 ms per loop
  • After: 214 ms ± 1.26 ms per loop

See PR #61

qiyunzhu avatar Jun 27 '20 07:06 qiyunzhu