woltka
woltka copied to clipboard
Optimize cigar_to_lens function
What is CIGAR and why optimize
The function cigar_to_lens
is to calculate the length of an alignment from the CIGAR string in a SAM file. The specification of SAM and CIGAR can be found here. This function needs to be executed for numerous times in an analysis (once per line), and is among the bottlenecking steps.
In a test on the medium test dataset S01
:
- Read SAM file but don't parse: 0m9.875s.
- Read and parse SAM file, but skip CIGAR: 0m12.172s.
- Read and parse SAM file, including CIGAR: 0m15.372s.
Therefore it is worth to further optimize this function.
Frequencies of strings and operators
In test dataset S01
(generated using SHOGUN/Bowtie2), there are 1342487 lines, of which the top 10 most frequent CIGAR strings are:
1325416 150M
2473 146M1D4M
1378 145M1D5M
684 144M1D6M
636 4M1I145M
552 145M1I4M
364 144M1I5M
353 143M1D7M
268 5M1I144M
211 143M1I6M
The frequencies of the operator codes are:
M 1364229
D 11833
I 9909
Original function
def cigar_to_lens(cigar):
"""Extract lengths from a CIGAR string.
Parameters
----------
cigar : str
CIGAR string.
Returns
-------
int
Alignment length.
int
Offset in subject sequence.
Raises
------
ValueError
CIGAR string is missing.
"""
if cigar in ('', '*'):
raise ValueError('Missing CIGAR string.')
align, offset = 0, 0
ops = 'DHIMNPSX='
n = '' # current step size
for c in cigar:
if c in ops:
if c in 'M=X':
align += int(n)
if c in 'MDN=X':
offset += int(n)
n = ''
else:
n += c
return align, offset
Step 1: Remove validity check.
There is no need to test for a invalid CIGAR string as long as the SAM file is valid.
%timeit before('150M')
1.43 µs ± 7.41 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit after('150M')
1.3 µs ± 6.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Step 2: Suppress variable / constant.
Instead of:
ops = 'DHIMNPSX='
for c in cigar:
if c in ops:
Do:
for c in cigar:
if c in 'DHIMNPSX=':
Also test global constant outside function.
%timeit before('150M')
1.26 µs ± 5.17 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit after('150M')
1.21 µs ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit global('150M')
1.35 µs ± 5.03 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Step 3: Sort operator codes by frequency.
So that M
. D
and I
come first:
%timeit 'M' in 'DHIMNPSX='
40.2 ns ± 0.0414 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit 'M' in 'MDIHNPSX='
38.4 ns ± 0.0271 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Step 4: (DON'T) Replace string with tuple.
This questions depends on actual scenarios. Sometimes it gains performance and sometimes it loses. In our scenario, replacing string with tuple results in a gain of performance if query is in the string, however there is a severe loss of performance if not. Overall, since a big proportion of CIGAR characters are NOT operators, this move is not favorable.
'M' in 'MDIHNPSX='
42.5 ns ± 0.0288 ns
'M' in ('M', 'D', 'I', 'H', 'N', 'P', 'S', 'X', '=')
29.8 ns ± 0.0186 ns
'1' in 'MDIHNPSX='
45.3 ns ± 0.317 ns
'1' in ('M', 'D', 'I', 'H', 'N', 'P', 'S', 'X', '=')
232 ns ± 8.12 ns
Step 5: (DON'T) Use str.find()
.
'M' in 'MX='
38.1 ns ± 0.19 ns
'D' in 'MX='
39.6 ns ± 0.165 ns
'I' in 'MX='
39.4 ns ± 0.0296 ns
'MX='.find('M')
: 178 ns ± 0.315 ns
'MX='.find('D')
: 184 ns ± 0.0941 ns
'MX='.find('I')
: 183 ns ± 0.224 ns
Step 6: Optimize switch cases.
def before(cigar):
...
if c in 'M=X':
align += int(n)
if c in 'MDN=X':
offset += int(n)
...
def after(cigar):
...
if c in 'M=X':
i = int(n)
align += i
offset += i
elif c in 'DN':
offset += int(n)
...
before('150M')
: 1.26 µs ± 6 ns
after('150M')
: 1 µs ± 10.8 ns
before('137M1I3M1D9M')
: 3.53 µs ± 33.5 ns
after('137M1I3M1D9M')
: 2.85 µs ± 12.6 ns
Step 7: (DON'T) Use str.isdigit()
.
def after(cigar):
...
if c.isdigit():
n += c
else:
...
n = ''
...
before('150M')
: 1 µs ± 9.81 ns
after('150M')
: 1.27 µs ± 4.95 ns
before('137M1I3M1D9M')
: 2.87 µs ± 45.9 ns
after('137M1I3M1D9M')
: 3.48 µs ± 48.9 ns
Step 8: (DON'T) Use regular expression.
It is a pity that with regular expression there is a very elegant solution which is also more consistent with the official documentation. However its performance is worse.
p = re.compile(r'([0-9]+)([MIDNSHPX=])')
def after(cigar):
align, offset = 0, 0
for n, op in p.findall(cigar):
if op in 'M=X':
i = int(n)
align += i
offset += i
elif op in 'DN':
offset += int(n)
return align, offset
before('150M')
: 1 µs ± 9.81 ns
after('150M')
: 1.71 µs ± 17.5 ns
before('137M1I3M1D9M')
: 2.87 µs ± 45.9 ns
after('137M1I3M1D9M')
: 4.18 µs ± 8.8 ns
(Also tested finditer
and performance is even worse.)
Step 9: Tweak arithmetic.
def after(cigar):
align, offset = 0, 0
n = ''
for c in cigar:
if c in 'MDIHNPSX=':
if c in 'M=X':
align += int(n)
elif c in 'DN':
offset += int(n)
n = ''
else:
n += c
return align, align + offset
before('150M')
: 1 µs ± 9.81 ns
after('150M')
: 997 ns ± 7.88 ns
before('137M1I3M1D9M')
: 2.87 µs ± 45.9 ns
after('137M1I3M1D9M')
: 2.77 µs ± 18.4 ns
Final result
Total runtime:
- Before optimization: 0m15.341s.
- After optimization: 0m14.946s.
- Don't parse: 0m12.253s.
Update: Because some CIGAR strings show up frequently (e.g., 150M
), one can use functools.lru_cache
to significantly accelerate the calculation. Simply add a decorator: @lru_cache(maxsize=128)
before the function.
Benchmarks on a real SHOGUN / Bowtie2 alignment file, which has 1342487 alignments:
CIGAR parsing alone:
- Before: 714 ms ± 18.4 ms per loop
- After: 85.3 ms ± 1.51 ms per loop
Benchmark on 100000 SAM alignments:
entire plain_mapper
function:
- Before: 262 ms ± 2.85 ms per loop
- After: 214 ms ± 1.26 ms per loop
See PR #61