[Performance] [Python] Use bitwise check instead of in operator within list
Grammar
e0: e2 | T0;
e2: T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15;
T0: 't0';
T1: 't1';
T2: 't2';
T3: 't3';
T4: 't4';
T5: 't5';
T6: 't6';
T7: 't7';
T8: 't8';
T9: 't9';
T10: 't10';
T11: 't11';
T12: 't12';
T13: 't13';
T14: 't14';
T15: 't15';
Generated code
ANTLR generates the following code for e0:
def e0(self):
    localctx = TestParser.E0Context(self, self._ctx, self.state)
    self.enterRule(localctx, 2, self.RULE_e0)
    try:
        self.state = 10
        self._errHandler.sync(self)
        token = self._input.LA(1)
        if token in [TestParser.T1, TestParser.T2, TestParser.T3, TestParser.T4, TestParser.T5, TestParser.T6, TestParser.T7, TestParser.T8, TestParser.T9, TestParser.T10, TestParser.T11, TestParser.T12, TestParser.T13, TestParser.T14, TestParser.T15]:
            self.enterOuterAlt(localctx, 1)
            self.state = 8
            self.e2()
            pass
        elif token in [TestParser.T0]:
            self.enterOuterAlt(localctx, 2)
            self.state = 9
            self.match(TestParser.T0)
            pass
        else:
            raise NoViableAltException(self)
    except RecognitionException as re:
        localctx.exception = re
        self._errHandler.reportError(self, re)
        self._errHandler.recover(self, re)
    finally:
        self.exitRule()
    return localctx
The problem
The line if token in [TestParser.T1, TestParser.T2, TestParser.T3, TestParser.T4, TestParser.T5, TestParser.T6, TestParser.T7, TestParser.T8, TestParser.T9, TestParser.T10, TestParser.T11, TestParser.T12, TestParser.T13, TestParser.T14, TestParser.T15]
is far from efficient: it allocates a list on every e0
call, and it has O(N) complexity that depends on the number of tokens. The more tokens there are to check, the slower the code.
This is quite a common case, and it matters especially for Python, because Python has no switch-case
construct and cannot optimize this code during interpretation. A bitwise check might be efficient for other runtimes as well, but ANTLR generates a switch-case
for them, which their compilers should optimize.
Solution
It can be replaced by a bitwise check, something like (1 << token) & 0xFFFF != 0
. We already have a similar check for TestSetInline
.
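To make the idea concrete, here is a minimal sketch (not ANTLR's actual generated code). It assumes the token types T1..T15 are the integers 2..16, following ANTLR's numbering of the example grammar (T0 = 1, T1 = 2, …):

```python
# Current approach: O(N) scan, plus a fresh list allocation on every call.
def check_by_list(token):
    return token in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# Proposed approach: fold the token set into a constant bit mask once...
MASK_T1_T15 = sum(1 << t for t in range(2, 17))

# ...and test membership with one shift, one AND, one comparison.
def check_by_mask(token):
    return (1 << token) & MASK_T1_T15 != 0

# Both checks agree on every token type.
assert all(check_by_list(t) == check_by_mask(t) for t in range(0, 32))
```

Note that operator precedence works out in Python: `&` binds tighter than `!=`, so `(1 << token) & MASK != 0` parses as `((1 << token) & MASK) != 0`.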
Benchmark
https://gist.github.com/KvanTTT/e3b355f7e321fe7f52e11ea1aa0ecbce#file-check-range-vs-mask-py
check_by_if_test: 438 ns
check_by_range_test: 619 ns
check_by_mask_test: 202 ns
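The numbers above can be approximated with a timeit micro-benchmark along these lines; the exact function bodies live in the linked gist, so the implementations below are assumptions:

```python
import timeit

TOKENS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
MASK = sum(1 << t for t in TOKENS)  # one bit per token type

def check_by_range_test(i):
    return i in TOKENS              # O(N) list scan

def check_by_mask_test(i):
    return (1 << i) & MASK != 0     # O(1) bit test

N = 1_000_000
for fn in (check_by_range_test, check_by_mask_test):
    total = timeit.timeit(lambda: fn(16), number=N)
    print(f"{fn.__name__}: {total / N * 1e9:.0f} ns per call")
```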
Are you certain of your benchmark? The benefits seem too small compared to what they should be. Could you also compare with a basic constant-folding version? Aka:
return i in [1,2,3,…,14,15]
The disassembly of this is quite interesting: the list is loaded as a single constant. That also makes the bytecode smaller. And since the list is a constant, it's instantiated only once, at parse time.
Are you certain of your benchmark? The benefits seem too small compared to what they should be.
3 times performance improvement is small? You can try the benchmark yourself.
return i in [1,2,3,…,14,15]
ANTLR now uses refs instead of plain int literals. That could affect the generated bytecode. I'll investigate whether it's enough to replace refs with literals.
3 times performance improvement is small?
Well, yes. Compared to 15×2 = 30 dereferences, I was expecting one or two orders of magnitude, not just 1/3 of the time.
ANTLR now uses refs instead of plain int literals.
Yeah… a terrible, terrible, terrible idea with Python, where there are no constants and thus no optimization is possible.
Interesting: it looks like literals are much more effective than refs, and maybe even more effective than the bitwise check:
check_by_if_test: 436 ns
check_by_range_test: 650 ns
check_by_range_with_literals_test: 176 ns
check_by_bitwise_test: 203 ns
Refs
Code
def check_by_range_test(i):
    return i in [C.C0, C.C1, C.C2, C.C3,
                 C.C4, C.C5, C.C6, C.C7,
                 C.C8, C.C9, C.C10, C.C11,
                 C.C12, C.C13, C.C14, C.C15]
Bytecode
65 0 LOAD_FAST 0 (i)
2 LOAD_GLOBAL 0 (C)
4 LOAD_ATTR 1 (C0)
6 LOAD_GLOBAL 0 (C)
8 LOAD_ATTR 2 (C1)
10 LOAD_GLOBAL 0 (C)
12 LOAD_ATTR 3 (C2)
14 LOAD_GLOBAL 0 (C)
16 LOAD_ATTR 4 (C3)
66 18 LOAD_GLOBAL 0 (C)
20 LOAD_ATTR 5 (C4)
22 LOAD_GLOBAL 0 (C)
24 LOAD_ATTR 6 (C5)
26 LOAD_GLOBAL 0 (C)
28 LOAD_ATTR 7 (C6)
30 LOAD_GLOBAL 0 (C)
32 LOAD_ATTR 8 (C7)
67 34 LOAD_GLOBAL 0 (C)
36 LOAD_ATTR 9 (C8)
38 LOAD_GLOBAL 0 (C)
40 LOAD_ATTR 10 (C9)
42 LOAD_GLOBAL 0 (C)
44 LOAD_ATTR 11 (C10)
46 LOAD_GLOBAL 0 (C)
48 LOAD_ATTR 12 (C11)
68 50 LOAD_GLOBAL 0 (C)
52 LOAD_ATTR 13 (C12)
54 LOAD_GLOBAL 0 (C)
56 LOAD_ATTR 14 (C13)
58 LOAD_GLOBAL 0 (C)
60 LOAD_ATTR 15 (C14)
62 LOAD_GLOBAL 0 (C)
64 LOAD_ATTR 16 (C15)
65 66 BUILD_TUPLE 16
68 COMPARE_OP 6 (in)
70 RETURN_VALUE
Literals
Code
def check_by_range_with_literals_test(i):
    return i in [0, 1, 2, 3,
                 4, 5, 6, 7,
                 8, 9, 10, 11,
                 12, 13, 14, 15]
Bytecode
72 0 LOAD_FAST 0 (i)
2 LOAD_CONST 1 ((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
4 COMPARE_OP 6 (in)
6 RETURN_VALUE
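The disassembly above can be reproduced with the standard-library dis module. In CPython, an all-literal list on the right-hand side of `in` is folded by the compiler into a single tuple constant stored in co_consts, which is why the whole check costs one LOAD_CONST plus one containment test:

```python
import dis

def check_by_range_with_literals_test(i):
    return i in [0, 1, 2, 3, 4, 5, 6, 7,
                 8, 9, 10, 11, 12, 13, 14, 15]

# Prints the bytecode, including the folded tuple constant.
dis.dis(check_by_range_with_literals_test)

# The folded tuple is visible among the function's constants.
assert tuple(range(16)) in check_by_range_with_literals_test.__code__.co_consts
```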
Well, yes. Compared to 15×2 = 30 dereferences, I was expecting one or two orders of magnitude, not just 1/3 of the time.
Maybe dereferencing is a very fast operation, but the Python interpreter can't optimize an expression that uses references.
Yeah… a terrible, terrible, terrible idea with Python, where there are no constants and thus no optimization is possible.
Agreed. I suggest replacing all token-name refs with literals in the Python target for efficiency.
Your benchmark indeed has a bias: since you are calling the function with a fixed value, you don't see the full picture.
check_by_bitwise_test has a constant execution time, always 100 ns on my system.
check_by_if_test starts at 78 ns and ends up at 465 ns.
check_by_range_test starts at 426 ns and ends up at 490 ns.
check_by_range_with_literals_test starts at 55 ns and ends up at 110 ns.
So… bitwise is faster if there are enough elements. I think you can determine the value of "enough" by rolling a die (it will be too machine-specific to determine once and for all).
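The bias is easy to demonstrate: probing the first versus the last element of the set changes the cost of the list scan but not of the mask check. This is an illustrative sketch, not the original benchmark code:

```python
import timeit

TOKENS = list(range(2, 17))          # T1..T15 as integers 2..16 (assumed)
MASK = sum(1 << t for t in TOKENS)

def check_by_range(i):
    return i in TOKENS               # short-circuits on early hits

def check_by_mask(i):
    return (1 << i) & MASK != 0      # same cost for any probed value

N = 200_000
for i in (TOKENS[0], TOKENS[-1]):    # best case vs. worst case for the scan
    t_list = timeit.timeit(lambda: check_by_range(i), number=N)
    t_mask = timeit.timeit(lambda: check_by_mask(i), number=N)
    print(f"i={i:2d}: list {t_list / N * 1e9:5.0f} ns, mask {t_mask / N * 1e9:5.0f} ns")
```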
Anyway, 110 ns vs. 100 ns is not a big difference, and usually there are not that many tokens (up to 20). Constants are used not only in this place but in many others. Also, it's much simpler to replace refs with literals.
Created a patch script that implements this optimization: https://gist.github.com/amykyta3/8285559e95074c4431c2836d78b36530