antlr4 icon indicating copy to clipboard operation
antlr4 copied to clipboard

[Performance] [Python] Use bitwise check instead of in operator within list

Open KvanTTT opened this issue 2 years ago • 8 comments

Grammar

e0: e2 | T0;
e2: T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 | T10 | T11 | T12 | T13 | T14 | T15;
T0: 't0';
T1: 't1';
T2: 't2';
T3: 't3';
T4: 't4';
T5: 't5';
T6: 't6';
T7: 't7';
T8: 't8';
T9: 't9';
T10: 't10';
T11: 't11';
T12: 't12';
T13: 't13';
T14: 't14';
T15: 't15';

Generated code

ANTLR generates the following code for e0:

def e0(self):
    localctx = TestParser.E0Context(self, self._ctx, self.state)
    self.enterRule(localctx, 2, self.RULE_e0)
    try:
        self.state = 10
        self._errHandler.sync(self)
        token = self._input.LA(1)
        if token in [TestParser.T1, TestParser.T2, TestParser.T3, TestParser.T4, TestParser.T5, TestParser.T6, TestParser.T7, TestParser.T8, TestParser.T9, TestParser.T10, TestParser.T11, TestParser.T12, TestParser.T13, TestParser.T14, TestParser.T15]:
            self.enterOuterAlt(localctx, 1)
            self.state = 8
            self.e2()
            pass
        elif token in [TestParser.T0]:
            self.enterOuterAlt(localctx, 2)
            self.state = 9
            self.match(TestParser.T0)
            pass
        else:
            raise NoViableAltException(self)
    except RecognitionException as re:
        localctx.exception = re
        self._errHandler.reportError(self, re)
        self._errHandler.recover(self, re)
    finally:
        self.exitRule()
    return localctx

The problem

The line if token in [TestParser.T1, TestParser.T2, TestParser.T3, TestParser.T4, TestParser.T5, TestParser.T6, TestParser.T7, TestParser.T8, TestParser.T9, TestParser.T10, TestParser.T11, TestParser.T12, TestParser.T13, TestParser.T14, TestParser.T15] is far from efficient because it allocates list on every e0 call and it has O(N) complexity that depends on tokens count. The more tokens to check the slower code we have.

It's quite common case and it's especially important for Python because Python doesn't support switch case construction and it's unable to optimize the code during intepretation. Maybe bitwise check also efficient for other runtimes as well but ANTLR generates switch case for them and it should be optimized by their compilers.

Solution

It can be replaced by bitwise checking, something like (1 << token) & 0xFFFF != 0. The similar check we have for TestSetInline.

Benchmark

https://gist.github.com/KvanTTT/e3b355f7e321fe7f52e11ea1aa0ecbce#file-check-range-vs-mask-py

check_by_if_test: 438 ns
check_by_range_test: 619 ns
check_by_mask_test: 202 ns

KvanTTT avatar May 08 '22 18:05 KvanTTT

Are you certain of your benchmark? The benefits seem to small compared to what it should be. Could you also compare with a basic constant unfolding version? Aka:

return i in [1,2,3,…,14,15]

The disassembly of this is quite interesting: the list is loaded as a single constant. It thus make it even smaller in size. And since the list is a constant, it's instanciated only once, at parsing time.

pinaraf avatar May 08 '22 18:05 pinaraf

Are you certain of your benchmark? The benefits seem to small compared to what it should be.

3 times performance improvement is small? You can try the benchmark yourself.

return i in [1,2,3,…,14,15]

ANTLR now uses refs instead plain int literals. It could affect generated bytecode. I'll investigate if it's enough to replace refs with literals.

KvanTTT avatar May 08 '22 18:05 KvanTTT

3 times performance improvement is small?

Well, yes. Compared to 15×2 = 30 dereferences, I was expecting one or two orders of magnitude, not just 1/3 of the time.

ANTLR now uses refs instead plain int literals.

Yeah… a terrible terrible terrible idea with Python where there is no constant, thus no optimization possible.

pinaraf avatar May 08 '22 18:05 pinaraf

Interesting, it looks like literals much more effective than refs, and maybe more effective than bitwise check:

check_by_if_test: 436 ns
check_by_range_test: 650 ns
check_by_range_with_literals_test: 176 ns
check_by_bitwise_test: 203 ns

Refs

Code

def check_by_range_test(i):
    return i in [C.C0, C.C1, C.C2, C.C3,
                 C.C4, C.C5, C.C6, C.C7,
                 C.C8, C.C9, C.C10, C.C11,
                 C.C12, C.C13, C.C14, C.C15]

Bytecode

65           0 LOAD_FAST                0 (i)
              2 LOAD_GLOBAL              0 (C)
              4 LOAD_ATTR                1 (C0)
              6 LOAD_GLOBAL              0 (C)
              8 LOAD_ATTR                2 (C1)
             10 LOAD_GLOBAL              0 (C)
             12 LOAD_ATTR                3 (C2)
             14 LOAD_GLOBAL              0 (C)
             16 LOAD_ATTR                4 (C3)

 66          18 LOAD_GLOBAL              0 (C)
             20 LOAD_ATTR                5 (C4)
             22 LOAD_GLOBAL              0 (C)
             24 LOAD_ATTR                6 (C5)
             26 LOAD_GLOBAL              0 (C)
             28 LOAD_ATTR                7 (C6)
             30 LOAD_GLOBAL              0 (C)
             32 LOAD_ATTR                8 (C7)

 67          34 LOAD_GLOBAL              0 (C)
             36 LOAD_ATTR                9 (C8)
             38 LOAD_GLOBAL              0 (C)
             40 LOAD_ATTR               10 (C9)
             42 LOAD_GLOBAL              0 (C)
             44 LOAD_ATTR               11 (C10)
             46 LOAD_GLOBAL              0 (C)
             48 LOAD_ATTR               12 (C11)

 68          50 LOAD_GLOBAL              0 (C)
             52 LOAD_ATTR               13 (C12)
             54 LOAD_GLOBAL              0 (C)
             56 LOAD_ATTR               14 (C13)
             58 LOAD_GLOBAL              0 (C)
             60 LOAD_ATTR               15 (C14)
             62 LOAD_GLOBAL              0 (C)
             64 LOAD_ATTR               16 (C15)

 65          66 BUILD_TUPLE             16
             68 COMPARE_OP               6 (in)
             70 RETURN_VALUE

Literals

Code

def check_by_range_with_literals_test(i):
    return i in [0, 1, 2, 3,
                 4, 5, 6, 7,
                 8, 9, 10, 11,
                 12, 13, 14, 15]

Bytecode

literals:
 72           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               1 ((0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
              4 COMPARE_OP               6 (in)
              6 RETURN_VALUE

KvanTTT avatar May 08 '22 19:05 KvanTTT

Well, yes. Compared to 15×2 = 30 dereferences, I was expecting one or two orders of magnitude, not just 1/3 of the time.

Maybe dereference is very fast operation but Python interpreter can't optimize expression with references.

Yeah… a terrible terrible terrible idea with Python where there is no constant, thus no optimization possible.

Agree, I suggest replacing all token names with literals for efficiency for Python.

KvanTTT avatar May 08 '22 19:05 KvanTTT

Your benchmark has indeed a bias. Since you are calling the function with a fixed value, you don't see the full picture. check_by_bitwise_test has a constant execution time, always 100ns on my system. check_by_if_test starts at 78ns and ends up at 465ns. check_by_range_test starts at 426ns and ends up at 490ns. check_by_range_with_literals_test starts at 55ns and ends up at 110ns.

So… bitwise is faster if there is enough elements. I think you can determine the value of "enough" by rolling a dice (I think it'll be too specific to be determined once)

pinaraf avatar May 08 '22 19:05 pinaraf

Anyway 110 and 100 is not a big difference and usually there are not so many tokens (up to 20). Constants are used not only in this places but in many others. Also, it's much simpler to replace refs with literals.

KvanTTT avatar May 08 '22 19:05 KvanTTT

Created a patch script that implements this optimization: https://gist.github.com/amykyta3/8285559e95074c4431c2836d78b36530

amykyta3 avatar May 08 '22 19:05 amykyta3