otp Further optimize binary matching

This pull request builds on the improvements in #6259 to further optimize binary matching. See the individual commit messages for more details.

Oct 13 '22 06:10 bjorng

CT Test Results

      4 files   406 suites 52m 3s :stopwatch: 2 076 tests 2 027 :heavy_check_mark: 49 :zzz: 0 :x: 6 361 runs 6 294 :heavy_check_mark: 67 :zzz: 0 :x:

Results for commit 049c73a2.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Oct 13 '22 06:10 github-actions[bot]

Would this optimise something like the following (which is how I'd decode varints or Unsigned LEB128 - common in many compact networking protocols):

varint(<<0:1, Len:7, Rest/bits>>) ->
    continue(Rest, Len);
varint(<<1:1, Len1:7, 0:1, Len2:7, Rest/bits>>) ->
    Len = (Len2 bsl 7) bor Len1,
    continue(Rest, Len);
...

Oct 17 '22 08:10 michalmuskala

@michalmuskala Yes, this code will be nicely optimized by this pull request and #6259.

Internally in the compiler, your example would be rewritten to something similar like this:

varint(Bin) ->
    case Bin of
        <<X:1, Tail/bits>> ->
            case X of
                0 ->
                    case Tail of
                        <<Len:7, Rest/bits>> ->
                            continue(Rest, Len)
                    end;
                1 ->
                    case Tail of
                        <<Len1:7, 0:1, Len2:7, Rest/bits>> ->
                            Len = (Len2 bsl 7) bor Len1,
                            continue(Rest, Len)
                    end
            end
    end.

#6259 makes matching of segments having "odd" sizes such as 1 and 7 more efficient than in OTP 25.

This pull request will make the matching of <Len1:7, 0:1, Len2:7, Rest/bits>> more efficient. The entire match will be contained in one bs_match instruction:

    {bs_match,{f,4},
              {x,1},
              {commands,[{ensure_at_least,15,1},
                         {integer,2,{literal,[]},7,1,{x,2}},
                         {'=:=',3,1,0},
                         {integer,3,{literal,[]},7,1,{x,0}},
                         {get_tail,3,1,{x,1}}]}}.

Before this pull request, there would be two bs_match instructions with a bs_match_string instruction in between:

    {bs_match,{f,4},
              {x,1},
              {commands,[{ensure_at_least,7,1},
                         {integer,2,{literal,[]},7,1,{x,2}}]}}.
    {test,bs_match_string,{f,4},[{x,1},1,{string,<<0>>}]}.
    {bs_match,{f,4},
              {x,1},
              {commands,[{ensure_at_least,7,1},
                         {integer,3,{literal,[]},7,1,{x,0}},
                         {get_tail,3,1,{x,1}}]}}.

Oct 17 '22 11:10 bjorng

This looks great! Thank you for explaining.

It's a shame the two sub-byte reads are separated - I assume always reading full bytes is more efficient. In this case in particular it would be better to read <<Tag:1, Len:7, ...>> since we'll need the Len in both branches either way.

Is this something it would make sense to optimise in the compiler? Or perhaps structuring the code differently could help with that

Oct 17 '22 15:10 michalmuskala

It's a shame the two sub-byte reads are separated - I assume always reading full bytes is more efficient. In this case in particular it would be better to read <<Tag:1, Len:7, ...>> since we'll need the Len in both branches either way.

Well, doing one read instead of two reads is faster. (Only one read instead of two would be needed for extracting a "short" length.) Since #6259, there is less difference between reading whole bytes and parts of bytes or doing unaligned reads.

It is in fact possible to restructure the source code so we avoid separating the first two sub-byte reads:

varint(Bin) ->
    case Bin of
        <<X:1, Len1:7, Tail/bits>> ->
            case X of
                0 ->
                    continue(Tail, Len1);
                1 ->
                    case Tail of
                        <<0:1, Len2:7, Rest/bits>> ->
                            Len = (Len2 bsl 7) bor Len1,
                            continue(Rest, Len)
                    end
            end
    end.

I suppose that it would be possible for the compiler to do that kind of transformation by itself. I don't know how tricky that would be or how much it would give.

Oct 17 '22 17:10 bjorng

otp otp copied to clipboard

Further optimize binary matching

CT Test Results

Artifacts

otp
otp copied to clipboard