otp icon indicating copy to clipboard operation
otp copied to clipboard

Optimize the copy that occurs in ets:lookup_element

Open RobinMorisset opened this issue 2 years ago • 7 comments

@michalmuskala pointed to me that sometimes ets:lookup_element can be slower than ets:lookup, even when the value is found. For example:

erlperf "ets:lookup(ac_tab, {env, kernel, logger})." "ets:lookup_element(ac_tab, {env, kernel, logger}, 2)."
Code                                                          ||        QPS       Time     Rel
ets:lookup(ac_tab, {env, kernel, logger}).                     1    2859 Ki     349 ns    100%
ets:lookup_element(ac_tab, {env, kernel, logger}, 2).          1    2164 Ki     462 ns     75%

The reason for this is that lookup can find the size of the allocation needed directly in the DbTerm, then copy the whole thing efficiently with copy_shallow_x, whereas lookup_element must first walk all over the value to be copied in size_object_x, and then walk it again in copy_struct_x.

This patch fixes this in three steps:

  • Make db_store_term (used by ets:insert among others) use a specialized version of copy_struct that makes the fields be laid in order (instead of being chaotically interspersed as usual)
  • Finding the size of the needed allocation can now be done efficiently (see db_field_size)
  • And we can use copy_shallow_x for the copy like lookup (just with a small change so that it does not assume that it is copying a tuple).

Result:

PATH=~/otp/bin:$PATH erlperf "ets:lookup(ac_tab, {env, kernel, logger})." "ets:lookup_element(ac_tab, {env, kernel, logger}, 2)."
Code                                                          ||        QPS       Time     Rel
ets:lookup_element(ac_tab, {env, kernel, logger}, 2).          1    2945 Ki     339 ns    100%
ets:lookup(ac_tab, {env, kernel, logger}).                     1    2860 Ki     349 ns     97%

I've tried measuring the potential impact on ets:insert performance by looking at the 2 throughput benchmarks in ets_SUITE. More specifically, I looked at the 50% insert/50% delete mix (since it maximizes the impact of insert) and at the 1 process case (since my change should not affect concurrency in any way). Here are the results for the long benchmark, with 3 runs in each configuration, listing all three kinds of table tested: [set, {write_concurrency, auto}, {read_concurrency, true}]:

  • Baseline: 3.55 / 3.31 / 2.85
  • This patch: 3.20 / 3.08 / 3.44

[set, {write_concurrency, true}, {read_concurrency, true}]:

  • Baseline: 3.31 / 3.11 / 2.78
  • This patch: 3.13 / 3.08 / 3.42

[ordered_set, {write_concurrency, true}, {read_concurrency, true}]:

  • Baseline: 1.51 / 1.56 / 1.17
  • This patch: 1.50 / 1.37 / 1.59

(all running on my M1 MBP)

There is no obvious performance regression, but the noise is sufficiently high that I may have missed some. If anyone has a suggestion for how to get numbers we can trust of the impact on insert, I'd be happy to hear it.

RobinMorisset avatar Sep 01 '22 16:09 RobinMorisset

CT Test Results

       4 files     210 suites   1h 26m 36s :stopwatch: 3 284 tests 3 188 :heavy_check_mark:   96 :zzz: 0 :x: 4 011 runs  3 893 :heavy_check_mark: 118 :zzz: 0 :x:

Results for commit 3dbb7291.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

github-actions[bot] avatar Sep 01 '22 16:09 github-actions[bot]

Thanks for the contribution.

I pushed a memory leak fix in copy_struct_in_order_x to your copy branch.

Question: Did you contemplate looping over copy_struct for each element instead of implementing copy_struct_in_order_x?

sverker avatar Sep 27 '22 16:09 sverker

Thank you for catching the memory leak, I'm sorry for missing it.

I did not think to just loop over copy_struct_x, I agree that it sounds much simpler, I'll try that right now.

RobinMorisset avatar Sep 29 '22 14:09 RobinMorisset

Thank you for catching the memory leak, I'm sorry for missing it.

I just ran ets_SUITE with debug compiled emulator.

> cd $ERL_TOP
> (cd erts/emulator && make debug)
> bin/cerl -debug

For more development tips and tools see https://www.erlang.org/doc/tutorial/debugging.html

sverker avatar Sep 29 '22 17:09 sverker

It turns out that just calling copy_struct for each field of the tuple is actually very tricky. The problem I found is lists: the beginning of the list is not guaranteed by copy_struct to be at the beginning of the space allocated for the field. For example, the tuple {x, "AA"} has the following representation if I store it in ETS with repeated calls to copy_struct:

        OTHER at 0x1465ad398, 0x80 (tuple header, arity = 2)
	IMMED1 at 0x1465ad3a0, 628811 (atom x)
	LIST at 0x1465ad3a8, 0x1465ad3c0 (first cons)
	IMMED1 at 0x1465ad3b0, 1055 (A)
	IMMED1 at 0x1465ad3b8, 59 (NIL)
	IMMED1 at 0x1465ad3c0, 1055 (A)
	LIST at 0x1465ad3c8, 0x1465ad3b0 (second cons)

Note that obj->tpl[2] does point to "AA", but it starts near the end of its allocated space (0x1465ad3c0) and not the beginning (0x1465ad3b0). This results in db_field_size getting the wrong size for the field, and only part of it is copied. I don't see how to fix this, this whole patch relies on:

  • each field is allocated as a block
  • the tuple at the top has pointers to the beginning of each block

I'll document it in an updated commit message/comments.

As for performance, I've figured how to do benchmarking on our internal workload, I'll use it to verify that my original approach does not add a huge cost to ets:insert.

RobinMorisset avatar Oct 03 '22 15:10 RobinMorisset

Oh, I didn't think about the putting-things-last optimization in copy_struct. Hm....

I'm sorry if I persist skeptical of your quite impressive copy_struct_in_order_x(), but as a maintainer I'm a bit stickler about keeping the code volume down if possible.

How about something like this, where we only give special treatment to the list case and delegate the rest to standard copy_struct:

Eterm copy_ets_element(Eterm obj, int sz, Eterm **hpp, ErlOffHeap *off_heap)
{
    if (is_list(obj) && is_immed(CAR(list_val(obj)))) {
        /* copy_struct() would put this last, but we need the top term to be first in block */
        Eterm* src = list_val(obj);
        Eterm* dst = *hpp;

        CAR(dst) = CAR(src);
        *hpp += 2;
        CDR(dst) = copy_struct(CDR(src), sz-2, hpp, off_heap);
        return make_list(dst);
    }
    else {
        return copy_struct(obj, sz, hpp, off_heap);
    }
}

There are other cases in copy_struct() where things are put last in the block (I searched for hbot), but they are all leaf terms which means they will also be the only term and therefor also be first in the block (I think).

sverker avatar Oct 03 '22 21:10 sverker

Thanks, I tried and it seems to work! I'll do more testing and benchmarking and comment again.

RobinMorisset avatar Oct 04 '22 09:10 RobinMorisset

@RobinMorisset ping

sverker avatar Nov 18 '22 17:11 sverker

@RobinMorisset Any change you could finish this soon? It would be preferable to include it in the first release candidate (with deadline February 10).

bjorng avatar Jan 31 '23 13:01 bjorng

Fixed bug causing failed debug assert in ets_SUITE:lookup_element_mult.

sverker avatar Feb 09 '23 19:02 sverker

@RobinMorisset I've fixed some problems with ets:update_element and the empty tuple (which is always a literal). Now I'm satisfied and will let it run at least one night before squash and merge.

sverker avatar Feb 21 '23 14:02 sverker

@sverker Thank you very much for this, and sorry for dropping the ball on this PR. I had been investigating a bug found in it by our internal tests for the past few days, and it turned out to be caused by the empty tuple bug that you just fixed.

RobinMorisset avatar Feb 21 '23 17:02 RobinMorisset