Optimize the copy that occurs in ets:lookup_element
@michalmuskala pointed out to me that sometimes ets:lookup_element can be slower than ets:lookup, even when the value is found. For example:
```
$ erlperf "ets:lookup(ac_tab, {env, kernel, logger})." "ets:lookup_element(ac_tab, {env, kernel, logger}, 2)."
Code                                                   ||      QPS    Time   Rel
ets:lookup(ac_tab, {env, kernel, logger}).              1  2859 Ki  349 ns  100%
ets:lookup_element(ac_tab, {env, kernel, logger}, 2).   1  2164 Ki  462 ns   75%
```
The reason for this is that lookup can find the size of the needed allocation directly in the DbTerm and then copy the whole thing efficiently with copy_shallow_x, whereas lookup_element must first walk the entire value to be copied in size_object_x, and then walk it again in copy_struct_x.
This patch fixes this in three steps:
- Make db_store_term (used by ets:insert among others) use a specialized version of copy_struct that lays the fields out in order (instead of interspersing them chaotically, as the standard copy does)
- Finding the size of the needed allocation can now be done efficiently (see db_field_size)
- And we can use copy_shallow_x for the copy, just like lookup (with a small change so that it does not assume it is copying a tuple).
Result:
```
$ PATH=~/otp/bin:$PATH erlperf "ets:lookup(ac_tab, {env, kernel, logger})." "ets:lookup_element(ac_tab, {env, kernel, logger}, 2)."
Code                                                   ||      QPS    Time   Rel
ets:lookup_element(ac_tab, {env, kernel, logger}, 2).   1  2945 Ki  339 ns  100%
ets:lookup(ac_tab, {env, kernel, logger}).              1  2860 Ki  349 ns   97%
```
I've tried measuring the potential impact on ets:insert performance by looking at the two throughput benchmarks in ets_SUITE. More specifically, I looked at the 50% insert / 50% delete mix (since it maximizes the impact of insert) and at the one-process case (since my change should not affect concurrency in any way). Here are the results for the long benchmark, with three runs in each configuration, for all three kinds of table tested:
[set, {write_concurrency, auto}, {read_concurrency, true}]:
- Baseline: 3.55 / 3.31 / 2.85
- This patch: 3.20 / 3.08 / 3.44
[set, {write_concurrency, true}, {read_concurrency, true}]:
- Baseline: 3.31 / 3.11 / 2.78
- This patch: 3.13 / 3.08 / 3.42
[ordered_set, {write_concurrency, true}, {read_concurrency, true}]:
- Baseline: 1.51 / 1.56 / 1.17
- This patch: 1.50 / 1.37 / 1.59
(all running on my M1 MBP)
There is no obvious performance regression, but the noise is sufficiently high that I may have missed one. If anyone has a suggestion for how to get trustworthy numbers for the impact on insert, I'd be happy to hear it.
CT Test Results
4 files, 210 suites, 1h 26m 36s :stopwatch:
3 284 tests: 3 188 :heavy_check_mark:, 96 :zzz:, 0 :x:
4 011 runs: 3 893 :heavy_check_mark:, 118 :zzz:, 0 :x:
Results for commit 3dbb7291.
// Erlang/OTP Github Action Bot
Thanks for the contribution.
I pushed a memory leak fix in copy_struct_in_order_x to your copy branch.
Question: Did you contemplate looping over copy_struct for each element instead of implementing copy_struct_in_order_x?
Thank you for catching the memory leak, I'm sorry for missing it.
I did not think to just loop over copy_struct_x; I agree that it sounds much simpler. I'll try that right now.
I just ran ets_SUITE with a debug-compiled emulator:

```shell
cd $ERL_TOP
(cd erts/emulator && make debug)
bin/cerl -debug
```
For more development tips and tools see https://www.erlang.org/doc/tutorial/debugging.html
It turns out that just calling copy_struct for each field of the tuple is actually very tricky.
The problem I found is lists: copy_struct does not guarantee that the first cons cell of a list is placed at the beginning of the space allocated for the field. For example, the tuple {x, "AA"} has the following representation if I store it in ETS with repeated calls to copy_struct:
```
OTHER  at 0x1465ad398, 0x80        (tuple header, arity = 2)
IMMED1 at 0x1465ad3a0, 628811      (atom x)
LIST   at 0x1465ad3a8, 0x1465ad3c0 (first cons)
IMMED1 at 0x1465ad3b0, 1055        (A)
IMMED1 at 0x1465ad3b8, 59          (NIL)
IMMED1 at 0x1465ad3c0, 1055        (A)
LIST   at 0x1465ad3c8, 0x1465ad3b0 (second cons)
```
Note that obj->tpl[2] does point to "AA", but it starts near the end of its allocated space (0x1465ad3c0) rather than the beginning (0x1465ad3b0). This results in db_field_size computing the wrong size for the field, so only part of it gets copied. I don't see how to fix this; the whole patch relies on two invariants:
- each field is allocated as a contiguous block
- the top-level tuple holds pointers to the beginning of each block
I'll document it in an updated commit message/comments.
As for performance, I've figured how to do benchmarking on our internal workload, I'll use it to verify that my original approach does not add a huge cost to ets:insert.
Oh, I didn't think about the putting-things-last optimization in copy_struct. Hm....
I'm sorry if I remain skeptical of your quite impressive copy_struct_in_order_x(), but as a maintainer I'm a bit of a stickler about keeping the code volume down if possible.
How about something like this, where we only give special treatment to the list case and delegate the rest to standard copy_struct:
```c
Eterm copy_ets_element(Eterm obj, int sz, Eterm **hpp, ErlOffHeap *off_heap)
{
    if (is_list(obj) && is_immed(CAR(list_val(obj)))) {
        /* copy_struct() would put this last, but we need the top
         * term to be first in the block */
        Eterm* src = list_val(obj);
        Eterm* dst = *hpp;
        CAR(dst) = CAR(src);
        *hpp += 2;
        CDR(dst) = copy_struct(CDR(src), sz-2, hpp, off_heap);
        return make_list(dst);
    }
    else {
        return copy_struct(obj, sz, hpp, off_heap);
    }
}
```
There are other cases in copy_struct() where things are put last in the block (I searched for hbot), but they are all leaf terms, which means they will also be the only term and therefore also be first in the block (I think).
Thanks, I tried and it seems to work! I'll do more testing and benchmarking and comment again.
@RobinMorisset ping
@RobinMorisset Any chance you could finish this soon? It would be preferable to include it in the first release candidate (with deadline February 10).
Fixed a bug causing a failed debug assert in ets_SUITE:lookup_element_mult.
@RobinMorisset I've fixed some problems with ets:update_element and the empty tuple (which is always a literal).
Now I'm satisfied and will let it run at least one night before squash and merge.
@sverker Thank you very much for this, and sorry for dropping the ball on this PR. I had been investigating a bug found in it by our internal tests for the past few days, and it turned out to be caused by the empty tuple bug that you just fixed.