otp Use in-memory atom ordering for map ordering

Use in-memory atom ordering for map ordering

Open josevalim opened this issue 1 year ago • 22 comments

This should optimize all operations on small maps that rely on ordering. As an example, creating a map with 32 keys named 'atom-INDEX', where INDEX is an integer, is an order of magnitude faster with this patch.

Before:

from_list      624.32 K        1.60 μs   ±723.44%        1.54 μs        1.71 μs

After:

from_list        6.23 M      160.47 ns ±15898.23%         125 ns         167 ns

It has been verified that the distribution and ETS will respect the atom ordering of the system and therefore are backwards compatible.

A potential downside is that the order of atom fields in maps is no longer deterministic, but that's already the case for large maps, and having non-deterministic atom ordering may help avoid code that accidentally relies on order.

This patch also opens up the possibility for more efficient map lookups in the future.

Jul 13 '22 14:07 josevalim

I am pretty sure CI will break, but I am sending a PR to assess how big the damage will be. :)

Jul 13 '22 14:07 josevalim

CT Test Results

      5 files   262 suites 2h 16m 48s :stopwatch: 4 519 tests 4 199 :heavy_check_mark: 317 :zzz: 3 :x: 5 448 runs 5 073 :heavy_check_mark: 372 :zzz: 3 :x:

For more details on these failures, see this check.

Results for commit 54337ba0.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Jul 13 '22 14:07 github-actions[bot]

Could this be a emulator runtime or compile time flag? I can think of a few cases where the order of atoms is very much appreciated, mostly in tooling, such as not having to manually override print functions of maps/structs (members automatically sorted) when inspecting them in logs/console.

Also would this affect non-atom keyed maps or mix of key type maps?

and having non-deterministic atom ordering may help avoid code that accidentally relies on order.

Well it wont help if you know your never expecting a map larger than a certain size. Say your parsing protobuf->map you know the packet will only ever have 4 fields based on the protodefs, and reading the packetlog with the fields always in different places wont work for debugging, so youl have to write a custom print function. Again its kind of a time saver for an untyped language, as it should be, poc and iterate asap.

Jul 13 '22 14:07 vans163

Also would this affect non-atom keyed maps or mix of key type maps?

No, it should not.

Well it wont help if you know your never expecting a map larger than a certain size.

Future Erlang/OTP versions can reduce the size or even change the small map implementation altogether to one that does not rely on ordering. So I can understand the benefit of ordering for debugging/printing, but for anything else you should not rely on it.

For printing, however, maybe we should consider looking for solutions that work both for small and large maps alike. For example, maybe we could have a flag on io:format/2 that orders the keys before-hand.

Jul 13 '22 14:07 josevalim

Summary of failures:

1 print related failure in stdlib
3 print related failures in dialyzer
1 unknown reason failure in ssh
1 unknown reason failure in kernel
1 regression in emulator (deterministic in term_to_binary needs to sort)

So it seems to have minimal impact so far.

Jul 13 '22 16:07 josevalim

Future Erlang/OTP versions can reduce the size or even change the small map implementation altogether to one that does not rely on ordering. So I can understand the benefit of ordering for debugging/printing, but for anything else you should not rely on it.

Indeed, its more of a convenience.

For printing, however, maybe we should consider looking for solutions that work both for small and large maps alike. For example, maybe we could have a flag on io:format/2 that orders the keys before-hand.

Yea that would work well, then the defimpl Inspect can be overwritten with said io:format flag so the REPL print shows it ordered too (on Elixir side, not sure how this is done on Erlang side).

The speed difference is very significant.

Would this affect the compare implementation between 2 atoms across the whole emu tho, beyond just maps?

Jul 13 '22 21:07 vans163

A potential downside is that the order of atom fields in maps is no longer deterministic, but that's already the case for large maps, and having non-deterministic atom ordering may help avoid code that accidentally relies on order.

Maybe I've misunderstood but is this patch generating code that breaks deterministic builds?

If it is, do you see a way to avoid such an impact, as the project is trying to avoid sources of non-determinism for deterministic builds.

Jul 14 '22 20:07 JeromeDeBretagne

@JeromeDeBretagne it should not break deterministic builds because the compiler should already take care of writing maps to BEAM files in a format that does not depend on the map representation (given large maps already have no ordering). I suspect this why I didn't see any failures related to this (only 7 failures altogether, which is quite impressive given OTP test suite size).

Jul 14 '22 20:07 josevalim

Thanks for your pull request.

I think that the OTP Technical Board will have to decide whether this change is acceptable from a backward-compatibility perspective. There could be applications or libraries that depend on maps:to_list(SmallMap) always returning a sorted list. I have no idea common that could be.

Regarding printing, I think that the easiest way to handle it is to modify io:format() to always sort maps when printing them. I am not sure whether we need an option to print maps in the raw order. The OTB will have to decide on how to handle printing, too.

There is one bug introduced by this PR, namely that term_to_binary(SmallMapWithAtomKeys, [deterministic]) is broken. That bug will also make BEAM files non-deterministic because the compiler uses term_to_binary/2 with the deterministic option.

Your PR got me thinking about alternative ways to speed up comparisons of atoms. More about my ideas in the next comment.

Jul 27 '22 10:07 bjorng

Here is the code that compares atoms:

https://github.com/erlang/otp/blob/93f8daa84b351e1ba4f135bec1b7803004dd3a34/erts/emulator/beam/erl_utils.h#L188-L215

If the first 4 characters (more correct: 3 characters and 7 bits) in the atoms to be compared are different, the comparison can be done relatively cheap without calling memcmp(). But if they are equal, memcmp() need to be called.

If the keys are all atoms named 'atom-INDEX' as in your commit message, memcmp() will always have to be called. One idea for optimisations would be store the 8 first characters (strictly speaking: 7 characters and 7 bits) for each atom in the atom entry. For many atoms named 'atom-INDEX', memcmp() would not have to be called. The cost for that would be 4 more bytes for each atom. (On a 32-bit machine, I suggest keeping the current implementation.)

I have not done any measurements to find out how much faster that would be compared to the current implementation.

Jul 27 '22 11:07 bjorng

Thank you @bjorng!

Comments related to the board discussion:

There could be applications or libraries that depend on maps:to_list(SmallMap) always returning a sorted list. I have no idea common that could be.

To play devil's advocate: we may have other reasons to change how SmallMaps are implemented in the future, so the sooner we make it unordered, theoretically the better. Although I understand even then it may be better to just wait if the need arises (it might never!).

Regarding printing, I think that the easiest way to handle it is to modify io:format() to always sort maps when printing them.

My concern with sorting on printing is when you convert it to list or iterate it, you will see a different order. :(

Comments about optimizations:

If the keys are all atoms named 'atom-INDEX' as in your commit message, memcmp() will always have to be called. One idea for optimisations would be store the 8 first characters (strictly speaking: 7 characters and 7 bits) for each atom in the atom entry.

Oh, interesting. So I accidentally triggered a bad path!

I also wonder if we could further optimize ASCII atoms? Use 1 bit to denote if ASCII or not. If ASCII, then you can pack the double amount of characters? The downside is that comparing an ASCII atom with a non-ASCII atom will always trigger the slow path but I am not sure how common that is.

Jul 27 '22 11:07 josevalim

I also wonder if we could further optimize ASCII atoms? Use 1 bit to denote if ASCII or not. If ASCII, then you can pack the double amount of characters?

No, you can't gain that much. Atoms are stored in UTF8, which means that ASCII characters (code points 0 through 127) are already stored in only one byte. You could do some tricks such as only storing 7 bits from each ASCII character and fit 9 characters into 63 bits, or even more by subtracting off 32 from each character and multiplying by 96 instead of left-shifting 7 positions. I doubt that the more complicated code is worth it.

(I will take two days off and be back on Monday.)

Jul 27 '22 14:07 bjorng

@josevalim Could you modify your benchmark to only use unique atoms as maps keys?

That could give us some idea of how expensive the call to strcmp() is and whether we should optimise map comparisons in the way I suggested.

Aug 15 '22 11:08 bjorng

Here are the results.

I have done two benchmarks:

Using the fields defined in the Plug struct in Elixir

fields = [
  __struct__: Plug.Conn.Struct,
  adapter: {Plug.MissingAdapter, nil},
  assigns: %{},
  body_params: %{aspect: :body_params},
  cookies: %{aspect: :cookies},
  halted: false,
  host: "www.example.com",
  method: "GET",
  owner: nil,
  params: %{aspect: :params},
  path_info: [],
  path_params: %{},
  port: 0,
  private: %{},
  query_params: %{aspect: :query_params},
  query_string: "",
  remote_ip: nil,
  req_cookies: %{aspect: :cookies},
  req_headers: [],
  request_path: "",
  resp_body: nil,
  resp_cookies: %{},
  resp_headers: [{"cache-control", "max-age=0, private, must-revalidate"}],
  scheme: :http,
  script_name: [],
  secret_key_base: nil,
  state: :unset,
  status: nil
]

Generating the fields names randomly using Crypto

fields = for _

I measured three different cases:

sorted - all of the fields are sorted as they are meant to be laid out in the map
reverse - the reverse of the above
random - random order

Here are the results:

# This Pull Request

## Plug optimized

Name                        ips        average  deviation         median         99th %
from_list_sorted         6.63 M      150.77 ns ±16517.78%         125 ns         166 ns
from_list_random         3.54 M      282.56 ns  ±5301.64%         250 ns         333 ns
from_list_reverse        2.31 M      433.78 ns  ±4077.19%         375 ns         500 ns

## Crypto

Name                        ips        average  deviation         median         99th %
from_list_sorted         5.15 M      194.13 ns ±11840.57%         125 ns        1041 ns
from_list_random         4.65 M      215.27 ns ±15052.50%         125 ns        1042 ns
from_list_reverse        1.81 M      551.51 ns  ±3062.57%         500 ns         625 ns

# maint/master

## Plug

Name                        ips        average  deviation         median         99th %
from_list_sorted         6.38 M      156.69 ns ±13300.95%         125 ns         167 ns
from_list_random         1.62 M      616.90 ns  ±3391.64%         542 ns         667 ns
from_list_reverse        1.01 M      991.50 ns  ±1782.21%         917 ns        1084 ns

## Crypto

Name                        ips        average  deviation         median         99th %
from_list_sorted         5.81 M      172.04 ns ±16012.36%         125 ns         167 ns
from_list_random         1.54 M      649.63 ns  ±2985.03%         583 ns         709 ns
from_list_reverse        0.82 M     1223.81 ns  ±1056.50%        1167 ns        1375 ns

My TL;DR - in practice this new implementation is 2-3x faster.

Note the most efficient approach is to give sorted fields. However, giving sorted fields is harder to do with this PR. At the same time, I am not quite sure if always giving sorted fields is practical for most use cases.

Also note other operations could benefit from in-memory ordering, it may be worth exploring those too (suggestions welcome).

Aug 15 '22 13:08 josevalim

Additional results using completely random keys of 8 elements each:

maps:merge/2 of two maps with 16 distinct keys each is roughly 8% faster
maps:merge/2 of two maps of 32 equal keys is roughly 2.1x faster

Aug 15 '22 14:08 josevalim

Some words of warning. Even though the strict ordering we internally use for small map keys is not defined as an iteration order, it does "leak out" as the defined order when comparing maps. For example:

true = #{1 => v} < {1.0 => v}.
true = #{a1 => v} < #{a2 => v}.

These are both defined because 1 :< 1.0 and a1 :< a2. Where :< is the non existing operator for the defined strict term order used only to order map keys.

I first thought this PR would break this map ordering, but it seems not to, as the same atom ordering code happens to be duplicated in several places.

Aug 15 '22 18:08 sverker

I was wrong, this PR does actually break map comparison:

1> {a1,a2,a3}.                                 
{a1,a2,a3}
2> a1 < a2.
true
3> a2 < a3.
true
4> #{a1 => v, true => v} < #{a2 => v, a3 => v}.
false

true has a low index and will be sorted as first, and then compared alphabetically larger than a2.

Aug 15 '22 22:08 sverker

Perfect. We will need to sort maps before </> then, which will definitely make it slower.

Aug 16 '22 06:08 josevalim

It will be slower, but can actually be done O(n) with the same algorithm as for hashmaps where the keys are ordered by hash value.

https://github.com/erlang/otp/blob/master/erts/emulator/beam/utils.c#L1794

It's more comparisons and we can't bail out at first found diff.

Aug 16 '22 10:08 sverker

OTP Technical Board says:

We want this to boosts performance of frequent small map atom key comparisons at the expense of much less common ordering of entire maps. maps:to_list and other iterations should use internal "arbitrary" order. Printing should also be done in internal order, but we propose an io:format modifier to print maps sorted.

io:format("~kp", [Map]).

This PR.

Must implement correct small map comparison, probably using same O(n) algorithm as is used for large maps.
Must support three different ways of ordering terms inside the emulator and call them correctly.

Documented normal Erlang term order.
Documented map key order where integers are less than floats.
Internal small map key order. Like (2) but atoms are ordered by index.

Sep 20 '22 18:09 sverker

Thank you! I will probably be able to carve time to complete this in 3 months or so. Therefore, if someone wants to take this from my hands and move forward, please go ahead. :) Otherwise I will get to it eventually.

Sep 20 '22 18:09 josevalim

Amendment: It's probably enough to only order top level atoms by index, like this PR does. That will make the code less complex I think.

Sep 20 '22 18:09 sverker

To be continued. Get the map comparison correct as documented.

Dec 15 '22 23:12 sverker

Forced pushed a working version rebased on a fresh master.

Jan 13 '23 22:01 sverker

Squashed into one commit to soon be merged to master.

Jan 24 '23 12:01 sverker

Merged to master. Thanks @josevalim for the initial idea, commit and benchmarks.

To be continued in #6718 to restore some order.

Jan 24 '23 17:01 sverker

Thank you for doing all of the heavy lifting!

Jan 24 '23 17:01 josevalim

otp otp copied to clipboard

Use in-memory atom ordering for map ordering

CT Test Results

Artifacts

otp
otp copied to clipboard