micropython icon indicating copy to clipboard operation
micropython copied to clipboard

py/obj.h: Optimise mp_obj_type_t storage (v2)

Open jimmo opened this issue 3 years ago • 31 comments

This is v2 of https://github.com/micropython/micropython/pull/7542, which itself is based on https://github.com/adafruit/circuitpython/pull/4903

This version only includes the "slot" approach, and replaces the existing sparse representation. The basic idea is that where the mp_obj_type_t struct used to have 12 pointer fields, it now has 12 uint8_t indices, and a variable-length array of pointers. So in the best case (no fields used) we save 12x4-12=36 bytes and in the common case (three fields used) we save 9x4-12=24 bytes. See #7542 and the spreadsheet linked there for more analysis.

The size difference is:

   bare-arm: -1204 -2.092% 
minimal x86: -3101 -1.885% [incl -3488(data)]
   unix x64: -5712 -1.093% [incl -6432(data)]
unix nanbox: -2836 -0.618% [incl -3392(data)]
      stm32: -2896 -0.732% PYBV10
     cc3200: -2104 -1.143% 
      esp32: -2280 -0.151% GENERIC[incl -3440(data)]
        nrf: -2276 -1.225% pca10040
        rp2: -2432 -0.480% PICO
       samd: -1660 -1.181% ADAFRUIT_ITSYBITSY_M4_EXPRESS

As discussed in #7542, with some changes to objtype.c:mp_obj_class_lookup we can reduce the slot size to 4-bits, and save a further 6 bytes per object.

The performance change on PYBV11 is

bm_chaos.py                    364.78 ->     354.80 :      -9.98 =  -2.736% (+/-0.01%)
bm_fannkuch.py                  77.61 ->      75.87 :      -1.74 =  -2.242% (+/-0.00%)
bm_fft.py                     2569.63 ->    2525.30 :     -44.33 =  -1.725% (+/-0.00%)
bm_float.py                   5954.45 ->    5750.53 :    -203.92 =  -3.425% (+/-0.02%)
bm_hexiom.py                    43.52 ->      43.14 :      -0.38 =  -0.873% (+/-0.01%)
bm_nqueens.py                 4404.68 ->    4356.57 :     -48.11 =  -1.092% (+/-0.00%)
bm_pidigits.py                 630.48 ->     630.86 :      +0.38 =  +0.060% (+/-0.23%)
misc_aes.py                    413.58 ->     406.96 :      -6.62 =  -1.601% (+/-0.00%)
misc_mandel.py                3536.21 ->    3446.73 :     -89.48 =  -2.530% (+/-0.01%)
misc_pystone.py               2337.25 ->    2325.08 :     -12.17 =  -0.521% (+/-0.01%)
misc_raytrace.py               377.76 ->     367.50 :     -10.26 =  -2.716% (+/-0.01%)

and Unix is

bm_chaos.py                   6075.93 ->    6132.92 :     +56.99 =  +0.938% (+/-3.40%)
bm_fannkuch.py                  26.53 ->      26.76 :      +0.23 =  +0.867% (+/-1.06%)
bm_fft.py                    29450.94 ->   30485.22 :   +1034.28 =  +3.512% (+/-1.26%)
bm_float.py                  66775.28 ->   68260.69 :   +1485.41 =  +2.224% (+/-3.14%)
bm_hexiom.py                   187.63 ->     186.86 :      -0.77 =  -0.410% (+/-0.70%)
bm_nqueens.py                97709.83 ->   97778.66 :     +68.83 =  +0.070% (+/-0.95%)
bm_pidigits.py                9038.39 ->    9061.83 :     +23.44 =  +0.259% (+/-0.30%)
misc_aes.py                    119.79 ->     119.91 :      +0.12 =  +0.100% (+/-3.24%)
misc_mandel.py               72987.13 ->   74526.80 :   +1539.67 =  +2.110% (+/-2.55%)
misc_pystone.py              26630.69 ->   26863.56 :    +232.87 =  +0.874% (+/-2.83%)
misc_raytrace.py             10729.21 ->   10660.26 :     -68.95 =  -0.643% (+/-1.12%)

This is marginally worse than the initial performance results I reported in #7542 (same test setup, but comparing to a different baseline). I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

I have split the commits more than necessary, but in some cases just to make it easier to review. For example, I think the last two (1fe8dbd and a8cd503) should be combined, as well as combining 9dcd6c8, a30e9ef, a2dec50.

jimmo avatar Jun 24 '22 11:06 jimmo

I'm not sure this is worth making configurable because it forms part of the .mpy ABI.

Does the .mpy version need to be bumped with this change?

dpgeorge avatar Jun 24 '22 12:06 dpgeorge

Rebased.

Current performance diff on PYBV11 is

N=100 M=100                /home/jimmo/obj-type-v2-rebase-pybv11-baseline -> /home/jimmo/obj-type-v2-rebase-pybv11-def0         diff      diff% (error%)
bm_chaos.py                    350.72 ->     346.16 :      -4.56 =  -1.300% (+/-0.00%)
bm_fannkuch.py                  76.62 ->      74.92 :      -1.70 =  -2.219% (+/-0.01%)
bm_fft.py                     2327.18 ->    2298.25 :     -28.93 =  -1.243% (+/-0.00%)
bm_float.py                   5680.06 ->    5634.47 :     -45.59 =  -0.803% (+/-0.01%)
bm_hexiom.py                    47.20 ->      46.26 :      -0.94 =  -1.992% (+/-0.00%)
bm_nqueens.py                 4345.23 ->    4246.35 :     -98.88 =  -2.276% (+/-0.01%)
bm_pidigits.py                 625.44 ->     622.29 :      -3.15 =  -0.504% (+/-0.20%)
misc_aes.py                    409.71 ->     406.47 :      -3.24 =  -0.791% (+/-0.01%)
misc_mandel.py                3083.09 ->    3036.62 :     -46.47 =  -1.507% (+/-0.01%)
misc_pystone.py               2419.22 ->    2404.86 :     -14.36 =  -0.594% (+/-0.01%)
misc_raytrace.py               358.06 ->     353.56 :      -4.50 =  -1.257% (+/-0.01%)

and the code size diff on PYBV11 is -2704.

I updated objtype.c and objnamedtuple.c to no longer always use the full 12-slot version, such that instance classes now still fit in 4 GC words (as long as the base doesn't have a protocol), and named tuple comes out smaller than it used to be.

Only TODO now is a way to split the "supported native ABI version from the supported bytecode ABI version". Current idea is that to run a native .mpy (i.e. one with an arch set) you need to exact match, whereas to run bytecode .mpy you need "at least this version". So this PR will move the current version to 7, but keeping the minimum at 6.

jimmo avatar Sep 13 '22 15:09 jimmo

OK I think this is finally ready.

Because this will be a breaking change for people with custom types (e.g. user c modules), I thought that #9302 might be a nice way to help people update.

Does the .mpy version need to be bumped with this change?

See #9303

jimmo avatar Sep 14 '22 02:09 jimmo

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

dlech avatar Sep 15 '22 16:09 dlech

Have you done any profiling to see which slots are accessed the most?

That's a good question... All the analysis I did (see #7542) was entirely on code size and so only focused on which slots are used for given types, not how frequently they're accessed.

For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently.

locals_dict is only used in 66 (of 138) types defined in the PYBV11 firmware. But it's accessed very frequently.

The other problem is that the 12 slot indices work out perfectly to 3 words. Having one fewer slot wouldn't decrease the size of the index "table" while still adding a word to every type that doesn't use locals_dict. So +288 bytes. However this would be offset by replacing the MP_OBJ_TYPE_{GET,HAS}_SLOT invocations with direct access to locals_dict in about 10 places.

It's probably worth doing the experiment.

jimmo avatar Sep 15 '22 23:09 jimmo

Have you done any profiling to see which slots are accessed the most? For example, I'm wondering if making locals_dict not a slot would improve performance since it is used in nearly every type and is accessed frequently

This is worth looking in to, but I suggest to get this PR in as-is first, to act as a baseline.

dpgeorge avatar Sep 16 '22 00:09 dpgeorge

@dlech I tested this and the result was +232 bytes (+288 bytes as predicted for adding the slot and changing the definitions, then -56 bytes to use type->locals_dict directly) and mixed results on the performance, some tests improved slightly, some got worse. I suspect most of this is driven by changes in the code layout.

Here's the performance diff from the current PR.

$ ./run-perfbench.py -s ~/obj-type-v2-rebase2-pybv11-pr ~/obj-type-v2-rebase2-pybv11-locals 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/obj-type-v2-rebase2-pybv11-pr -> /home/jimmo/obj-type-v2-rebase2-pybv11-locals         diff      diff% (error%)
bm_chaos.py                    348.30 ->     347.37 :      -0.93 =  -0.267% (+/-0.01%)
bm_fannkuch.py                  74.95 ->      74.73 :      -0.22 =  -0.294% (+/-0.00%)
bm_fft.py                     2293.41 ->    2308.07 :     +14.66 =  +0.639% (+/-0.00%)
bm_float.py                   5660.72 ->    5635.56 :     -25.16 =  -0.444% (+/-0.00%)
bm_hexiom.py                    46.71 ->      46.16 :      -0.55 =  -1.177% (+/-0.00%)
bm_nqueens.py                 4262.27 ->    4255.89 :      -6.38 =  -0.150% (+/-0.00%)
bm_pidigits.py                 622.26 ->     620.58 :      -1.68 =  -0.270% (+/-0.24%)
misc_aes.py                    405.77 ->     405.82 :      +0.05 =  +0.012% (+/-0.00%)
misc_mandel.py                3035.08 ->    3065.90 :     +30.82 =  +1.015% (+/-0.01%)
misc_pystone.py               2401.84 ->    2386.13 :     -15.71 =  -0.654% (+/-0.01%)
misc_raytrace.py               356.15 ->     355.20 :      -0.95 =  -0.267% (+/-0.00%)

Damien and I talked about using some of the saved bytes from this PR to instead enable -O2 in more places, which preliminary testing shows a worthwhile improvement.

jimmo avatar Sep 16 '22 04:09 jimmo

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

stinos avatar Sep 16 '22 06:09 stinos

From a practical point of view: could the relevant commit, or otherwise some explanation here, include the basic instructions needed to convert from the current implementation to the new one with slots? It's pretty clear from this PR's diff but would be helpful nonetheless.

Yes, for sure, once the details are finalised. (We're still experimenting with another idea which might change the macro slightly).

I am also implementing https://github.com/micropython/micropython/pull/9302 (and the corresponding update to the wiki) so that people who aren't following the GitHub closely can see instructions too.

jimmo avatar Sep 16 '22 08:09 jimmo

Updated with two additional changes:

Merged the getiter and iternext slots. On PYBV11:

  • 22 types use mp_identity_getiter, mp_stream_unbuffered_iter iternext
  • 19 types use mp_identity_getiter, custom iternext
  • 22 type use custom getiter, no iternext
  • 2 types use custom getiter, custom iternext This allows about 40 slots to be removed (and handled as flags instead).

Used the now freed up slot to convert make_new to a slot. This has a marginal code size improvement in the type definitions because only about 30 types on PYBV11 don't have a make_new, and the extra code to access make_new as a slot almost cancels it out. However it makes the definition of object types more consistent because make_new works the same way as the other fields.

This is overall worth another -184 bytes on PYBV11.

Overall performance on PYBV11:

$ ./run-perfbench.py -s ~/obj-type-v2-rebase2-pybv11-baseline ~/obj-type-v2-rebase2-pybv11-makenew-z
diff of scores (higher is better)
N=100 M=100                /home/jimmo/obj-type-v2-rebase2-pybv11-baseline -> /home/jimmo/obj-type-v2-rebase2-pybv11-makenew-z         diff      diff% (error%)
bm_chaos.py                    349.91 ->     352.46 :      +2.55 =  +0.729% (+/-0.00%)
bm_fannkuch.py                  76.64 ->      74.54 :      -2.10 =  -2.740% (+/-0.01%)
bm_fft.py                     2331.87 ->    2292.75 :     -39.12 =  -1.678% (+/-0.00%)
bm_float.py                   5672.31 ->    5702.20 :     +29.89 =  +0.527% (+/-0.02%)
bm_hexiom.py                    47.34 ->      46.13 :      -1.21 =  -2.556% (+/-0.00%)
bm_nqueens.py                 4348.54 ->    4215.16 :    -133.38 =  -3.067% (+/-0.00%)
bm_pidigits.py                 624.95 ->     621.48 :      -3.47 =  -0.555% (+/-0.19%)
misc_aes.py                    409.04 ->     406.10 :      -2.94 =  -0.719% (+/-0.01%)
misc_mandel.py                3084.99 ->    3054.89 :     -30.10 =  -0.976% (+/-0.01%)
misc_pystone.py               2416.89 ->    2421.10 :      +4.21 =  +0.174% (+/-0.01%)
misc_raytrace.py               356.58 ->     361.50 :      +4.92 =  +1.380% (+/-0.00%)

And rp2:

$ ./run-perfbench.py -s ~/obj-type-v2-rebase2-rp2-baseline ~/obj-type-v2-rebase2-rp2-makenew-z
diff of scores (higher is better)
N=100 M=100                /home/jimmo/obj-type-v2-rebase2-rp2-baseline -> /home/jimmo/obj-type-v2-rebase2-rp2-makenew-z         diff      diff% (error%)
bm_chaos.py                    198.13 ->     202.40 :      +4.27 =  +2.155% (+/-0.06%)
bm_fannkuch.py                  54.99 ->      52.24 :      -2.75 =  -5.001% (+/-0.02%)
bm_fft.py                     1503.69 ->    1499.36 :      -4.33 =  -0.288% (+/-0.01%)
bm_float.py                   3474.02 ->    3668.16 :    +194.14 =  +5.588% (+/-0.11%)
bm_hexiom.py                    25.13 ->      26.56 :      +1.43 =  +5.690% (+/-0.04%)
bm_nqueens.py                 2535.35 ->    2626.00 :     +90.65 =  +3.575% (+/-0.04%)
bm_pidigits.py                 411.08 ->     402.21 :      -8.87 =  -2.158% (+/-0.02%)
misc_aes.py                    281.69 ->     276.64 :      -5.05 =  -1.793% (+/-0.06%)
misc_mandel.py                1772.39 ->    1819.05 :     +46.66 =  +2.633% (+/-0.04%)
misc_pystone.py               1408.17 ->    1465.57 :     +57.40 =  +4.076% (+/-0.16%)
misc_raytrace.py               211.03 ->     209.02 :      -2.01 =  -0.952% (+/-0.06%)

jimmo avatar Sep 17 '22 17:09 jimmo

Codecov Report

Merging #8813 (d0253d0) into master (0e8c220) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #8813   +/-   ##
=======================================
  Coverage   98.38%   98.38%           
=======================================
  Files         156      156           
  Lines       20477    20496   +19     
=======================================
+ Hits        20147    20166   +19     
  Misses        330      330           
Impacted Files Coverage Δ
extmod/machine_mem.c 25.00% <ø> (ø)
extmod/machine_pinbase.c 100.00% <ø> (ø)
extmod/machine_signal.c 92.10% <ø> (ø)
extmod/moduasyncio.c 99.20% <ø> (ø)
extmod/moducryptolib.c 91.30% <ø> (ø)
extmod/moductypes.c 97.84% <ø> (ø)
extmod/moduhashlib.c 100.00% <ø> (ø)
extmod/modussl_mbedtls.c 86.06% <ø> (ø)
extmod/modutimeq.c 100.00% <ø> (ø)
extmod/moduwebsocket.c 90.47% <ø> (ø)
... and 60 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar Sep 17 '22 19:09 codecov-commenter

Updated.

jimmo avatar Sep 19 '22 06:09 jimmo

Merged in fb2a57800acffd811d05373dcf63e95b4048d0c6 through b41aaaa8a918a6645ebc6bfa4483bd17286f9263

Thank you very much! A lot of work went into this.

dpgeorge avatar Sep 19 '22 09:09 dpgeorge

@jimmo This is surely quite a piece of work. For me it creates a mixed feeling, because that PR requires a lot of work to be done at the PR's I have open to make it compatible again. Especially for such large ones like the SAMD PR and the smaller ones for mimxrt. Not to mention the Winner W600 port, which I try to keep updated. Not your task. You adapted the existing ports, but can hardly change all open PRs. That's to be done by the people who make the PR. So, could you give some guidance an explanation, how this adaption must be done. It seems to be focused mainly on the type declaration. As an example, how would the type below be changed?

const mp_obj_type_t machine_encoder_type = {
    { &mp_type_type },
    .name = MP_QSTR_Encoder,
    .print = mp_machine_encoder_print,
    .make_new = mp_machine_encoder_make_new,
    .locals_dict = (mp_obj_dict_t *)&machine_encoder_locals_dict,
};

I imagine the style, but a few explaining words would help.

robert-hh avatar Sep 19 '22 10:09 robert-hh

@robert-hh I promised @stinos too that I would write some instructions. Haven't had a chance to come back to this since the PR was finalised earlier this arvo. (At the time that @stinos asked, we hadn't decided on the getiter and make_new changes).

Earlier today I wrote this -- https://github.com/micropython/micropython/wiki/Build-Troubleshooting#define_const_obj_type This is the page you now get a link to when you have a compile error (see #9302).

But there is more to add, and I will add some quick notes now, and then will write some more hopefully tonight otherwise tomorrow. In particular I need to write more context about why the change needs to be made and what the macro does.

jimmo avatar Sep 19 '22 11:09 jimmo

@jimmo Please forget my rant about this PR. I could update the mimxrt and samd PRs quite smoothly. Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run. To keep that, it would require a lot of editing the history. But still the documentation will be helpful.

robert-hh avatar Sep 19 '22 14:09 robert-hh

Only the property is lost, that you could stop at any point in the commit tree and still have a state, that can be compiled and run.

@robert-hh Can you clarify what you mean here... every commit from this PR should have compiled and run (or at least that was my intention and I did a lot of rebasing while I was working to make this the case, into the hundreds of times as we revised it over and over... it's possible I missed something though).

Or is this more a matter for a rebased pending PR (i.e. your samd branch?). It should be possible to do the rebase in a way such that every commit still compiles?

jimmo avatar Sep 19 '22 14:09 jimmo

I have no doubt that you tested everything carefully.

Or is this more a matter for a rebased pending PR (i.e. your samd branch?).

That's it.

It should be possible to do the rebase in a way such that every commit still compiles?

I know. But then each commit, which introduces a new class, would have to be changed separately in the history. That is possible with git rebase -i, but I do that typically with fingers crossed (and a backup). I had cases where that damaged the repository.

robert-hh avatar Sep 19 '22 15:09 robert-hh

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

robert-hh avatar Sep 19 '22 15:09 robert-hh

And before I forget it (again): One really good aspect of this PR is, that it reduces the code size. At the SAMD port it's about 2.5 k.

Thanks! And also huge shoutout to @jepler for the original idea and implementation.

2.5k might might be only 2% of the firmware size, but it offsets other useful features (or optimisations), and to a build that is only just missing the flash budget it makes all the difference. Hopefully worth the once-off pain of migration.

jimmo avatar Sep 19 '22 23:09 jimmo

Indeed. At the SAM21 port I started fighting for almost every byte. Just one more question about the required chaneg. For that case in the getiter part:

  • If you had getiter as mp_identity_getiter and mp_stream_unbuffered_iter as iternext, then just set the MP_TYPE_FLAG_ITERNEXT_STREAM and do not set the getiter slot.

You mention the getiter slot, but what about the iternext slot in this case.

robert-hh avatar Sep 20 '22 09:09 robert-hh

You mention the getiter slot, but what about the iternext slot in this case.

There is now only an iter slot and its behaviour depends on the four possible flag values.

jimmo avatar Sep 20 '22 09:09 jimmo

So in that specific case, now just the flag is set and the iter slot should be unset. (There was about 20 instances of this in PYBV11, so 80 bytes saved)

jimmo avatar Sep 20 '22 09:09 jimmo

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

robert-hh avatar Sep 20 '22 09:09 robert-hh

Thanks.- Just a note: the flag MP_TYPE_FLAG_ITERNEXT_STREAM does not exist. Is MP_TYPE_FLAG_ITER_IS_STREAM the proper choice?

Ah thank you yes that's right. I will update.

We did a last minute rename after I wrote the first version of the wiki page.

jimmo avatar Sep 20 '22 09:09 jimmo

Sorry, what I wrote on the wiki was out of date and clearly confusing. I have updated it

jimmo avatar Sep 20 '22 10:09 jimmo

Is the PR reaching the final stage now? So I know when to test against it

stinos avatar Sep 20 '22 10:09 stinos

Is the PR reaching the final stage now? So I know when to test against it

It's merged.

jimmo avatar Sep 20 '22 10:09 jimmo

Oops missed that; then I better get started :)

stinos avatar Sep 20 '22 10:09 stinos

Sorry, what I wrote on the wiki was out of date and clearly confusing.

It's a good documentation, and any documentation is better than none.

robert-hh avatar Sep 20 '22 11:09 robert-hh