cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Replace _Py_IDENTIFIER() with statically initialized objects.

Open ericsnowcurrently opened this issue 3 years ago • 40 comments

BPO 46541
Nosy @vstinner, @tiran, @ericsnowcurrently, @serhiy-storchaka, @corona10, @erlend-aasland, @kumaraditya303, @notarealdeveloper
PRs
  • python/cpython#30928
  • python/cpython#30941
  • python/cpython#31261
  • python/cpython#31344
  • python/cpython#31346
  • python/cpython#31351
  • python/cpython#31358
  • python/cpython#31363
  • python/cpython#31364
  • python/cpython#31372
  • python/cpython#31375
  • python/cpython#31376
  • python/cpython#30310
  • python/cpython#31468
  • python/cpython#31475
  • python/cpython#31599
  • python/cpython#31608
  • python/cpython#31609
  • python/cpython#31683
  • python/cpython#32063
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ericsnowcurrently'
    closed_at = None
    created_at = <Date 2022-01-26.22:01:04.421>
    labels = ['interpreter-core', '3.11']
    title = 'Replace _Py_IDENTIFIER() with statically initialized objects.'
    updated_at = <Date 2022-03-23.15:52:58.371>
    user = 'https://github.com/ericsnowcurrently'
    

    bugs.python.org fields:

    activity = <Date 2022-03-23.15:52:58.371>
    actor = 'eric.snow'
    assignee = 'eric.snow'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2022-01-26.22:01:04.421>
    creator = 'eric.snow'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46541
    keywords = ['patch']
    message_count = 28.0
    messages = ['411799', '411800', '411910', '411987', '412229', '412405', '412488', '412868', '413047', '413086', '413246', '413274', '413311', '413312', '413323', '413333', '413334', '413337', '413355', '413374', '413384', '413653', '414178', '414182', '414265', '414274', '414534', '415881']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'christian.heimes', 'eric.snow', 'serhiy.storchaka', 'corona10', 'erlendaasland', 'kumaraditya', 'notarealdeveloper']
    pr_nums = ['30928', '30941', '31261', '31344', '31346', '31351', '31358', '31363', '31364', '31372', '31375', '31376', '30310', '31468', '31475', '31599', '31608', '31609', '31683', '32063']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue46541'
    versions = ['Python 3.11']
    
    • PR: gh-98956
    • PR: gh-98957
    • PR: gh-99010
    • PR: gh-99012
    • PR: gh-99054
    • PR: gh-99067
    • PR: gh-99210
    • PR: gh-99236

    ericsnowcurrently avatar Jan 26 '22 22:01 ericsnowcurrently

    _Py_Identifier has been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on _PyRuntimeState and reference them directly through a macro.

    This would involve the following:

    • add a PyUnicodeObject field (not a pointer) to _PyRuntimeStatefor each string that currently uses_Py_IDENTIFIER()`
    • initialize each object as part of the static initializer for _PyRuntimeState
    • make each "immortal" (e.g. start with a really high refcount)
    • add a macro to look up a given string
    • update each location that currently uses _Py_IDENTIFIER() to use the new macro instead

    As part of this, we would also do the following:

    • get rid of all C-API functions with _Py_Identifer parameters
    • get rid of the old runtime state related to identifiers
    • get rid of _Py_Identifier, _Py_IDENTIFIER(), etc.

    (Note that there are several hundred uses of _Py_IDENTIFIER(), including a number of duplicates.)

    Pros:

    • reduces indirection (and extra calls) for C-API using the strings (making the code easier to understand and speeding it up)
    • the objects are referenced from a fixed address in the static data section (speeding things up and allowing the C compiler to optimize better)
    • there is no lazy allocation (or lookup, etc.) so there are fewer possible failures when the objects get used (thus less error return checking)
    • simplifies the runtime state
    • saves memory (at little, at least)
    • the approach for per-interpreter is simpler (if needed)
    • reduces the number of static variables in any given C module
    • reduces the number of functions in the ("private") C-API
    • "deep frozen" modules can use these strings
    • other commonly-used strings could be pre-allocated by adding _PyRuntimeState fields for them

    Cons:

    • churn
    • adding a string to the list requires modifying a separate file from the one where you actually want to use the string
    • strings can get "orphaned" (we could prevent this with a check in make check)
    • some PyPI packages may rely on _Py_IDENTIFIER() (even though it is "private" C-API)
    • some strings may never get used for any given ./python invocation

    Note that with a basic partial implementation (GH-30928) I'm seeing a 1% improvement in performance (see https://github.com/faster-cpython/ideas/issues/230).

    ericsnowcurrently avatar Jan 26 '22 22:01 ericsnowcurrently

    ## Background ##

    _Py_Identifier (and _Py_IDENTIFIER(), etc.) was added in 2011 [1][2] for several reasons:

    • provide a consistent approach for a common optimization: caching C-string based string objects
    • facilitate freeing those objects during runtime finalization

    The solution involved using a static variable defined, using _Py_IDENTIFIER() near the code that needed the string. The variable (a _Py_Identifier) would hold the desired C string (statically initialized) and the corresponding (lazily created) PyUnicodeObject. The code where the _Py_Identifier was defined would then pass it to specialized versions of various C-API that would normally consume a C string or PyUnicodeObject. Then that code would use either the C-string or the object (creating and caching it first if not done already). This approach decentralized the caching but also provided a single tracking mechanism that made it easier to clean up the objects.

    Over the last decade a number of changes were made, including recent changes to make the identifiers per-interpreter and to use a centralized cache.

    [1] https://github.com/python/cpython/commit/afe55bba33a20f87a58f940186359237064b428f [2] https://mail.python.org/archives/list/[email protected]/message/FRUTTE47JO2XN3LXV2J4VB5A5VILILLA/

    ericsnowcurrently avatar Jan 26 '22 22:01 ericsnowcurrently

    New changeset 247480a21cb165efdacc346a2d589dfc27e18283 by Eric Snow in branch 'main': bpo-46541: Generate the global objects initializer. (gh-30941) https://github.com/python/cpython/commit/247480a21cb165efdacc346a2d589dfc27e18283

    ericsnowcurrently avatar Jan 27 '22 18:01 ericsnowcurrently

    _Py_Identifier has been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on _PyRuntimeState and reference them directly through a macro.

    This change is going to break projects in the wild. Yes, people use the _Py_IDENTIFIER(), _PyUnicode_FromId() and other "Id" variant of many functions in 3rd party projects.

    Is it possible to keep runtime initialization if this API is used by 3rd party code?

    vstinner avatar Jan 28 '22 11:01 vstinner

    If necessary, we can keep _Py_IDENTIFIER() (and the functions). Regardless, we can stop using it internally.

    ericsnowcurrently avatar Feb 01 '22 00:02 ericsnowcurrently

    FYI, I've posted to python-dev for feedback before proceeding:

    https://mail.python.org/archives/list/[email protected]/thread/DNMZAMB4M6RVR76RDZMUK2WRLI6KAAYS/

    ericsnowcurrently avatar Feb 02 '22 22:02 ericsnowcurrently

    (thanks Victor: https://mail.python.org/archives/list/[email protected]/message/7RMLIJHUWVBZFV747TFEHOE6LNBVQSMM/)

    3rd party use of _Py_IDENTIFIER():

    • blender + https://github.com/blender/blender/blob/master/source/blender/python/intern/bpy_traceback.c#L53 - copied from core code - "msg", "filename", "lineno", "offset", "text", "<string>" - uses _PyObject_GetAttrId()
    • datatable + https://github.com/h2oai/datatable/blob/45a87337bc68576c7fb6900f524925d4fb77d6a6/src/core/python/obj.cc#L76 - in C++ wrapper getting sys.stdout, etc. and writing to sys.stdout - has to hack around C++14 support - has a fallback under limited API - "stdin", "stdout", "stderr", "write" - uses _PySys_GetObjectId(), _PyObject_GetAttrId()
    • multidict (in aiohttp) + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/defs.h#L8 + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/istr.h#L46 + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/pair_list.h#L114 - calling str.lower() - "lower" - uses _PyObject_CallMethodId()
    • mypy (exclusively in mypyc, including in generated code!) + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/lib-rt/dict_ops.c#L76 + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/lib-rt/dict_ops.c#L131 - "setdefault", "update" - uses _PyObject_CallMethodIdObjArgs(), _PyObject_CallMethodIdOneArg() + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/pythonsupport.h#L26 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/pythonsupport.h#L109 - "__mro_entries__", "__init_subclass__" - uses _PyObject_LookupAttrId(), _PyObject_GetAttrId() + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L27 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L47 - "send", "throw", "close" - uses _PyObject_CallMethodIdOneArg(), _PyObject_GetAttrId() + https://github.com/python/mypy/blob/8c5c915a89ec0f35b3e07332c7090e62f143043e/mypyc/lib-rt/bytes_ops.c#L104 - "join" - uses _PyObject_CallMethodIdOneArg() + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/codegen/emitwrapper.py#L326 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L694 - uses _PyObject_GetAttrId()
    • pickle5 + https://github.com/pitrou/pickle5-backport/blob/e6117502435aba2901585cc6c692fb9582545f08/pickle5/_pickle.c#L224 + https://github.com/pitrou/pickle5-backport/blob/e6117502435aba2901585cc6c692fb9582545f08/pickle5/compat.h - "getattr" - uses _PyUnicode_FromId()
    • pysqlite3 + https://github.com/coleifer/pysqlite3/blob/f302859dc9ddb47a1089324dbca3873740b74af9/src/microprotocols.c#L103 + https://github.com/coleifer/pysqlite3/blob/f302859dc9ddb47a1089324dbca3873740b74af9/src/microprotocols.c#L119 - "__adapt__", "__conform__" - uses _PyObject_CallMethodId() + https://github.com/coleifer/pysqlite3/blob/093b88d1a58b141db8cf971c35ea1f6b674d0d02/src/connection.c#L51 + https://github.com/coleifer/pysqlite3/blob/093b88d1a58b141db8cf971c35ea1f6b674d0d02/src/connection.c#L772 - "finalize", "value", "upper", "cursor" - uses _PyObject_CallMethodId(), _PyObject_CallMethodIdObjArgs() + https://github.com/coleifer/pysqlite3/blob/49ce9c7a89a3c9f47ab8d32b6c4e2f7d629c1688/src/module.c#L195 - "upper" - uses _PyObject_CallMethodId() + https://github.com/coleifer/pysqlite3/blob/91b2664f525b19feedfca3f0913302c6f1e8be8a/src/cursor.c#L103 - "upper" - uses _PyObject_CallMethodId()
    • typed_ast + a fork of CPython's ast code
    • zodbpickle + a fork of CPython's pickle

    All of them should be trivial to drop _Py_IDENTIFIER() without any real performance impact or mess.

    Also, the following implies that PyPy has some sort of _Py_IDENTIFIER() support: https://github.com/benhoyt/scandir/blob/3396aa4155ffde8600a0e9ca50d5872569169b5d/_scandir.c#L51.

    ericsnowcurrently avatar Feb 04 '22 01:02 ericsnowcurrently

    New changeset 81c72044a181dbbfbf689d7a977d0d99090f26a8 by Eric Snow in branch 'main': bpo-46541: Replace core use of _Py_IDENTIFIER() with statically initialized global objects. (gh-30928) https://github.com/python/cpython/commit/81c72044a181dbbfbf689d7a977d0d99090f26a8

    ericsnowcurrently avatar Feb 08 '22 20:02 ericsnowcurrently

    Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic:

    $ gcc a.c
    a.c: In function 'main':
    a.c:29:13: warning: Deprecated pre-processor symbol
       29 |     PySomethingDeprecated (0);
          |             ^~~~~~~~~~~~~~~~~~
    a.c:30:13: warning: Deprecated pre-processor symbol: replace with "SomethingCompletelyDifferent"
       30 |     PySomethingDeprecated2 (42);
          |             ^~~~~~~~~~~~~~~~~~~~
    

    Here is how glib implements this for example: https://gist.github.com/lazka/4749c74249a3918a059d944040aca4a3

    Maybe that makes getting rid of them easier in the long run?

    On Fri, Feb 11, 2022 at 1:36 AM Christoph Reiter <[email protected]> wrote:

    Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic: [snip] Maybe that makes getting rid of them easier in the long run?

    That's a good question. We do have Py_DEPRECATED() (in Include/pyport.h), which is used for symbols. I'm not sure anyone has given much thought to deprecating macros, but it's probably worth considering. I recommend that you post something about this to [email protected].

    FWIW, here are other explanations of how to deprecate macros:

    • https://stackoverflow.com/questions/57478368/what-is-the-best-way-to-mark-macro-as-deprecated/57479189#57479189
    • https://stackoverflow.com/questions/2681259/how-to-deprecate-a-c-pre-processor-macro-in-gcc/29297970#29297970

    ericsnowcurrently avatar Feb 11 '22 16:02 ericsnowcurrently

    With core code sorted out, stdlib and 3rd party extension modules are left to sort out.

    I see the following possibilities:

    1. leave _Py_IDENTIFIER() alone (it is already disallowed in core code)
    2. change _Py_IDENTIFIER() to create static string objects (then get rid of global state)
    3. get rid of _Py_IDENTIFIER() a. provide a public alternative (needs a PEP) b. first work with 3rd party projects to stop using it

    ericsnowcurrently avatar Feb 14 '22 18:02 ericsnowcurrently

    New changeset 12360aa159c42c7798fd14225d271e6fd84db7eb by Eric Snow in branch 'main': bpo-46541: Discover the global strings. (gh-31346) https://github.com/python/cpython/commit/12360aa159c42c7798fd14225d271e6fd84db7eb

    ericsnowcurrently avatar Feb 15 '22 00:02 ericsnowcurrently

    New changeset 6c8958948666403f2370ca7b4c0a52b2010ec16d by Eric Snow in branch 'main': bpo-46541: Drop the check for orphaned global strings. (gh-31363) https://github.com/python/cpython/commit/6c8958948666403f2370ca7b4c0a52b2010ec16d

    ericsnowcurrently avatar Feb 16 '22 03:02 ericsnowcurrently

    New changeset 4d8a515d193a4c9f3844704f974ddb870d7ee383 by Eric Snow in branch 'main': bpo-46541: Scan Fewer Files in generate_global_objects.py (gh-31364) https://github.com/python/cpython/commit/4d8a515d193a4c9f3844704f974ddb870d7ee383

    ericsnowcurrently avatar Feb 16 '22 03:02 ericsnowcurrently

    New changeset e59309b9d0969d5c8f493cb8df28afa6f38d6273 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from dbms modules (GH-31358) https://github.com/python/cpython/commit/e59309b9d0969d5c8f493cb8df28afa6f38d6273

    corona10 avatar Feb 16 '22 09:02 corona10

    New changeset d64f3caebe8f8e31ecd193e0bd25105400153ece by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from csv module (GH-31372) https://github.com/python/cpython/commit/d64f3caebe8f8e31ecd193e0bd25105400153ece

    corona10 avatar Feb 16 '22 15:02 corona10

    New changeset b2077117d125925210148294eefee28797b7ff4c by Erlend Egeberg Aasland in branch 'main': bpo-46541: Replace _Py_IDENTIFIER with _Py_ID in sqlite3 (GH-31351) https://github.com/python/cpython/commit/b2077117d125925210148294eefee28797b7ff4c

    corona10 avatar Feb 16 '22 15:02 corona10

    New changeset e8a19b092fc0551581e10d6f310dd5feabac7609 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from mmap module (GH-31375) https://github.com/python/cpython/commit/e8a19b092fc0551581e10d6f310dd5feabac7609

    corona10 avatar Feb 16 '22 16:02 corona10

    (from https://github.com/python/cpython/pull/31376#issuecomment-1041836106)

    [corona10]

    Should we create the separate bpo issue if module changes are too noisy?

    I think it's fine to use the one issue. There are only 26 modules with NEEDS_PY_IDENTIFIER and none have much risk with the change. So it's unlikely that we'll get any discussion about any specific module. If we do, we can easily create an issue then for the module in question. If one of the modules had many uses of _Py_IDENTIFIER() then it might make sense to split it out, but IIRC none of them have very many.

    ericsnowcurrently avatar Feb 16 '22 19:02 ericsnowcurrently

    I think it's fine to use the one issue.

    Great! and thank you for all your efforts :)

    corona10 avatar Feb 17 '22 01:02 corona10

    New changeset 8cb5f707a841832aeac4b01c8b6a0e72dced52ae by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from array module (GH-31376) https://github.com/python/cpython/commit/8cb5f707a841832aeac4b01c8b6a0e72dced52ae

    corona10 avatar Feb 17 '22 04:02 corona10

    New changeset 2b86616456629e11de33629da1bb732f033c436e by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from pyexpat (GH-31468) https://github.com/python/cpython/commit/2b86616456629e11de33629da1bb732f033c436e

    corona10 avatar Feb 21 '22 14:02 corona10

    New changeset 088dd76dba68c2538776d9920607f81e54544cbd by Dong-hee Na in branch 'main': bpo-46541: Remove unnecessary Py_VISIT (GH-31608) https://github.com/python/cpython/commit/088dd76dba68c2538776d9920607f81e54544cbd

    tiran avatar Feb 28 '22 07:02 tiran

    New changeset c32aef48533769161e1247927a5b418322e0860c by Erlend Egeberg Aasland in branch 'main': bpo-46541: Remove unneeded visits from sqlite3 (GH-31609) https://github.com/python/cpython/commit/c32aef48533769161e1247927a5b418322e0860c

    corona10 avatar Feb 28 '22 09:02 corona10

    New changeset 0cc63641859b2f60ea65bb7c0b6d1cfcec1e2f1a by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from multibytecodec (GH-31475) https://github.com/python/cpython/commit/0cc63641859b2f60ea65bb7c0b6d1cfcec1e2f1a

    corona10 avatar Mar 01 '22 14:03 corona10

    New changeset e91b0a7139d4a4cbd2351ccb5cd021a100cf42d2 by Kumar Aditya in branch 'main': bpo-46541: remove usage of _Py_IDENTIFIER from _ssl module (GH-31599) https://github.com/python/cpython/commit/e91b0a7139d4a4cbd2351ccb5cd021a100cf42d2

    corona10 avatar Mar 01 '22 15:03 corona10

    New changeset d168c728f7114959e8fc147538ea1d24f2f5af79 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from lzma module (GH-31683) https://github.com/python/cpython/commit/d168c728f7114959e8fc147538ea1d24f2f5af79

    corona10 avatar Mar 04 '22 16:03 corona10

    New changeset 21412d037b07c08266e96dfd0c0e44a1b7693bc1 by Eric Snow in branch 'main': bpo-46541: Add a Comment About When to Use _Py_DECLARE_STR(). (gh-32063) https://github.com/python/cpython/commit/21412d037b07c08266e96dfd0c0e44a1b7693bc1

    ericsnowcurrently avatar Mar 23 '22 15:03 ericsnowcurrently

    With https://github.com/python/cpython/pull/93597, the global identifier strings are interned and now can be used in-place of other cached interned strings to avoid allocations and fast comparisons with pointers and sub-interpreter compatible as bonus.

    kumaraditya303 avatar Jul 09 '22 10:07 kumaraditya303

    FTR, the stdlib extensions still using _Py_IDENTIFIER() (grep -l NEEDS_PY_IDENTIFIER **/*.c):

    • [x] Modules/_asynciomodule.c
    • [x] Modules/_ctypes/callbacks.c
    • [x] Modules/_ctypes/callproc.c
    • [x] Modules/_ctypes/_ctypes.c
    • [x] Modules/_ctypes/stgdict.c
    • [x] Modules/_cursesmodule.c
    • [x] Modules/_datetimemodule.c
    • [x] Modules/_elementtree.c
    • [x] Modules/_json.c
    • [x] Modules/ossaudiodev.c
    • [x] Modules/_testcapimodule.c
    • [x] PC/_msi.c

    ~(Also Programs/_testembed.c)~

    ericsnowcurrently avatar Oct 31 '22 16:10 ericsnowcurrently