cpython
cpython copied to clipboard
Replace _Py_IDENTIFIER() with statically initialized objects.
| BPO | 46541 |
|---|---|
| Nosy | @vstinner, @tiran, @ericsnowcurrently, @serhiy-storchaka, @corona10, @erlend-aasland, @kumaraditya303, @notarealdeveloper |
| PRs |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
assignee = 'https://github.com/ericsnowcurrently'
closed_at = None
created_at = <Date 2022-01-26.22:01:04.421>
labels = ['interpreter-core', '3.11']
title = 'Replace _Py_IDENTIFIER() with statically initialized objects.'
updated_at = <Date 2022-03-23.15:52:58.371>
user = 'https://github.com/ericsnowcurrently'
bugs.python.org fields:
activity = <Date 2022-03-23.15:52:58.371>
actor = 'eric.snow'
assignee = 'eric.snow'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2022-01-26.22:01:04.421>
creator = 'eric.snow'
dependencies = []
files = []
hgrepos = []
issue_num = 46541
keywords = ['patch']
message_count = 28.0
messages = ['411799', '411800', '411910', '411987', '412229', '412405', '412488', '412868', '413047', '413086', '413246', '413274', '413311', '413312', '413323', '413333', '413334', '413337', '413355', '413374', '413384', '413653', '414178', '414182', '414265', '414274', '414534', '415881']
nosy_count = 8.0
nosy_names = ['vstinner', 'christian.heimes', 'eric.snow', 'serhiy.storchaka', 'corona10', 'erlendaasland', 'kumaraditya', 'notarealdeveloper']
pr_nums = ['30928', '30941', '31261', '31344', '31346', '31351', '31358', '31363', '31364', '31372', '31375', '31376', '30310', '31468', '31475', '31599', '31608', '31609', '31683', '32063']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = None
url = 'https://bugs.python.org/issue46541'
versions = ['Python 3.11']
- PR: gh-98956
- PR: gh-98957
- PR: gh-99010
- PR: gh-99012
- PR: gh-99054
- PR: gh-99067
- PR: gh-99210
- PR: gh-99236
_Py_Identifier has been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on _PyRuntimeState and reference them directly through a macro.
This would involve the following:
- add a
PyUnicodeObject field (not a pointer) to_PyRuntimeStatefor each string that currently uses_Py_IDENTIFIER()` - initialize each object as part of the static initializer for
_PyRuntimeState - make each "immortal" (e.g. start with a really high refcount)
- add a macro to look up a given string
- update each location that currently uses
_Py_IDENTIFIER()to use the new macro instead
As part of this, we would also do the following:
- get rid of all C-API functions with
_Py_Identiferparameters - get rid of the old runtime state related to identifiers
- get rid of
_Py_Identifier,_Py_IDENTIFIER(), etc.
(Note that there are several hundred uses of _Py_IDENTIFIER(), including a number of duplicates.)
Pros:
- reduces indirection (and extra calls) for C-API using the strings (making the code easier to understand and speeding it up)
- the objects are referenced from a fixed address in the static data section (speeding things up and allowing the C compiler to optimize better)
- there is no lazy allocation (or lookup, etc.) so there are fewer possible failures when the objects get used (thus less error return checking)
- simplifies the runtime state
- saves memory (at little, at least)
- the approach for per-interpreter is simpler (if needed)
- reduces the number of static variables in any given C module
- reduces the number of functions in the ("private") C-API
- "deep frozen" modules can use these strings
- other commonly-used strings could be pre-allocated by adding
_PyRuntimeStatefields for them
Cons:
- churn
- adding a string to the list requires modifying a separate file from the one where you actually want to use the string
- strings can get "orphaned" (we could prevent this with a check in
make check) - some PyPI packages may rely on
_Py_IDENTIFIER()(even though it is "private" C-API) - some strings may never get used for any given ./python invocation
Note that with a basic partial implementation (GH-30928) I'm seeing a 1% improvement in performance (see https://github.com/faster-cpython/ideas/issues/230).
## Background ##
_Py_Identifier (and _Py_IDENTIFIER(), etc.) was added in 2011 [1][2] for several reasons:
- provide a consistent approach for a common optimization: caching C-string based string objects
- facilitate freeing those objects during runtime finalization
The solution involved using a static variable defined, using _Py_IDENTIFIER() near the code that needed the string. The variable (a _Py_Identifier) would hold the desired C string (statically initialized) and the corresponding (lazily created) PyUnicodeObject. The code where the _Py_Identifier was defined would then pass it to specialized versions of various C-API that would normally consume a C string or PyUnicodeObject. Then that code would use either the C-string or the object (creating and caching it first if not done already). This approach decentralized the caching but also provided a single tracking mechanism that made it easier to clean up the objects.
Over the last decade a number of changes were made, including recent changes to make the identifiers per-interpreter and to use a centralized cache.
[1] https://github.com/python/cpython/commit/afe55bba33a20f87a58f940186359237064b428f [2] https://mail.python.org/archives/list/[email protected]/message/FRUTTE47JO2XN3LXV2J4VB5A5VILILLA/
New changeset 247480a21cb165efdacc346a2d589dfc27e18283 by Eric Snow in branch 'main': bpo-46541: Generate the global objects initializer. (gh-30941) https://github.com/python/cpython/commit/247480a21cb165efdacc346a2d589dfc27e18283
_Py_Identifierhas been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on_PyRuntimeStateand reference them directly through a macro.
This change is going to break projects in the wild. Yes, people use the _Py_IDENTIFIER(), _PyUnicode_FromId() and other "Id" variant of many functions in 3rd party projects.
Is it possible to keep runtime initialization if this API is used by 3rd party code?
If necessary, we can keep _Py_IDENTIFIER() (and the functions). Regardless, we can stop using it internally.
FYI, I've posted to python-dev for feedback before proceeding:
https://mail.python.org/archives/list/[email protected]/thread/DNMZAMB4M6RVR76RDZMUK2WRLI6KAAYS/
(thanks Victor: https://mail.python.org/archives/list/[email protected]/message/7RMLIJHUWVBZFV747TFEHOE6LNBVQSMM/)
3rd party use of _Py_IDENTIFIER():
- blender + https://github.com/blender/blender/blob/master/source/blender/python/intern/bpy_traceback.c#L53 - copied from core code - "msg", "filename", "lineno", "offset", "text", "<string>" - uses _PyObject_GetAttrId()
- datatable + https://github.com/h2oai/datatable/blob/45a87337bc68576c7fb6900f524925d4fb77d6a6/src/core/python/obj.cc#L76 - in C++ wrapper getting sys.stdout, etc. and writing to sys.stdout - has to hack around C++14 support - has a fallback under limited API - "stdin", "stdout", "stderr", "write" - uses _PySys_GetObjectId(), _PyObject_GetAttrId()
- multidict (in aiohttp) + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/defs.h#L8 + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/istr.h#L46 + https://github.com/aio-libs/multidict/blob/6dedb623cca8e8fe64f502dfa479826efc321385/multidict/_multilib/pair_list.h#L114 - calling str.lower() - "lower" - uses _PyObject_CallMethodId()
- mypy (exclusively in mypyc, including in generated code!) + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/lib-rt/dict_ops.c#L76 + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/lib-rt/dict_ops.c#L131 - "setdefault", "update" - uses _PyObject_CallMethodIdObjArgs(), _PyObject_CallMethodIdOneArg() + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/pythonsupport.h#L26 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/pythonsupport.h#L109 - "__mro_entries__", "__init_subclass__" - uses _PyObject_LookupAttrId(), _PyObject_GetAttrId() + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L27 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L47 - "send", "throw", "close" - uses _PyObject_CallMethodIdOneArg(), _PyObject_GetAttrId() + https://github.com/python/mypy/blob/8c5c915a89ec0f35b3e07332c7090e62f143043e/mypyc/lib-rt/bytes_ops.c#L104 - "join" - uses _PyObject_CallMethodIdOneArg() + https://github.com/python/mypy/blob/3c935bdd1332672f5daeae7f3f9a858a453333d4/mypyc/codegen/emitwrapper.py#L326 + https://github.com/python/mypy/blob/2b7e2df923f7e4a3a199915b3c8563f45bc69dfa/mypyc/lib-rt/misc_ops.c#L694 - uses _PyObject_GetAttrId()
- pickle5 + https://github.com/pitrou/pickle5-backport/blob/e6117502435aba2901585cc6c692fb9582545f08/pickle5/_pickle.c#L224 + https://github.com/pitrou/pickle5-backport/blob/e6117502435aba2901585cc6c692fb9582545f08/pickle5/compat.h - "getattr" - uses _PyUnicode_FromId()
- pysqlite3 + https://github.com/coleifer/pysqlite3/blob/f302859dc9ddb47a1089324dbca3873740b74af9/src/microprotocols.c#L103 + https://github.com/coleifer/pysqlite3/blob/f302859dc9ddb47a1089324dbca3873740b74af9/src/microprotocols.c#L119 - "__adapt__", "__conform__" - uses _PyObject_CallMethodId() + https://github.com/coleifer/pysqlite3/blob/093b88d1a58b141db8cf971c35ea1f6b674d0d02/src/connection.c#L51 + https://github.com/coleifer/pysqlite3/blob/093b88d1a58b141db8cf971c35ea1f6b674d0d02/src/connection.c#L772 - "finalize", "value", "upper", "cursor" - uses _PyObject_CallMethodId(), _PyObject_CallMethodIdObjArgs() + https://github.com/coleifer/pysqlite3/blob/49ce9c7a89a3c9f47ab8d32b6c4e2f7d629c1688/src/module.c#L195 - "upper" - uses _PyObject_CallMethodId() + https://github.com/coleifer/pysqlite3/blob/91b2664f525b19feedfca3f0913302c6f1e8be8a/src/cursor.c#L103 - "upper" - uses _PyObject_CallMethodId()
- typed_ast + a fork of CPython's ast code
- zodbpickle + a fork of CPython's pickle
All of them should be trivial to drop _Py_IDENTIFIER() without any real performance impact or mess.
Also, the following implies that PyPy has some sort of _Py_IDENTIFIER() support: https://github.com/benhoyt/scandir/blob/3396aa4155ffde8600a0e9ca50d5872569169b5d/_scandir.c#L51.
New changeset 81c72044a181dbbfbf689d7a977d0d99090f26a8 by Eric Snow in branch 'main': bpo-46541: Replace core use of _Py_IDENTIFIER() with statically initialized global objects. (gh-30928) https://github.com/python/cpython/commit/81c72044a181dbbfbf689d7a977d0d99090f26a8
Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic:
$ gcc a.c
a.c: In function 'main':
a.c:29:13: warning: Deprecated pre-processor symbol
29 | PySomethingDeprecated (0);
| ^~~~~~~~~~~~~~~~~~
a.c:30:13: warning: Deprecated pre-processor symbol: replace with "SomethingCompletelyDifferent"
30 | PySomethingDeprecated2 (42);
| ^~~~~~~~~~~~~~~~~~~~
Here is how glib implements this for example: https://gist.github.com/lazka/4749c74249a3918a059d944040aca4a3
Maybe that makes getting rid of them easier in the long run?
On Fri, Feb 11, 2022 at 1:36 AM Christoph Reiter <[email protected]> wrote:
Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic: [snip] Maybe that makes getting rid of them easier in the long run?
That's a good question. We do have Py_DEPRECATED() (in Include/pyport.h), which is used for symbols. I'm not sure anyone has given much thought to deprecating macros, but it's probably worth considering. I recommend that you post something about this to [email protected].
FWIW, here are other explanations of how to deprecate macros:
- https://stackoverflow.com/questions/57478368/what-is-the-best-way-to-mark-macro-as-deprecated/57479189#57479189
- https://stackoverflow.com/questions/2681259/how-to-deprecate-a-c-pre-processor-macro-in-gcc/29297970#29297970
With core code sorted out, stdlib and 3rd party extension modules are left to sort out.
I see the following possibilities:
- leave
_Py_IDENTIFIER()alone (it is already disallowed in core code) - change
_Py_IDENTIFIER()to create static string objects (then get rid of global state) - get rid of
_Py_IDENTIFIER()a. provide a public alternative (needs a PEP) b. first work with 3rd party projects to stop using it
New changeset 12360aa159c42c7798fd14225d271e6fd84db7eb by Eric Snow in branch 'main': bpo-46541: Discover the global strings. (gh-31346) https://github.com/python/cpython/commit/12360aa159c42c7798fd14225d271e6fd84db7eb
New changeset 6c8958948666403f2370ca7b4c0a52b2010ec16d by Eric Snow in branch 'main': bpo-46541: Drop the check for orphaned global strings. (gh-31363) https://github.com/python/cpython/commit/6c8958948666403f2370ca7b4c0a52b2010ec16d
New changeset 4d8a515d193a4c9f3844704f974ddb870d7ee383 by Eric Snow in branch 'main': bpo-46541: Scan Fewer Files in generate_global_objects.py (gh-31364) https://github.com/python/cpython/commit/4d8a515d193a4c9f3844704f974ddb870d7ee383
New changeset e59309b9d0969d5c8f493cb8df28afa6f38d6273 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from dbms modules (GH-31358) https://github.com/python/cpython/commit/e59309b9d0969d5c8f493cb8df28afa6f38d6273
New changeset d64f3caebe8f8e31ecd193e0bd25105400153ece by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from csv module (GH-31372) https://github.com/python/cpython/commit/d64f3caebe8f8e31ecd193e0bd25105400153ece
New changeset b2077117d125925210148294eefee28797b7ff4c by Erlend Egeberg Aasland in branch 'main': bpo-46541: Replace _Py_IDENTIFIER with _Py_ID in sqlite3 (GH-31351) https://github.com/python/cpython/commit/b2077117d125925210148294eefee28797b7ff4c
New changeset e8a19b092fc0551581e10d6f310dd5feabac7609 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from mmap module (GH-31375) https://github.com/python/cpython/commit/e8a19b092fc0551581e10d6f310dd5feabac7609
(from https://github.com/python/cpython/pull/31376#issuecomment-1041836106)
[corona10]
Should we create the separate bpo issue if module changes are too noisy?
I think it's fine to use the one issue. There are only 26 modules with NEEDS_PY_IDENTIFIER and none have much risk with the change. So it's unlikely that we'll get any discussion about any specific module. If we do, we can easily create an issue then for the module in question. If one of the modules had many uses of _Py_IDENTIFIER() then it might make sense to split it out, but IIRC none of them have very many.
I think it's fine to use the one issue.
Great! and thank you for all your efforts :)
New changeset 8cb5f707a841832aeac4b01c8b6a0e72dced52ae by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from array module (GH-31376) https://github.com/python/cpython/commit/8cb5f707a841832aeac4b01c8b6a0e72dced52ae
New changeset 2b86616456629e11de33629da1bb732f033c436e by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from pyexpat (GH-31468) https://github.com/python/cpython/commit/2b86616456629e11de33629da1bb732f033c436e
New changeset 088dd76dba68c2538776d9920607f81e54544cbd by Dong-hee Na in branch 'main': bpo-46541: Remove unnecessary Py_VISIT (GH-31608) https://github.com/python/cpython/commit/088dd76dba68c2538776d9920607f81e54544cbd
New changeset c32aef48533769161e1247927a5b418322e0860c by Erlend Egeberg Aasland in branch 'main': bpo-46541: Remove unneeded visits from sqlite3 (GH-31609) https://github.com/python/cpython/commit/c32aef48533769161e1247927a5b418322e0860c
New changeset 0cc63641859b2f60ea65bb7c0b6d1cfcec1e2f1a by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from multibytecodec (GH-31475) https://github.com/python/cpython/commit/0cc63641859b2f60ea65bb7c0b6d1cfcec1e2f1a
New changeset e91b0a7139d4a4cbd2351ccb5cd021a100cf42d2 by Kumar Aditya in branch 'main': bpo-46541: remove usage of _Py_IDENTIFIER from _ssl module (GH-31599) https://github.com/python/cpython/commit/e91b0a7139d4a4cbd2351ccb5cd021a100cf42d2
New changeset d168c728f7114959e8fc147538ea1d24f2f5af79 by Dong-hee Na in branch 'main': bpo-46541: Remove usage of _Py_IDENTIFIER from lzma module (GH-31683) https://github.com/python/cpython/commit/d168c728f7114959e8fc147538ea1d24f2f5af79
New changeset 21412d037b07c08266e96dfd0c0e44a1b7693bc1 by Eric Snow in branch 'main': bpo-46541: Add a Comment About When to Use _Py_DECLARE_STR(). (gh-32063) https://github.com/python/cpython/commit/21412d037b07c08266e96dfd0c0e44a1b7693bc1
With https://github.com/python/cpython/pull/93597, the global identifier strings are interned and now can be used in-place of other cached interned strings to avoid allocations and fast comparisons with pointers and sub-interpreter compatible as bonus.
FTR, the stdlib extensions still using _Py_IDENTIFIER() (grep -l NEEDS_PY_IDENTIFIER **/*.c):
- [x] Modules/_asynciomodule.c
- [x] Modules/_ctypes/callbacks.c
- [x] Modules/_ctypes/callproc.c
- [x] Modules/_ctypes/_ctypes.c
- [x] Modules/_ctypes/stgdict.c
- [x] Modules/_cursesmodule.c
- [x] Modules/_datetimemodule.c
- [x] Modules/_elementtree.c
- [x] Modules/_json.c
- [x] Modules/ossaudiodev.c
- [x] Modules/_testcapimodule.c
- [x] PC/_msi.c
~(Also Programs/_testembed.c)~