Replace usage of old unicode API removed in Py3.12/PEP 623

Open kxrob opened this issue 3 years ago • 1 comments

The "legacy" Unicode object will be removed in Python 3.12
with deprecated APIs. All Unicode objects will be "canonical"
since then. See PEP 623 for more information.

Those old APIs were still used in pywin32:

  • PyUnicode_AsUnicode
  • PyUnicode_GetSize
  • PyUnicode_AS_UNICODE
  • PyUnicode_GET_SIZE, PyUnicode_GET_DATA_SIZE
  • PyUnicode_FromUnicode
  • PyUnicode_EncodeMBCS
  • u, u#, Z, Z# in PyArg_Parse... format strings
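
For orientation, the non-deprecated counterparts of most of the calls listed above are PyUnicode_AsWideCharString, PyUnicode_GetLength and PyUnicode_FromWideChar. The sketch below is not pywin32 code: it only exercises those three C functions from Python through ctypes.pythonapi, assuming CPython. The helper names as_wide and from_wide are made up for illustration.

```python
import ctypes

api = ctypes.pythonapi  # CPython's C API; ctypes calls it with the GIL held

# wchar_t *PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
# Keep the raw address (not c_wchar_p) so the buffer can be freed later.
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

# Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
api.PyUnicode_GetLength.argtypes = [ctypes.py_object]
api.PyUnicode_GetLength.restype = ctypes.c_ssize_t

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# void PyMem_Free(void *p)
api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def as_wide(text):
    """Copy text into a new wchar_t buffer, read it back, then free the buffer.

    Returns (string read from the buffer, length in wchar_t units).
    Hypothetical helper, replaces what PyUnicode_AsUnicode used to provide.
    """
    size = ctypes.c_ssize_t()
    address = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    try:
        return ctypes.wstring_at(address, size.value), size.value
    finally:
        api.PyMem_Free(address)  # the caller owns the buffer


def from_wide(text):
    """Build a str from a NUL-terminated wchar_t string (size -1 means wcslen)."""
    return api.PyUnicode_FromWideChar(text, -1)
```

The main behavioural difference to the old API is ownership: PyUnicode_AsUnicode returned a pointer borrowed from the string object, whereas PyUnicode_AsWideCharString returns a fresh copy that has to be released with PyMem_Free.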

kxrob avatar Apr 14 '22 13:04 kxrob

The first Py3.12 alpha should appear on GitHub in October, which will allow the deprecation replacements to be tested fully: https://peps.python.org/pep-0693/

kxrob avatar Aug 14 '22 15:08 kxrob

I still need to get my head around the new macros - they add a lot of complexity I'm still untangling. I do like the ability it has to provide the PyObject** for PyArg_ParseTuple. ... hope there's something that can be done to make that part of this easier to understand.

This strange macro mechanism (U2WREC, U2WCONV, u2w ...) and the auxiliary union attribute stuff (handling source/target addresses for conversion) should probably just be dropped. It came only with the last 2 commits (replacing "u" and "Z" in PyArg_Parse...). Its only purpose was to avoid repeatedly writing extra PyObject* intermediate variables and extra PyWinObject_AsWCHAR statements in the PyArg_ParseTuple... use cases (like the ~100 cases where TmpWCHAR was used previously), by doing the conversion within TmpWCHAR itself and the PyArg_ParseTuple statement, and to automatically provide specific help text upon error.
But there are only a few new use cases for that. So that strange mechanism does not pay off, and it's not worth compressing those existing 100 cases. It's probably best to just remove those 2 commits, keep the code easy to read, and write out those few new PyArg_ParseTuple use cases in the existing style. For the single big bulk use case in the last commit (bulk "Z"), a specific local macro and auxiliary array(s) can be used for conversions in a loop. So I'm simply going to drop those 2 commits for now and handle the u and Z arg parse cases in a separate PR later ...

almost every existing use of TmpWCHAR should eventually move to this new mechanism

(well, it doesn't save typing and testing anymore, so it may not be worth touching and reorganizing all this in a readable way ... ?)

A new object, say, PyWin_WCHAR or similar, which looks like your changes here but, unlike TmpWCHAR, never takes ownership of memory allocated elsewhere. It only supports construction/initialization via a PyObject *.

Besides the dropped mechanism above, TmpWCHAR would so far only gain the ability to call PyUnicode_AsWideCharString automatically at assignment / construction time (2nd commit). That could become a separate class / name / subclass as well. But so far there is no really separate purpose: it stays the same (freeing the held temp string).
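
The holder idea (convert at construction time, free the temporary copy when done, never adopt foreign memory) can be sketched outside C++ as well. The following is only an illustration in Python via ctypes, assuming CPython. WideHolder and its none_ok flag are hypothetical names, not the TmpWCHAR or PyWin_WCHAR API.

```python
import ctypes

_as_wide = ctypes.pythonapi.PyUnicode_AsWideCharString
_as_wide.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_as_wide.restype = ctypes.c_void_p  # raw address; None stands for NULL

_free = ctypes.pythonapi.PyMem_Free
_free.argtypes = [ctypes.c_void_p]
_free.restype = None


class WideHolder:
    """Hypothetical holder: owns exactly one temporary wchar_t copy of a str.

    It is only ever initialized from a Python object, so it never takes
    ownership of memory allocated elsewhere.
    """

    def __init__(self, obj, none_ok=False):
        self.address = None  # NULL until a conversion succeeds
        self.length = 0      # length in wchar_t units, without the terminator
        if obj is None:
            if not none_ok:
                raise TypeError("None is not a valid string in this context")
            return
        if not isinstance(obj, str):
            raise TypeError("expected str, got %s" % type(obj).__name__)
        size = ctypes.c_ssize_t()
        self.address = _as_wide(obj, ctypes.byref(size))
        self.length = size.value

    def release(self):
        """Free the temporary copy; safe to call more than once."""
        if self.address is not None:
            _free(self.address)
            self.address = None
            self.length = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.release()
```

A typical use would be `with WideHolder(path) as h:` followed by passing `h.address` to a Win32 call. In C++ the destructor plays the role that `__exit__` plays here.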

if we can determine the object's PyUnicode_KIND is PyUnicode_4BYTE_KIND we could still borrow the buffer?

To potentially save a PyUnicode_AsWideCharString call in the (rare?) case of PyUnicode_2BYTE_KIND, it seems the canonical state must be guaranteed first (PyUnicode_READY(), at extra cost?, otherwise the string representation could change suddenly) and then checked (again). There is also a (non-canonical?) PyUnicode_WCHAR_KIND. Is Py_UCS2* / PyUnicode_2BYTE_KIND always a valid NUL-terminated Windows WCHAR string? If this works, is fast, and is worth it, the holder would need an extra PyObject* (with incref/decref) to the Python string, which would serve as a flag at the same time. That could still happen in the same class.
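
How rare the borrowable case is can be estimated without touching the C API. The sketch below infers the per-character storage width that PEP 393 assigns to a string by comparing sys.getsizeof for two lengths. This relies on a CPython implementation detail and is an assumption, not a documented interface. storage_width and could_borrow are hypothetical helper names.

```python
import ctypes
import sys

# Size of the platform wchar_t: 2 on Windows, typically 4 elsewhere.
WCHAR_SIZE = ctypes.sizeof(ctypes.c_wchar)


def storage_width(sample_char):
    """Bytes CPython stores per character for a string made of sample_char.

    Two freshly built strings that differ by 100 characters differ in size
    by 100 * width, so the fixed object header cancels out.
    """
    short = sample_char * 2
    longer = sample_char * 102
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // 100


def could_borrow(sample_char):
    """True only if the internal buffer already has wchar_t-sized elements."""
    return storage_width(sample_char) == WCHAR_SIZE
```

Under this model a pure ASCII or Latin-1 string uses 1 byte per character and a string containing a character outside the BMP uses 4, so on Windows only strings whose widest character lies in the BMP but above U+00FF would have a directly usable 2-byte buffer. Every other string would still need the PyUnicode_AsWideCharString copy.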

kxrob avatar Sep 05 '22 09:09 kxrob

Sorry for the delay and thanks for persevering!

mhammond avatar Sep 17 '22 04:09 mhammond