String Objects in Python
String objects in Python, like integer objects, are immutable.
Python 2 has two kinds of strings: ordinary strings (8-bit strings) and Unicode strings (unicode strings), corresponding to the str (PyStringObject) and unicode (PyUnicodeObject) types respectively. Both are subclasses of the abstract type basestring:
>>> str.__base__
<type 'basestring'>
>>> unicode.__base__
<type 'basestring'>
basestring() This abstract type is the superclass for str and unicode. It cannot be called or instantiated, but it can be used to test whether an object is an instance of str or unicode. isinstance(obj, basestring) is equivalent to isinstance(obj, (str, unicode)). New in version 2.3.
bytes is an alias of str:
>>> bytes
<type 'str'>
Python 3 replaces Unicode strings and ordinary strings with text and (binary) data, corresponding to the str (PyUnicodeObject) and bytes (PyBytesObject) types, and the abstract type basestring has been removed.
"Text Vs. Data Instead Of Unicode Vs. 8-bit" in the Python 3.0 release notes describes these changes to string objects.
Additionally, before Python 3.3, strings stored each code point using 16 or 32 bits in memory.
Before Python 3.3, CPython could be compiled to use either 16 or 32 bits per code point in RAM; the former was a "narrow build," and the latter a "wide build." To know which you have, check the value of sys.maxunicode: 65535 implies a "narrow build" that can't handle code points above U+FFFF transparently. A "wide build" doesn't have this limitation, but consumes a lot of memory: 4 bytes per character, even while the vast majority of code points for Chinese ideographs fit in 2 bytes. Neither option was great, so you had to choose depending on your needs. Since Python 3.3, when creating a new str object, the interpreter checks the characters in it and chooses the most economic memory layout that is suitable for that particular str: if there are only characters in the latin1 range, that str will use just one byte per code point. Otherwise, 2 or 4 bytes per code point may be used, depending on the str. This is a simplification; for the full details, look up PEP 393 -- Flexible String Representation. from: Fluent Python: Clear, Concise, and Effective Programming -- Luciano Ramalho
Since Python 3.3, CPython uses several internal encoding schemes to store Unicode string objects.
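This flexible representation (PEP 393) is easy to observe with sys.getsizeof: the marginal cost per character is 1, 2 or 4 bytes depending on the widest code point the string contains. This is CPython-specific behavior, not a language guarantee:

```python
import sys

# Flexible string representation (PEP 393): the per-character cost
# depends on the widest code point the string contains.
for ch in ("a", "\u00e9", "\u0394", "\U0001F600"):
    per_char = (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000
    print(hex(ord(ch)), per_char, "byte(s) per code point")
```

An ASCII or latin1 string costs 1 byte per code point, a BMP string 2 bytes, and a string containing astral code points 4 bytes.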
0x00 How string objects are stored
PyUnicodeObject is defined in Include/unicodeobject.h:
/* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
structure. state.ascii and state.compact are set, and the data
immediately follow the structure. utf8_length and wstr_length can be found
in the length field; the utf8 pointer is equal to the data pointer. */
typedef struct {
/* There are 4 forms of Unicode strings:
- compact ascii:
* structure = PyASCIIObject
* test: PyUnicode_IS_COMPACT_ASCII(op)
* kind = PyUnicode_1BYTE_KIND
* compact = 1
* ascii = 1
* ready = 1
* (length is the length of the utf8 and wstr strings)
* (data starts just after the structure)
* (since ASCII is decoded from UTF-8, the utf8 string are the data)
- compact:
* structure = PyCompactUnicodeObject
* test: PyUnicode_IS_COMPACT(op) && !PyUnicode_IS_ASCII(op)
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
PyUnicode_4BYTE_KIND
* compact = 1
* ready = 1
* ascii = 0
* utf8 is not shared with data
* utf8_length = 0 if utf8 is NULL
* wstr is shared with data and wstr_length=length
if kind=PyUnicode_2BYTE_KIND and sizeof(wchar_t)=2
or if kind=PyUnicode_4BYTE_KIND and sizeof(wchar_t)=4
* wstr_length = 0 if wstr is NULL
* (data starts just after the structure)
- legacy string, not ready:
* structure = PyUnicodeObject
* test: kind == PyUnicode_WCHAR_KIND
* length = 0 (use wstr_length)
* hash = -1
* kind = PyUnicode_WCHAR_KIND
* compact = 0
* ascii = 0
* ready = 0
* interned = SSTATE_NOT_INTERNED
* wstr is not NULL
* data.any is NULL
* utf8 is NULL
* utf8_length = 0
- legacy string, ready:
* structure = PyUnicodeObject structure
* test: !PyUnicode_IS_COMPACT(op) && kind != PyUnicode_WCHAR_KIND
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
PyUnicode_4BYTE_KIND
* compact = 0
* ready = 1
* data.any is not NULL
* utf8 is shared and utf8_length = length with data.any if ascii = 1
* utf8_length = 0 if utf8 is NULL
* wstr is shared with data.any and wstr_length = length
if kind=PyUnicode_2BYTE_KIND and sizeof(wchar_t)=2
or if kind=PyUnicode_4BYTE_KIND and sizeof(wchar_t)=4
* wstr_length = 0 if wstr is NULL
Compact strings use only one memory block (structure + characters),
whereas legacy strings use one block for the structure and one block
for characters.
Legacy strings are created by PyUnicode_FromUnicode() and
PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
when PyUnicode_READY() is called.
See also _PyUnicode_CheckConsistency().
*/
PyObject_HEAD
Py_ssize_t length; /* Number of code points in the string */
Py_hash_t hash; /* Hash value; -1 if not set */
struct {
/*
SSTATE_NOT_INTERNED (0)
SSTATE_INTERNED_MORTAL (1)
SSTATE_INTERNED_IMMORTAL (2)
If interned != SSTATE_NOT_INTERNED, the two references from the
dictionary to this object are *not* counted in ob_refcnt.
*/
unsigned int interned:2;
/* Character size:
- PyUnicode_WCHAR_KIND (0):
* character type = wchar_t (16 or 32 bits, depending on the
platform)
- PyUnicode_1BYTE_KIND (1):
* character type = Py_UCS1 (8 bits, unsigned)
* all characters are in the range U+0000-U+00FF (latin1)
* if ascii is set, all characters are in the range U+0000-U+007F
(ASCII), otherwise at least one character is in the range
U+0080-U+00FF
- PyUnicode_2BYTE_KIND (2):
* character type = Py_UCS2 (16 bits, unsigned)
* all characters are in the range U+0000-U+FFFF (BMP)
* at least one character is in the range U+0100-U+FFFF
- PyUnicode_4BYTE_KIND (4):
* character type = Py_UCS4 (32 bits, unsigned)
* all characters are in the range U+0000-U+10FFFF
* at least one character is in the range U+10000-U+10FFFF
*/
unsigned int kind:3;
/* Compact is with respect to the allocation scheme. Compact unicode
objects only require one memory block while non-compact objects use
one block for the PyUnicodeObject struct and another for its data
buffer. */
unsigned int compact:1;
/* The string only contains characters in the range U+0000-U+007F (ASCII)
and the kind is PyUnicode_1BYTE_KIND. If ascii is set and compact is
set, use the PyASCIIObject structure. */
unsigned int ascii:1;
/* The ready flag indicates whether the object layout is initialized
completely. This means that this is either a compact object, or
the data pointer is filled out. The bit is redundant, and helps
to minimize the test in PyUnicode_IS_READY(). */
unsigned int ready:1;
/* Padding to ensure that PyUnicode_DATA() is always aligned to
4 bytes (see issue #19537 on m68k). */
unsigned int :24;
} state;
wchar_t *wstr; /* wchar_t representation (null-terminated) */
} PyASCIIObject;
/* Non-ASCII strings allocated through PyUnicode_New use the
PyCompactUnicodeObject structure. state.compact is set, and the data
immediately follow the structure. */
typedef struct {
PyASCIIObject _base;
Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the
* terminating \0. */
char *utf8; /* UTF-8 representation (null-terminated) */
Py_ssize_t wstr_length; /* Number of code points in wstr, possible
* surrogates count as two code points. */
} PyCompactUnicodeObject;
/* Strings allocated through PyUnicode_FromUnicode(NULL, len) use the
PyUnicodeObject structure. The actual string data is initially in the wstr
block, and copied into the data block using _PyUnicode_Ready. */
typedef struct {
PyCompactUnicodeObject _base;
union {
void *any;
Py_UCS1 *latin1;
Py_UCS2 *ucs2;
Py_UCS4 *ucs4;
} data; /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;
From this lengthy definition and its comments we can see that Unicode strings come in four forms:
- compact ascii
- compact
- legacy string, not ready
- legacy string, ready
A compact string uses a single memory block (structure + characters): the character data immediately follows the structure in memory. ASCII-only strings (code points in U+0000-U+007F) use the PyASCIIObject structure, while all other compact strings (which may use 1, 2 or 4 bytes per code point) use the PyCompactUnicodeObject structure. A legacy string uses one memory block for the PyUnicodeObject structure and a separate block for the characters.
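The size difference between the ASCII and compact-Unicode layouts is visible from Python via sys.getsizeof. Exact header sizes vary across CPython versions, so only relative sizes are checked here:

```python
import sys

# 'a' is ASCII-only -> PyASCIIObject header; 'é' needs latin1 ->
# PyCompactUnicodeObject header, which is strictly larger.
ascii_one = sys.getsizeof("a")
latin1_one = sys.getsizeof("\u00e9")
print(ascii_one, latin1_one)
assert latin1_one > ascii_one

# Both layouts are compact: each extra latin1 character costs 1 byte.
assert sys.getsizeof("aa") - sys.getsizeof("a") == 1
assert sys.getsizeof("\u00e9\u00e9") - sys.getsizeof("\u00e9") == 1
```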
The entire string object design is built on top of PyASCIIObject, which the other structures extend.
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Number of code points in the string */
Py_hash_t hash; /* Hash value; -1 if not set */
struct {
unsigned int interned:2;
unsigned int kind:3;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
unsigned int :24;
} state;
wchar_t *wstr;
} PyASCIIObject;
Here, length holds the number of code points in the string; hash caches the string object's hash value (strings are immutable, so this avoids recomputing the hash every time); state stores bookkeeping flags for the object; and wstr is the wchar_t representation of the string, NUL-terminated ("\0") just like a C string.
Besides the bits tied to the four forms (kind/compact/ascii/ready), state also contains an interned field, which records whether the object has been processed by the intern mechanism.
The intern mechanism
The purpose of interning PyStringObject objects is this: for an interned string such as "Ruby", there is only ever one PyStringObject corresponding to "Ruby" during the whole run of Python. So when deciding whether two interned PyStringObjects are the same, it suffices to check whether their PyObject* pointers are equal. This mechanism both saves memory and simplifies comparisons of PyStringObjects: two birds with one stone. -- from 《Python源码剖析》 (Python Source Code Analysis), Chen Ru
The intern mechanism for PyUnicodeObject in Python 3 works the same way as the one for PyStringObject in Python 2: it exploits string immutability to reuse existing string objects and save memory.
>>> a = 'Python'
>>> b = 'Python'
>>> a is b
True
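The effect is observable from Python. Identifier-like string literals are interned by the compiler, but any string can be interned explicitly through sys.intern, which exposes exactly this mechanism:

```python
import sys

suffix = "!"                      # built at runtime to defeat constant folding
a = "hello python" + suffix       # two distinct, non-interned objects
b = "hello python" + suffix
print(a == b, a is b)             # equal, but not the same object

c = sys.intern("hello python" + suffix)
d = sys.intern("hello python" + suffix)
print(c is d)                     # True: both resolved through the interned dict
```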
The PyUnicode_InternInPlace function implements interning; it is defined in Objects/unicodeobject.c:
/* This dictionary holds all interned unicode strings. Note that references
to strings in this dictionary are *not* counted in the string's ob_refcnt.
When the interned string reaches a refcnt of 0 the string deallocation
function will delete the reference from this dictionary.
Another way to look at this is that to say that the actual reference
count of a string is: s->ob_refcnt + (s->state ? 2 : 0)
*/
static PyObject *interned = NULL;
...
void
PyUnicode_InternInPlace(PyObject **p)
{
PyObject *s = *p;
PyObject *t;
#ifdef Py_DEBUG
assert(s != NULL);
assert(_PyUnicode_CHECK(s));
#else
if (s == NULL || !PyUnicode_Check(s))
return;
#endif
/* If it's a subclass, we don't really know what putting
it in the interned dict might do. */
if (!PyUnicode_CheckExact(s))
return;
// [1]
if (PyUnicode_CHECK_INTERNED(s))
return;
if (interned == NULL) {
interned = PyDict_New();
if (interned == NULL) {
PyErr_Clear(); /* Don't leave an exception */
return;
}
}
Py_ALLOW_RECURSION
// [2]
t = PyDict_SetDefault(interned, s, s);
Py_END_ALLOW_RECURSION
if (t == NULL) {
PyErr_Clear();
return;
}
// [3]
if (t != s) {
Py_INCREF(t);
Py_SETREF(*p, t);
return;
}
// [4]
/* The two references in interned are not counted by refcnt.
The deallocator will take care of this */
Py_REFCNT(s) -= 2;
_PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}
From this function we learn that Python maintains an interned pointer, which refers to an object created by PyDict_New; PyDict_New creates a PyDictObject, i.e. an object of Python's dict type. The intern mechanism, then, simply maintains a dictionary recording every string object that has been interned. At [1], the PyUnicode_CHECK_INTERNED macro checks whether the string object's state.interned flag is set; the macro is defined in Include/unicodeobject.h:
/* Use only if you know it's a string */
#define PyUnicode_CHECK_INTERNED(op) \
(((PyASCIIObject *)(op))->state.interned)
If the string object's state.interned is already set, the function returns immediately. At [2] it tries to add the not-yet-interned string object s to the interned dictionary as both key and value. At [3], t != s means an equal string already exists in the dictionary (as the value t): t's refcount is incremented and *p is redirected to t via the Py_SETREF macro, after which the object p originally pointed to is reclaimed once its refcount drops to zero. Py_SETREF is defined in Include/object.h:
/* Safely decref `op` and set `op` to `op2`.
*
* As in case of Py_CLEAR "the obvious" code can be deadly:
*
* Py_DECREF(op);
* op = op2;
*
* The safe way is:
*
* Py_SETREF(op, op2);
*
* That arranges to set `op` to `op2` _before_ decref'ing, so that any code
* triggered as a side-effect of `op` getting torn down no longer believes
* `op` points to a valid object.
*
* Py_XSETREF is a variant of Py_SETREF that uses Py_XDECREF instead of
* Py_DECREF.
*/
#define Py_SETREF(op, op2) \
do { \
PyObject *_py_tmp = (PyObject *)(op); \
(op) = (op2); \
Py_DECREF(_py_tmp); \
} while (0)
At [4], the refcount of the string just added to the interned dictionary is decremented by 2, and state.interned is set to SSTATE_INTERNED_MORTAL. SSTATE_INTERNED_MORTAL means the string has been interned but is still reclaimed when its refcount drops to zero; the other value, SSTATE_INTERNED_IMMORTAL, marks an interned string that can never be destroyed and lives as long as the interpreter.
This is why the refcount needs the minus-2 adjustment:
For interned PyStringObjects, Python uses a special refcounting scheme. When a PyStringObject a is added to interned with its PyObject pointer as both key and value, the PyDictObject increments a's refcount twice through those two pointers. But Python's designers decided the two pointers inside interned must not be treated as valid references to a: if they were, a's refcount could never reach 0 before the interpreter exits, since interned would always hold at least two references to it, and a could never be deleted, which clearly makes no sense. -- from 《Python源码剖析》, Chen Ru
In fact, even when a string is destined to be interned, Python first creates a temporary PyUnicodeObject, then checks whether an equal value already exists in the interned dictionary; if so, the value stored in interned is returned, and the temporary object is reclaimed when its refcount drops to zero. (The original post illustrates this process with a diagram taken from 《Python源码剖析》.)
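The whole dance can be modeled in a few lines of Python. This is a sketch only: the names _interned and intern_in_place are made up, and the refcount-minus-2 bookkeeping has no Python-level equivalent:

```python
# Hypothetical pure-Python model of PyUnicode_InternInPlace.
_interned = {}

def intern_in_place(s):
    # dict.setdefault mirrors PyDict_SetDefault: insert s as key and
    # value if absent, otherwise return the existing canonical object.
    return _interned.setdefault(s, s)

x = "".join(["py", "thon"])    # fresh, non-interned object
y = "".join(["py", "thon"])    # equal but distinct object
print(x is y)                  # False
print(intern_in_place(x) is intern_in_place(y))  # True
```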
0x01 Creating string objects
From the comments in the PyUnicodeObject definition above we know that string objects are created through the PyUnicode_FromUnicode function, defined in Objects/unicodeobject.c:
PyObject *
PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
if (u == NULL)
return (PyObject*)_PyUnicode_New(size);
if (size < 0) {
PyErr_BadInternalCall();
return NULL;
}
return PyUnicode_FromWideChar(u, size);
}
PyObject *
PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size)
{
PyObject *unicode;
Py_UCS4 maxchar = 0;
Py_ssize_t num_surrogates;
if (u == NULL && size != 0) {
PyErr_BadInternalCall();
return NULL;
}
if (size == -1) {
size = wcslen(u);
}
/* If the Unicode data is known at construction time, we can apply
some optimizations which share commonly used objects. */
/* Optimization for empty strings */
if (size == 0)
_Py_RETURN_UNICODE_EMPTY();
/* Single character Unicode objects in the Latin-1 range are
shared when using this constructor */
if (size == 1 && (Py_UCS4)*u < 256)
return get_latin1_char((unsigned char)*u);
/* If not empty and not single character, copy the Unicode data
into the new object */
if (find_maxchar_surrogates(u, u + size,
&maxchar, &num_surrogates) == -1)
return NULL;
unicode = PyUnicode_New(size - num_surrogates, maxchar);
if (!unicode)
return NULL;
switch (PyUnicode_KIND(unicode)) {
case PyUnicode_1BYTE_KIND:
_PyUnicode_CONVERT_BYTES(Py_UNICODE, unsigned char,
u, u + size, PyUnicode_1BYTE_DATA(unicode));
break;
case PyUnicode_2BYTE_KIND:
#if Py_UNICODE_SIZE == 2
memcpy(PyUnicode_2BYTE_DATA(unicode), u, size * 2);
#else
_PyUnicode_CONVERT_BYTES(Py_UNICODE, Py_UCS2,
u, u + size, PyUnicode_2BYTE_DATA(unicode));
#endif
break;
case PyUnicode_4BYTE_KIND:
#if SIZEOF_WCHAR_T == 2
/* This is the only case which has to process surrogates, thus
a simple copy loop is not enough and we need a function. */
unicode_convert_wchar_to_ucs4(u, u + size, unicode);
#else
assert(num_surrogates == 0);
memcpy(PyUnicode_4BYTE_DATA(unicode), u, size * 4);
#endif
break;
default:
assert(0 && "Impossible state");
}
return unicode_result(unicode);
}
The real work happens inside PyUnicode_FromWideChar, which takes a C wide-character string and its size. A single-character string in the Latin-1 range is returned directly through get_latin1_char, also defined in Objects/unicodeobject.c:
/* Single character Unicode strings in the Latin-1 range are being
shared as well. */
static PyObject *unicode_latin1[256] = {NULL};
...
static PyObject*
get_latin1_char(unsigned char ch)
{
PyObject *unicode = unicode_latin1[ch];
if (!unicode) {
unicode = PyUnicode_New(1, ch);
if (!unicode)
return NULL;
PyUnicode_1BYTE_DATA(unicode)[0] = ch;
assert(_PyUnicode_CheckConsistency(unicode, 1));
unicode_latin1[ch] = unicode;
}
Py_INCREF(unicode);
return unicode;
}
get_latin1_char implements a character cache: unicode_latin1 is an array of PyObject pointers that caches single-character string objects. If a character is not yet in unicode_latin1, it is created with PyUnicode_New. PyUnicode_FromWideChar likewise creates string objects through PyUnicode_New.
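The cache is observable from Python (CPython-specific behavior): any length-1 string in the Latin-1 range is a shared object, while runtime-built strings outside that range are not:

```python
# CPython-specific: length-1 strings in the Latin-1 range (U+0000..U+00FF)
# come from the unicode_latin1 cache, so equal ones are the same object.
a = chr(0xE9)      # 'é', constructed at runtime
b = chr(0xE9)
print(a is b)      # True: served from the cache

c = chr(0x100)     # 'Ā', just outside the Latin-1 range
d = chr(0x100)
print(c is d)      # typically False: a fresh object each time
```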
PyObject *
PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
PyObject *obj;
PyCompactUnicodeObject *unicode;
void *data;
enum PyUnicode_Kind kind;
int is_sharing, is_ascii;
Py_ssize_t char_size;
Py_ssize_t struct_size;
/* Optimization for empty strings */
if (size == 0 && unicode_empty != NULL) {
Py_INCREF(unicode_empty);
return unicode_empty;
}
is_ascii = 0;
is_sharing = 0;
// [1]
struct_size = sizeof(PyCompactUnicodeObject);
if (maxchar < 128) {
kind = PyUnicode_1BYTE_KIND;
char_size = 1;
is_ascii = 1;
struct_size = sizeof(PyASCIIObject);
}
else if (maxchar < 256) {
kind = PyUnicode_1BYTE_KIND;
char_size = 1;
}
else if (maxchar < 65536) {
kind = PyUnicode_2BYTE_KIND;
char_size = 2;
if (sizeof(wchar_t) == 2)
is_sharing = 1;
}
else {
if (maxchar > MAX_UNICODE) {
PyErr_SetString(PyExc_SystemError,
"invalid maximum character passed to PyUnicode_New");
return NULL;
}
kind = PyUnicode_4BYTE_KIND;
char_size = 4;
if (sizeof(wchar_t) == 4)
is_sharing = 1;
}
/* Ensure we won't overflow the size. */
if (size < 0) {
PyErr_SetString(PyExc_SystemError,
"Negative size passed to PyUnicode_New");
return NULL;
}
if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
return PyErr_NoMemory();
/* Duplicated allocation code from _PyObject_New() instead of a call to
* PyObject_New() so we are able to allocate space for the object and
* it's data buffer.
*/
// [2]
obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
if (obj == NULL)
return PyErr_NoMemory();
obj = PyObject_INIT(obj, &PyUnicode_Type);
if (obj == NULL)
return NULL;
unicode = (PyCompactUnicodeObject *)obj;
if (is_ascii)
data = ((PyASCIIObject*)obj) + 1;
else
data = unicode + 1;
// [3]
_PyUnicode_LENGTH(unicode) = size;
_PyUnicode_HASH(unicode) = -1;
_PyUnicode_STATE(unicode).interned = 0;
_PyUnicode_STATE(unicode).kind = kind;
_PyUnicode_STATE(unicode).compact = 1;
_PyUnicode_STATE(unicode).ready = 1;
_PyUnicode_STATE(unicode).ascii = is_ascii;
// [4]
if (is_ascii) {
((char*)data)[size] = 0;
_PyUnicode_WSTR(unicode) = NULL;
}
else if (kind == PyUnicode_1BYTE_KIND) {
((char*)data)[size] = 0;
_PyUnicode_WSTR(unicode) = NULL;
_PyUnicode_WSTR_LENGTH(unicode) = 0;
unicode->utf8 = NULL;
unicode->utf8_length = 0;
}
else {
unicode->utf8 = NULL;
unicode->utf8_length = 0;
if (kind == PyUnicode_2BYTE_KIND)
((Py_UCS2*)data)[size] = 0;
else /* kind == PyUnicode_4BYTE_KIND */
((Py_UCS4*)data)[size] = 0;
if (is_sharing) {
_PyUnicode_WSTR_LENGTH(unicode) = size;
_PyUnicode_WSTR(unicode) = (wchar_t *)data;
}
else {
_PyUnicode_WSTR_LENGTH(unicode) = 0;
_PyUnicode_WSTR(unicode) = NULL;
}
}
#ifdef Py_DEBUG
unicode_fill_invalid((PyObject*)unicode, 0);
#endif
assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
return obj;
}
PyUnicode_New decides from the size and maxchar arguments which structure to allocate: at [1] it picks PyASCIIObject for ASCII-only strings and PyCompactUnicodeObject otherwise, at [2] it allocates a single memory block for the structure plus the character data, at [3] it initializes the structure's members, and at [4] it writes the terminating NUL and initializes the remaining fields.
After this, the characters are copied into the object, encoded with one of the layouts and stored right after the structure. Which layout is used is determined by the state.kind flag, which in turn is determined by the largest character maxchar in the original string.
The data location is found through the PyUnicode_1BYTE_DATA, PyUnicode_2BYTE_DATA and PyUnicode_4BYTE_DATA macros, defined in Include/unicodeobject.h:
/* Return pointers to the canonical representation cast to unsigned char,
Py_UCS2, or Py_UCS4 for direct character access.
No checks are performed, use PyUnicode_KIND() before to ensure
these will work correctly. */
#define PyUnicode_1BYTE_DATA(op) ((Py_UCS1*)PyUnicode_DATA(op))
#define PyUnicode_2BYTE_DATA(op) ((Py_UCS2*)PyUnicode_DATA(op))
#define PyUnicode_4BYTE_DATA(op) ((Py_UCS4*)PyUnicode_DATA(op))
...
/* Return a void pointer to the raw unicode buffer. */
#define _PyUnicode_COMPACT_DATA(op) \
(PyUnicode_IS_ASCII(op) ? \
((void*)((PyASCIIObject*)(op) + 1)) : \
((void*)((PyCompactUnicodeObject*)(op) + 1)))
#define _PyUnicode_NONCOMPACT_DATA(op) \
(assert(((PyUnicodeObject*)(op))->data.any), \
((((PyUnicodeObject *)(op))->data.any)))
#define PyUnicode_DATA(op) \
(assert(PyUnicode_Check(op)), \
PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) : \
_PyUnicode_NONCOMPACT_DATA(op))
0x02 Operations on string objects
The string type object PyUnicode_Type is defined in Objects/unicodeobject.c:
PyTypeObject PyUnicode_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"str", /* tp_name */
sizeof(PyUnicodeObject), /* tp_size */
...
unicode_repr, /* tp_repr */
&unicode_as_number, /* tp_as_number */
&unicode_as_sequence, /* tp_as_sequence */
&unicode_as_mapping, /* tp_as_mapping */
(hashfunc) unicode_hash, /* tp_hash*/
0, /* tp_call*/
...
};
From PyUnicode_Type we can see that string objects support number-like operations (tp_as_number), sequence operations (tp_as_sequence) and mapping operations (tp_as_mapping). Let's take a quick look at the sequence operations:
static PySequenceMethods unicode_as_sequence = {
(lenfunc) unicode_length, /* sq_length */
PyUnicode_Concat, /* sq_concat */
(ssizeargfunc) unicode_repeat, /* sq_repeat */
(ssizeargfunc) unicode_getitem, /* sq_item */
0, /* sq_slice */
0, /* sq_ass_item */
0, /* sq_ass_slice */
PyUnicode_Contains, /* sq_contains */
};
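A quick sanity check of how these slots surface at the Python level (a sketch; the slot-to-function mapping follows the unicode_as_sequence table above):

```python
# Sequence-slot dispatch from the Python level.
s = "python"
assert len(s) == 6                 # sq_length   -> unicode_length
assert s + "3" == "python3"        # sq_concat   -> PyUnicode_Concat
assert s * 2 == "pythonpython"     # sq_repeat   -> unicode_repeat
assert s[0] == "p"                 # sq_item     -> unicode_getitem
assert "tho" in s                  # sq_contains -> PyUnicode_Contains
print("all sequence operations OK")
```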
String concatenation
Python's "+" operator on strings corresponds to the PyUnicode_Concat function:
/* Concat to string or Unicode object giving a new Unicode object. */
PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
{
PyObject *result;
Py_UCS4 maxchar, maxchar2;
Py_ssize_t left_len, right_len, new_len;
if (ensure_unicode(left) < 0)
return NULL;
if (!PyUnicode_Check(right)) {
PyErr_Format(PyExc_TypeError,
"can only concatenate str (not \"%.200s\") to str",
right->ob_type->tp_name);
return NULL;
}
if (PyUnicode_READY(right) < 0)
return NULL;
/* Shortcuts */
if (left == unicode_empty)
return PyUnicode_FromObject(right);
if (right == unicode_empty)
return PyUnicode_FromObject(left);
left_len = PyUnicode_GET_LENGTH(left);
right_len = PyUnicode_GET_LENGTH(right);
if (left_len > PY_SSIZE_T_MAX - right_len) {
PyErr_SetString(PyExc_OverflowError,
"strings are too large to concat");
return NULL;
}
// [1]
new_len = left_len + right_len;
maxchar = PyUnicode_MAX_CHAR_VALUE(left);
maxchar2 = PyUnicode_MAX_CHAR_VALUE(right);
maxchar = Py_MAX(maxchar, maxchar2);
/* Concat the two Unicode strings */
// [2]
result = PyUnicode_New(new_len, maxchar);
if (result == NULL)
return NULL;
// [3]
_PyUnicode_FastCopyCharacters(result, 0, left, 0, left_len);
_PyUnicode_FastCopyCharacters(result, left_len, right, 0, right_len);
assert(_PyUnicode_CheckConsistency(result, 1));
return result;
}
Because string objects are immutable, concatenation creates a new string object. So to join two strings, the code [1] first computes the length of the result, [2] allocates a new object through PyUnicode_New, and finally [3] copies the character data of both operands into it.
Applied to N strings this becomes very inefficient: concatenating N strings takes N-1 memory allocations and (N-1)*2 copy operations, plus hidden garbage collection of the intermediate results. The better approach is to concatenate with join.
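The difference is easy to probe with timeit. Note that CPython's evaluator also has an in-place optimization for s = s + p when s holds the only reference, so the measured gap can be smaller than the N-1-allocations analysis suggests; exact timings vary by machine:

```python
import timeit

parts = ["x"] * 10_000

def concat_plus():
    s = ""
    for p in parts:
        s = s + p          # may allocate and copy a new string each step
    return s

def concat_join():
    return "".join(parts)  # one allocation, one pass over the data

assert concat_plus() == concat_join() == "x" * 10_000
print("plus:", timeit.timeit(concat_plus, number=20))
print("join:", timeit.timeit(concat_join, number=20))
```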
String hashing
As mentioned earlier, the hash member of PyUnicodeObject caches the string object's hash value; hashing is implemented in the unicode_hash function:
/* Believe it or not, this produces the same value for ASCII strings
as bytes_hash(). */
static Py_hash_t
unicode_hash(PyObject *self)
{
Py_ssize_t len;
Py_uhash_t x; /* Unsigned for defined overflow behavior. */
#ifdef Py_DEBUG
assert(_Py_HashSecret_Initialized);
#endif
if (_PyUnicode_HASH(self) != -1)
return _PyUnicode_HASH(self);
if (PyUnicode_READY(self) == -1)
return -1;
len = PyUnicode_GET_LENGTH(self);
/*
We make the hash of the empty string be 0, rather than using
(prefix ^ suffix), since this slightly obfuscates the hash secret
*/
if (len == 0) {
_PyUnicode_HASH(self) = 0;
return 0;
}
x = _Py_HashBytes(PyUnicode_DATA(self),
PyUnicode_GET_LENGTH(self) * PyUnicode_KIND(self));
_PyUnicode_HASH(self) = x;
return x;
}
Believe it or not, unicode_hash produces the same value for ASCII strings as bytes_hash does.
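Both properties — the bytes/str hash agreement for ASCII text and the hard-coded zero hash of the empty string — can be checked directly (CPython implementation details, not language guarantees):

```python
# For ASCII text, unicode_hash and bytes_hash digest the same raw
# bytes with the same secret, so the hashes coincide.
s = "hello"
print(hash(s) == hash(s.encode("ascii")))   # True

# The empty string's hash is set to 0 (see the C code above).
print(hash(""))                              # 0
```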
The _Py_HashBytes function is defined in Python/pyhash.c:
Py_hash_t
_Py_HashBytes(const void *src, Py_ssize_t len)
{
Py_hash_t x;
/*
We make the hash of the empty string be 0, rather than using
(prefix ^ suffix), since this slightly obfuscates the hash secret
*/
if (len == 0) {
return 0;
}
#ifdef Py_HASH_STATS
hashstats[(len <= Py_HASH_STATS_MAX) ? len : 0]++;
#endif
#if Py_HASH_CUTOFF > 0
if (len < Py_HASH_CUTOFF) {
// [1]
/* Optimize hashing of very small strings with inline DJBX33A. */
Py_uhash_t hash;
const unsigned char *p = src;
hash = 5381; /* DJBX33A starts with 5381 */
switch(len) {
/* ((hash << 5) + hash) + *p == hash * 33 + *p */
case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 6: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 5: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 4: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 3: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 2: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
case 1: hash = ((hash << 5) + hash) + *p++; break;
default:
assert(0);
}
hash ^= len;
hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix;
x = (Py_hash_t)hash;
}
else
#endif /* Py_HASH_CUTOFF */
// [2]
x = PyHash_Func.hash(src, len);
if (x == -1)
return -2;
return x;
}
Python 3.4 introduced SipHash as the default hash algorithm, keeping the older FNV algorithm for platforms without a 64-bit integer type; both are reached through PyHash_Func.hash at [2], while [1] is an optional inline DJBX33A fast path for very short strings, enabled only when Py_HASH_CUTOFF > 0. PEP 456 -- Secure and interchangeable hash algorithm describes this change.
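sys.hash_info reports which algorithm the running interpreter was built with (siphash13 replaced siphash24 as the default in CPython 3.11, so the exact name depends on the version):

```python
import sys

# PEP 456: the configured string/bytes hash algorithm.
print(sys.hash_info.algorithm)   # e.g. 'siphash24', 'siphash13' or 'fnv'
print(sys.hash_info.hash_bits)   # width of Py_hash_t on this platform
```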
Further reading
What restrictions apply to interned strings? I think the limit is somewhere in PyDict_SetDefault() — for example, strings containing spaces don't seem to get interned. Could you explain? Thanks!
In [50]: a = 'python'
In [51]: b = 'python'
In [52]: id(a)
Out[52]: 442782398256
In [53]: id(b)
Out[53]: 442782398256
In [54]: b = 'hello python'
In [55]: a = 'hello python'
In [56]: id(a)
Out[56]: 442808585520
In [57]: id(b)
Out[57]: 442726541488
@Panlq Interning has nothing to do with PyDict_SetDefault; PyDict_SetDefault is just a dict operation.
The interning you're seeing actually happens at compile time; the relevant code is in Objects/codeobject.c:
// Objects/codeobject.c
/* Intern selected string constants */
static int
intern_string_constants(PyObject *tuple)
{
int modified = 0;
Py_ssize_t i;
for (i = PyTuple_GET_SIZE(tuple); --i >= 0; ) {
PyObject *v = PyTuple_GET_ITEM(tuple, i);
if (PyUnicode_CheckExact(v)) {
if (PyUnicode_READY(v) == -1) {
PyErr_Clear();
continue;
}
if (all_name_chars(v)) {
PyObject *w = v;
PyUnicode_InternInPlace(&v);
if (w != v) {
PyTuple_SET_ITEM(tuple, i, v);
modified = 1;
}
}
}
}
return modified;
}
As you can see, all_name_chars decides whether a string gets interned; in short, strings made up of ASCII letters, digits and underscores are cached.
Then again, you'll notice:
In [1]: a = ' '
In [2]: b = ' '
In [3]: id(a), id(b)
Out[3]: (4417981944, 4417981944)
It looks like the space got cached too, contradicting the rule above. Not quite: Python 3 also caches every length-1 string in the latin1 range through unicode_latin1[256].
// Objects/unicodeobject.c
/* Single character Unicode strings in the Latin-1 range are being
shared as well. */
static PyObject *unicode_latin1[256] = {NULL};
Similarly, the empty string is also cached:
// Objects/unicodeobject.c
/* The empty Unicode object is shared to improve performance. */
static PyObject *unicode_empty = NULL;
Also, Python's compiler optimizes the bytecode at compile time (constant folding), e.g.:
In [1]: a = 'hello' + 'python'
In [2]: b = 'hellopython'
In [3]: id(a), id(b)
Out[3]: (4440501936, 4440501936)
In [4]: 'hello ' + 'python' is 'hello python'
Out[4]: True
As you can see, even strings containing spaces can end up as the same object here.
In [1]: import dis
In [2]: dis.dis("'hello ' + 'python' is 'hello python'")
1 0 LOAD_CONST 0 ('hello python')
2 LOAD_CONST 0 ('hello python')
After dis you can see that both LOAD_CONST instructions point to constant index 0.
This last part seems wrong — are you using ipython3? I get different results:
In [8]: a = 'hello' + 'python'
In [9]: b = 'hellopython'
In [10]: a is b
Out[10]: True
In [11]: a = 'hello ' + 'python'
In [12]: b = 'hello python'
In [13]: id(a)
Out[13]: 118388503536
In [14]: id(b)
Out[14]: 118387544240
In [15]: 'hello ' + 'python' is 'hello python'
Out[15]: False
Following your hint I looked at the corresponding code:
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
static int
all_name_chars(PyObject *o)
{
const unsigned char *s, *e;
if (!PyUnicode_IS_ASCII(o))
return 0;
s = PyUnicode_1BYTE_DATA(o);
e = s + PyUnicode_GET_LENGTH(o);
for (; s != e; s++) {
if (!Py_ISALNUM(*s) && *s != '_')
return 0;
}
return 1;
}
As you said in your first reply, only strings made up of letters, digits and underscores get interned. To summarize:
- Single characters: every length-1 latin1 character (including the ASCII range, and including the space character) is cached and stays resident in memory.
- Strings composed of letters, digits and underscores are interned; normally they are destroyed when their refcount drops to 0.
@Panlq Yep, I'm using Python 3; some of the implementation details differ between 2 and 3.
One question, if I may:
PyObject *
PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
if (u == NULL)
return (PyObject*)_PyUnicode_New(size);
if (size < 0) {
PyErr_BadInternalCall();
return NULL;
}
return PyUnicode_FromWideChar(u, size);
}
In PyUnicode_FromUnicode, it is _PyUnicode_New that returns a PyUnicodeObject, while PyUnicode_New, called from PyUnicode_FromWideChar, only handles PyASCIIObject and PyCompactUnicodeObject. It looks like a PyUnicodeObject is only returned when const Py_UNICODE *u == NULL — so in what scenario is u == NULL?