blog
blog copied to clipboard
Python的垃圾回收机制
0x00 引用计数
在Python中,大多数对象的生命周期都是通过对象的引用计数来管理的。
增引用操作
#define Py_INCREF(op) ( \
_Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
((PyObject *)(op))->ob_refcnt++)
减引用操作
#define Py_DECREF(op) \
do { \
PyObject *_py_decref_tmp = (PyObject *)(op); \
if (_Py_DEC_REFTOTAL _Py_REF_DEBUG_COMMA \
--(_py_decref_tmp)->ob_refcnt != 0) \
_Py_CHECK_REFCNT(_py_decref_tmp) \
else \
_Py_Dealloc(_py_decref_tmp); \
} while (0)
如果减引用之后计数器不为0就会通过_Py_CHECK_REFCNT检查引用计数器是否为负。
#define _Py_CHECK_REFCNT(OP) \
{ if (((PyObject*)OP)->ob_refcnt < 0) \
_Py_NegativeRefcount(__FILE__, __LINE__, \
(PyObject *)(OP)); \
}
_Py_NegativeRefcount函数会把计数器为负的对象信息作为错误信息输出,函数定义在Objects/object.c中:
#ifdef Py_REF_DEBUG
/* Log a fatal error; doesn't return. */
void
_Py_NegativeRefcount(const char *fname, int lineno, PyObject *op)
{
char buf[300];
PyOS_snprintf(buf, sizeof(buf),
"%s:%i object at %p has negative ref count "
"%" PY_FORMAT_SIZE_T "d",
fname, lineno, op, op->ob_refcnt);
Py_FatalError(buf);
}
#endif /* Py_REF_DEBUG */
如果计数器为0就通过调用_Py_Dealloc来释放对象。
#define _Py_Dealloc(op) ( \
_Py_INC_TPFREES(op) _Py_COUNT_ALLOCS_COMMA \
(*Py_TYPE(op)->tp_dealloc)((PyObject *)(op)))
tp_dealloc指向了对象的析构函数。
所以引用计数的减量过程是这样的一个调用链:
Py_DECREF -> _Py_Dealloc -> tp_dealloc -> tp_free -> _PyObject_Free
此外,Python也提供了检查NULL的宏:
/* Macros to use in case the object pointer may be NULL: */
#define Py_XINCREF(op) \
do { \
PyObject *_py_xincref_tmp = (PyObject *)(op); \
if (_py_xincref_tmp != NULL) \
Py_INCREF(_py_xincref_tmp); \
} while (0)
#define Py_XDECREF(op) \
do { \
PyObject *_py_xdecref_tmp = (PyObject *)(op); \
if (_py_xdecref_tmp != NULL) \
Py_DECREF(_py_xdecref_tmp); \
} while (0)
0x01 标记清除「Mark-Sweep」& 分代
在Python中,主要的内存管理手段是引用计数机制,而标记-清除和分代收集只是为了打破循环引用而引入的补充技术。这一事实意味着Python中的垃圾收集只关注可能会产生循环引用的对象。很显然,像PyIntObject、PyStringObject这些对象是绝不可能产生循环引用的,因为它们内部不可能持有对其他对象的引用。Python中的循环引用总是发生在container对象之间,所谓container对象即是内部可持有对其他对象的引用的对象,比如list、dict、class、instance等等。
Python通过PyGC_Head来跟踪container对象,PyGC_Head信息位于PyObject_HEAD之前,定义在Include/objimpl.h中:
typedef union _gc_head {
struct {
union _gc_head *gc_next;
union _gc_head *gc_prev;
Py_ssize_t gc_refs;
} gc;
double dummy; /* force worst-case alignment */
} PyGC_Head;
Python中用于分代垃圾回收有三个“代”来表示,由_gc_runtime_state.generations数组所表示着:
// Include/internal/mem.h
/* If we change this, we need to change the default value in the
signature of gc.collect. */
#define NUM_GENERATIONS 3
...
struct _gc_runtime_state {
...
int enabled;
int debug;
/* linked lists of container objects */
struct gc_generation generations[NUM_GENERATIONS];
PyGC_Head *generation0;
...
/* This is the number of objects that survived the last full
collection. It approximates the number of long lived objects
tracked by the GC.
(by "full collection", we mean a collection of the oldest
generation). */
Py_ssize_t long_lived_total;
/* This is the number of objects that survived all "non-full"
collections, and are awaiting to undergo a full collection for
the first time. */
Py_ssize_t long_lived_pending;
};
_gc_runtime_state是gc运行状态对象,enabled表示是否启用gc。
Python中不是单独回收其中一代的垃圾对象,而是将需要回收的代以及年轻代合并一起回收,
- 完全回收(full collection):回收全部代的垃圾对象;
- 不完全回收(non-full collection):回收除最老代以外的其他代,只回收第0代或者第0代和第1代一起回收。
long_lived_total表示上一次完全回收中幸存的对象,long_lived_pending表示非完全回收幸存的对象,等待一次完全回收。下面会看到Python中利用这两个指标来控制完全回收频率。
_gc_runtime_state在Modules/gcmodule.c被_PyGC_Initialize函数初始化。
void
_PyGC_Initialize(struct _gc_runtime_state *state)
{
state->enabled = 1; /* automatic collection enabled? */
#define _GEN_HEAD(n) (&state->generations[n].head)
struct gc_generation generations[NUM_GENERATIONS] = {
/* PyGC_Head, threshold, count */
{{{_GEN_HEAD(0), _GEN_HEAD(0), 0}}, 700, 0},
{{{_GEN_HEAD(1), _GEN_HEAD(1), 0}}, 10, 0},
{{{_GEN_HEAD(2), _GEN_HEAD(2), 0}}, 10, 0},
};
for (int i = 0; i < NUM_GENERATIONS; i++) {
state->generations[i] = generations[i];
};
state->generation0 = GEN_HEAD(0);
...
}
gc_generation表示分代回收中的一个“代”,定义在Include/internal/mem.h:
struct gc_generation {
PyGC_Head head;
int threshold; /* collection threshold */
int count; /* count of allocations or collections of younger
generations */
};
head指向PyGC_Head连接的双向container对象链表,可见三个“代”其实就是三条链表来实现的;
threshold表示每一代垃圾回收的阈值,count则有两种表示:
- 在新生代中表示分配的container对象数量;
- 中年代和老年代中表示前一代垃圾回收的次数。
每一代count超过threshold会触发垃圾回收,可以通过 gc.get_threshold() 来获取该值。
初始化的三个“代”,generation0指向第0“代”:

我们回过头去看看Python内置对象的创建过程就会发现,container对象是通过PyObject_GC_New函数来创建的,而非container对象是通过PyObject_Malloc函数来创建的。
PyObject_GC_New函数定义在Include/objimpl.h中:
#define PyObject_GC_New(type, typeobj) \
( (type *) _PyObject_GC_New(typeobj) )
调用了Modules/gcmodule.c中的_PyObject_GC_New函数:
PyObject *
_PyObject_GC_New(PyTypeObject *tp)
{
PyObject *op = _PyObject_GC_Malloc(_PyObject_SIZE(tp));
if (op != NULL)
op = PyObject_INIT(op, tp);
return op;
}
PyObject *
_PyObject_GC_Malloc(size_t basicsize)
{
return _PyObject_GC_Alloc(0, basicsize);
}
static PyObject *
_PyObject_GC_Alloc(int use_calloc, size_t basicsize)
{
PyObject *op;
PyGC_Head *g;
size_t size;
if (basicsize > PY_SSIZE_T_MAX - sizeof(PyGC_Head))
return PyErr_NoMemory();
size = sizeof(PyGC_Head) + basicsize;
// [1]
if (use_calloc)
g = (PyGC_Head *)PyObject_Calloc(1, size);
else
g = (PyGC_Head *)PyObject_Malloc(size);
if (g == NULL)
return PyErr_NoMemory();
// [2]
g->gc.gc_refs = 0;
_PyGCHead_SET_REFS(g, GC_UNTRACKED);
// [3]
generations[0].count++; /* number of allocated GC objects */
if (generations[0].count > generations[0].threshold &&
enabled &&
generations[0].threshold &&
!collecting &&
!PyErr_Occurred()) {
collecting = 1;
collect_generations();
collecting = 0;
}
// [4]
op = FROM_GC(g);
return op;
}
[1] 申请PyGC_Head和对象本身的内存
[2] 设置gc_refs的值,并标记为GC_UNTRACKED表示对象还未被追踪。
[3] 对第0“代”的count增量操作,并检查是否进行循环引用垃圾回收。
[4] FROM_GC宏定义可以通过PyGC_Head地址转换PyObject_HEAD地址,逆运算是AS_GC宏定义。
我们会发现_PyObject_GC_Alloc只是创建container对象,那container对象是什么时候加入第0“代”的container对象链表呢?
看下list对象的创建过程:
// Objects/listobject.c
PyObject *
PyList_New(Py_ssize_t size)
{
...
op = PyObject_GC_New(PyListObject, &PyList_Type);
if (op == NULL)
return NULL;
}
...
Py_SIZE(op) = size;
op->allocated = size;
_PyObject_GC_TRACK(op);
return (PyObject *) op;
}
答案是通过_PyObject_GC_TRACK宏来操作。
/* Tell the GC to track this object. NB: While the object is tracked the
* collector it must be safe to call the ob_traverse method. */
#define _PyObject_GC_TRACK(o) do { \
PyGC_Head *g = _Py_AS_GC(o); \
if (_PyGCHead_REFS(g) != _PyGC_REFS_UNTRACKED) \
Py_FatalError("GC object already tracked"); \
_PyGCHead_SET_REFS(g, _PyGC_REFS_REACHABLE); \
g->gc.gc_next = _PyGC_generation0; \
g->gc.gc_prev = _PyGC_generation0->gc.gc_prev; \
g->gc.gc_prev->gc.gc_next = g; \
_PyGC_generation0->gc.gc_prev = g; \
} while (0);
将container对象加入generation0链表中,并标记为_PyGC_REFS_REACHABLE表示已被追踪且可达。
每次创建container对象的时候会检查是否达到垃圾回收的触发条件[3],然后将_PyRuntime.gc.collection设为1,表示正在进行GC,collect_generations进入循环引用垃圾回收阶段。
static Py_ssize_t
collect_generations(void)
{
int i;
Py_ssize_t n = 0;
/* Find the oldest generation (highest numbered) where the count
* exceeds the threshold. Objects in the that generation and
* generations younger than it will be collected. */
for (i = NUM_GENERATIONS-1; i >= 0; i--) {
if (_PyRuntime.gc.generations[i].count > _PyRuntime.gc.generations[i].threshold) {
/* Avoid quadratic performance degradation in number
of tracked objects. See comments at the beginning
of this file, and issue #4074.
*/
if (i == NUM_GENERATIONS - 1
&& _PyRuntime.gc.long_lived_pending < _PyRuntime.gc.long_lived_total / 4)
continue;
n = collect_with_callback(i);
break;
}
}
return n;
}
/* Perform garbage collection of a generation and invoke
* progress callbacks.
*/
static Py_ssize_t
collect_with_callback(int generation)
{
Py_ssize_t result, collected, uncollectable;
invoke_gc_callback("start", generation, 0, 0);
result = collect(generation, &collected, &uncollectable, 0);
invoke_gc_callback("stop", generation, collected, uncollectable);
return result;
}
从注释中可以看出来,collect_generations函数寻找满足超过阈值的最老“代”(也就是generations数组中序号最高的那一“代”),并回收最老的这一“代”以及比这一“代”年轻的“代”的对象(通常情况下,第0代对象都是作为被回收对象,因为满足第0代触发条件)。
对于最老一代,必须满足long_lived_pending / long_lived_total超过25%的比例才进行完全垃圾回收。
collect_with_callback函数中对满足条件的“代”进行垃圾回收并回调通知垃圾回收开始和结束。
collect是垃圾回收的主入口函数。
/* This is the main function. Read this to understand how the
* collection process works. */
static Py_ssize_t
collect(int generation, Py_ssize_t *n_collected, Py_ssize_t *n_uncollectable,
int nofail)
{
int i;
Py_ssize_t m = 0; /* # objects collected */
Py_ssize_t n = 0; /* # unreachable objects that couldn't be collected */
PyGC_Head *young; /* the generation we are examining */
PyGC_Head *old; /* next older generation */
PyGC_Head unreachable; /* non-problematic unreachable trash */
PyGC_Head finalizers; /* objects with, & reachable from, __del__ */
PyGC_Head *gc;
_PyTime_t t1 = 0; /* initialize to prevent a compiler warning */
struct gc_generation_stats *stats = &_PyRuntime.gc.generation_stats[generation];
...
// “标记-清除”前的准备
// 垃圾标记
// 垃圾清除
...
/* Update stats */
if (n_collected)
*n_collected = m;
if (n_uncollectable)
*n_uncollectable = n;
stats->collections++;
stats->collected += m;
stats->uncollectable += n;
if (PyDTrace_GC_DONE_ENABLED())
PyDTrace_GC_DONE(n+m);
return n+m;
}
“标记-清除”前的准备
// [1]
/* update collection and allocation counters */
if (generation+1 < NUM_GENERATIONS)
_PyRuntime.gc.generations[generation+1].count += 1;
for (i = 0; i <= generation; i++)
_PyRuntime.gc.generations[i].count = 0;
// [2]
/* merge younger generations with one we are currently collecting */
for (i = 0; i < generation; i++) {
gc_list_merge(GEN_HEAD(i), GEN_HEAD(generation));
}
// [3]
/* handy references */
young = GEN_HEAD(generation);
if (generation < NUM_GENERATIONS-1)
old = GEN_HEAD(generation+1);
else
old = young;
// [4]
/* Using ob_refcnt and gc_refs, calculate which objects in the
* container set are reachable from outside the set (i.e., have a
* refcount greater than 0 when all the references within the
* set are taken into account).
*/
update_refs(young);
subtract_refs(young);
[1] 先更新了将被回收的“代”以及老一“代”的count计数器。
这边对老一“代”的count计数器增量1就可以看出来在第1“代”和第2“代”的count值其实表示的是该代垃圾回收的次数。
[2] 通过gc_list_merge函数将这些“代”合并成一个链表。gc_list_merge函数将前一个链表链接到后一个链表末尾并把前一个链表置为空链表。
[3] 经过合并操作之后,所有需要被进行垃圾回收的对象都链接到young“代”(满足超过阈值的最老“代”),并记录old“代”,后面需要将不可回收的对象移到old“代”。
链表的合并操作:

[4] 垃圾标记
标记-清除是一种基于追踪(Tracing)回收技术实现的垃圾回收算法,对象之间通过引用(指针)连在一起,构成一个有向图,对象构成这个有向图的节点,而引用关系构成这个有向图的边。从根对象(root object)出发,沿着有向边遍历对象:标记阶段,遍历到的对象,标记该对象为可达的(reachable),也就是还有对象引用它;清除阶段,遍历所有对象,如果发现某个对象没有标记为可达,则就将其回收。
要对合并的链表进行垃圾标记,首先需要寻找root object集合。 所谓的root object即是一些全局引用和函数栈中的引用。这些引用所用的对象是不可被删除的。
list1 = []
list2 = []
list1.append(list2)
list2.append(list1)
a = list1
del list1
del list2
思考上面的Python中循环引用的代码,变量a所指向的对象就是root object。
我们注意到这样一个事实,如果两个对象的引用计数都为1,但是仅仅存在它们之间的循环引用,那么这两个对象都是需要被回收的,也就是说,虽然它们的引用计数虽然表现为非0,但实际上有效的引用计数为0。这里,我们提出了有效引用计数的概念,为了从引用计数获得有效引用计数,必须将循环引用的影响去除,也就是说,将环从引用中摘除,具体的实现就是两个对象各自的引用计数值都减去1。这样一来,两个对象的引用计数都成为了0,我们挥去了循环引用的迷雾,使有效引用计数现出了真身。
/* Set all gc_refs = ob_refcnt. After this, gc_refs is > 0 for all objects
* in containers, and is GC_REACHABLE for all tracked gc objects not in
* containers.
*/
static void
update_refs(PyGC_Head *containers)
{
PyGC_Head *gc = containers->gc.gc_next;
for (; gc != containers; gc = gc->gc.gc_next) {
assert(_PyGCHead_REFS(gc) == GC_REACHABLE);
_PyGCHead_SET_REFS(gc, Py_REFCNT(FROM_GC(gc)));
/* Python's cyclic gc should never see an incoming refcount
* of 0: if something decref'ed to 0, it should have been
* deallocated immediately at that time.
* Possible cause (if the assert triggers): a tp_dealloc
* routine left a gc-aware object tracked during its teardown
* phase, and did something-- or allowed something to happen --
* that called back into Python. gc can trigger then, and may
* see the still-tracked dying object. Before this assert
* was added, such mistakes went on to allow gc to try to
* delete the object again. In a debug build, that caused
* a mysterious segfault, when _Py_ForgetReference tried
* to remove the object from the doubly-linked list of all
* objects a second time. In a release build, an actual
* double deallocation occurred, which leads to corruption
* of the allocator's internal bookkeeping pointers. That's
* so serious that maybe this should be a release-build
* check instead of an assert?
*/
assert(_PyGCHead_REFS(gc) != 0);
}
}
为了避免直接修改的对象的真实引用计数,update_refs函数将对象的引用计数复制一份副本到PyGC_Head中的gc.gc_ref中。
subtract_refs负责清除链表对象之间的循环引用。
/* Subtract internal references from gc_refs. After this, gc_refs is >= 0
* for all objects in containers, and is GC_REACHABLE for all tracked gc
* objects not in containers. The ones with gc_refs > 0 are directly
* reachable from outside containers, and so can't be collected.
*/
static void
subtract_refs(PyGC_Head *containers)
{
traverseproc traverse;
PyGC_Head *gc = containers->gc.gc_next;
for (; gc != containers; gc=gc->gc.gc_next) {
traverse = Py_TYPE(FROM_GC(gc))->tp_traverse;
(void) traverse(FROM_GC(gc),
(visitproc)visit_decref,
NULL);
}
}
subtract_refs函数遍历所有container对象,调用了container对象类型的tp_traverse指针指向的遍历函数,这里visit_decref函数作为访问者模式的访问者传入了。
我们可以看下list对象的tp_traverse指针指向的遍历函数,定义在Object/listobject.c中:
// Object/listobject.c
static int
list_traverse(PyListObject *o, visitproc visit, void *arg)
{
Py_ssize_t i;
for (i = Py_SIZE(o); --i >= 0; )
Py_VISIT(o->ob_item[i]);
return 0;
}
// Include/objimpl.h
/* Utility macro to help write tp_traverse functions.
* To use this macro, the tp_traverse function must name its arguments
* "visit" and "arg". This is intended to keep tp_traverse functions
* looking as much alike as possible.
*/
#define Py_VISIT(op) \
do { \
if (op) { \
int vret = visit((PyObject *)(op), arg); \
if (vret) \
return vret; \
} \
} while (0)
遍历列表里面的所有元素并调用Py_VISIT宏,对所有元素调用了visit函数(访问者),这里是visit_decref函数。
/* A traversal callback for subtract_refs. */
static int
visit_decref(PyObject *op, void *data)
{
assert(op != NULL);
if (PyObject_IS_GC(op)) {
PyGC_Head *gc = AS_GC(op);
/* We're only interested in gc_refs for objects in the
* generation being collected, which can be recognized
* because only they have positive gc_refs.
*/
assert(_PyGCHead_REFS(gc) != 0); /* else refcount was too small */
if (_PyGCHead_REFS(gc) > 0)
_PyGCHead_DECREF(gc);
}
return 0;
}
visit_decref函数对op对象的成员gc_refs进行减量操作。
执行完subtract_refs之后,可收集对象链表中所有contianer对象之间的环引用都被摘除了。
垃圾标记
// [1]
/* Leave everything reachable from outside young in young, and move
* everything else (in young) to unreachable.
* NOTE: This used to move the reachable objects into a reachable
* set instead. But most things usually turn out to be reachable,
* so it's more efficient to move the unreachable things.
*/
gc_list_init(&unreachable);
move_unreachable(young, &unreachable);
// [2]
/* Move reachable objects to next generation. */
if (young != old) {
if (generation == NUM_GENERATIONS - 2) {
_PyRuntime.gc.long_lived_pending += gc_list_size(young);
}
gc_list_merge(young, old);
}
else {
/* We only untrack dicts in full collections, to avoid quadratic
dict build-up. See issue #14775. */
untrack_dicts(young);
_PyRuntime.gc.long_lived_pending = 0;
_PyRuntime.gc.long_lived_total = gc_list_size(young);
}
[1] 初始化不可达链表,调用move_unreachable函数将循环引用的对象移动到不可达链表中:
/* Move the unreachable objects from young to unreachable. After this,
* all objects in young have gc_refs = GC_REACHABLE, and all objects in
* unreachable have gc_refs = GC_TENTATIVELY_UNREACHABLE. All tracked
* gc objects not in young or unreachable still have gc_refs = GC_REACHABLE.
* All objects in young after this are directly or indirectly reachable
* from outside the original young; and all objects in unreachable are
* not.
*/
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
PyGC_Head *gc = young->gc.gc_next;
/* Invariants: all objects "to the left" of us in young have gc_refs
* = GC_REACHABLE, and are indeed reachable (directly or indirectly)
* from outside the young list as it was at entry. All other objects
* from the original young "to the left" of us are in unreachable now,
* and have gc_refs = GC_TENTATIVELY_UNREACHABLE. All objects to the
* left of us in 'young' now have been scanned, and no objects here
* or to the right have been scanned yet.
*/
while (gc != young) {
PyGC_Head *next;
if (_PyGCHead_REFS(gc)) {
/* gc is definitely reachable from outside the
* original 'young'. Mark it as such, and traverse
* its pointers to find any other objects that may
* be directly reachable from it. Note that the
* call to tp_traverse may append objects to young,
* so we have to wait until it returns to determine
* the next object to visit.
*/
PyObject *op = FROM_GC(gc);
traverseproc traverse = Py_TYPE(op)->tp_traverse;
assert(_PyGCHead_REFS(gc) > 0);
_PyGCHead_SET_REFS(gc, GC_REACHABLE);
(void) traverse(op,
(visitproc)visit_reachable,
(void *)young);
next = gc->gc.gc_next;
if (PyTuple_CheckExact(op)) {
_PyTuple_MaybeUntrack(op);
}
}
else {
/* This *may* be unreachable. To make progress,
* assume it is. gc isn't directly reachable from
* any object we've already traversed, but may be
* reachable from an object we haven't gotten to yet.
* visit_reachable will eventually move gc back into
* young if that's so, and we'll see it again.
*/
next = gc->gc.gc_next;
gc_list_move(gc, unreachable);
_PyGCHead_SET_REFS(gc, GC_TENTATIVELY_UNREACHABLE);
}
gc = next;
}
}
这边遍历young“代”的container对象链表,_PyGCHead_REFS(gc)判断是不是root object或从某个root object能直接/间接引用的对象,由于root object集合中的对象是不能回收的,因此,被这些对象直接或间接引用的对象也是不能回收的。
_PyGCHead_REFS(gc)为0并不能断定这个对象是可回收的,但是还是先移动到unreachable链表中,设置了GC_TENTATIVELY_UNREACHABLE标志表示暂且认为是不可达的,如果是存在被root object直接或间接引用的对象,这样的对象还会被移出unreachable链表中。
遍历函数传入访问者visit_reachable函数:
/* A traversal callback for move_unreachable. */
static int
visit_reachable(PyObject *op, PyGC_Head *reachable)
{
if (PyObject_IS_GC(op)) {
PyGC_Head *gc = AS_GC(op);
const Py_ssize_t gc_refs = _PyGCHead_REFS(gc);
if (gc_refs == 0) {
/* This is in move_unreachable's 'young' list, but
* the traversal hasn't yet gotten to it. All
* we need to do is tell move_unreachable that it's
* reachable.
*/
_PyGCHead_SET_REFS(gc, 1);
}
else if (gc_refs == GC_TENTATIVELY_UNREACHABLE) {
/* This had gc_refs = 0 when move_unreachable got
* to it, but turns out it's reachable after all.
* Move it back to move_unreachable's 'young' list,
* and move_unreachable will eventually get to it
* again.
*/
gc_list_move(gc, reachable);
_PyGCHead_SET_REFS(gc, 1);
}
/* Else there's nothing to do.
* If gc_refs > 0, it must be in move_unreachable's 'young'
* list, and move_unreachable will eventually get to it.
* If gc_refs == GC_REACHABLE, it's either in some other
* generation so we don't care about it, or move_unreachable
* already dealt with it.
* If gc_refs == GC_UNTRACKED, it must be ignored.
*/
else {
assert(gc_refs > 0
|| gc_refs == GC_REACHABLE
|| gc_refs == GC_UNTRACKED);
}
}
return 0;
}
- 如果
gc_refs为0,表示还没被遍历到,就设置为1,表示引用关系,为不可回收的对象。 - 如果
gc_refs是GC_TENTATIVELY_UNREACHABLE标记,需要将被移到unreachable链表的对象再次移出来到young“代”的container链表中,并设置gc_refs为1。
[2] 将可达的对象移到下一“代”。
垃圾清除
// [1]
/* All objects in unreachable are trash, but objects reachable from
* legacy finalizers (e.g. tp_del) can't safely be deleted.
*/
gc_list_init(&finalizers);
move_legacy_finalizers(&unreachable, &finalizers);
/* finalizers contains the unreachable objects with a legacy finalizer;
* unreachable objects reachable *from* those are also uncollectable,
* and we move those into the finalizers list too.
*/
move_legacy_finalizer_reachable(&finalizers);
// [2]
/* Collect statistics on collectable objects found and print
* debugging information.
*/
for (gc = unreachable.gc.gc_next; gc != &unreachable;
gc = gc->gc.gc_next) {
m++;
}
// [3]
/* Clear weakrefs and invoke callbacks as necessary. */
m += handle_weakrefs(&unreachable, old);
// [4]
/* Call tp_finalize on objects which have one. */
finalize_garbage(&unreachable);
// [5]
if (check_garbage(&unreachable)) {
revive_garbage(&unreachable);
gc_list_merge(&unreachable, old);
}
else {
/* Call tp_clear on objects in the unreachable set. This will cause
* the reference cycles to be broken. It may also cause some objects
* in finalizers to be freed.
*/
delete_garbage(&unreachable, old);
}
// [6]
/* Collect statistics on uncollectable objects found and print
* debugging information. */
for (gc = finalizers.gc.gc_next;
gc != &finalizers;
gc = gc->gc.gc_next) {
n++;
}
...
// [7]
/* Append instances in the uncollectable set to a Python
* reachable list of garbage. The programmer has to deal with
* this if they insist on creating this type of structure.
*/
(void)handle_legacy_finalizers(&finalizers, old);
/* Clear free list only during the collection of the highest
* generation */
if (generation == NUM_GENERATIONS-1) {
clear_freelists();
}
[1] 处理unreachable链表中有finalizer的对象。
/* Move the objects in unreachable with tp_del slots into `finalizers`.
* Objects moved into `finalizers` have gc_refs set to GC_REACHABLE; the
* objects remaining in unreachable are left at GC_TENTATIVELY_UNREACHABLE.
*/
static void
move_legacy_finalizers(PyGC_Head *unreachable, PyGC_Head *finalizers)
{
PyGC_Head *gc;
PyGC_Head *next;
/* March over unreachable. Move objects with finalizers into
* `finalizers`.
*/
for (gc = unreachable->gc.gc_next; gc != unreachable; gc = next) {
PyObject *op = FROM_GC(gc);
assert(IS_TENTATIVELY_UNREACHABLE(op));
next = gc->gc.gc_next;
if (has_legacy_finalizer(op)) {
gc_list_move(gc, finalizers);
_PyGCHead_SET_REFS(gc, GC_REACHABLE);
}
}
}
/* Return true if object has a pre-PEP 442 finalization method. */
static int
has_legacy_finalizer(PyObject *op)
{
return op->ob_type->tp_del != NULL;
}
遍历unreachable链表,将拥有finalizer的实例对象移到finalizers链表中,并标示为GC_REACHABLE。拥有finalizer的实例对象指的就是实现了tp_del函数的对象。
/* Move objects that are reachable from finalizers, from the unreachable set
* into finalizers set.
*/
static void
move_legacy_finalizer_reachable(PyGC_Head *finalizers)
{
traverseproc traverse;
PyGC_Head *gc = finalizers->gc.gc_next;
for (; gc != finalizers; gc = gc->gc.gc_next) {
/* Note that the finalizers list may grow during this. */
traverse = Py_TYPE(FROM_GC(gc))->tp_traverse;
(void) traverse(FROM_GC(gc),
(visitproc)visit_move,
(void *)finalizers);
}
}
对finalizers链表中拥有finalizer的实例对象遍历其引用对象,调用visit_move访问者,这些被引用的对象也不应该被释放。
/* A traversal callback for move_legacy_finalizer_reachable. */
static int
visit_move(PyObject *op, PyGC_Head *tolist)
{
if (PyObject_IS_GC(op)) {
if (IS_TENTATIVELY_UNREACHABLE(op)) {
PyGC_Head *gc = AS_GC(op);
gc_list_move(gc, tolist);
_PyGCHead_SET_REFS(gc, GC_REACHABLE);
}
}
return 0;
}
#define IS_TENTATIVELY_UNREACHABLE(o) ( \
_PyGC_REFS(o) == GC_TENTATIVELY_UNREACHABLE)
visit_move函数将引用对象还在unreachable链表的对象移到finalizers链表中。
[2] 统计unreachable链表数量。
[3] 处理弱引用。
[4][5] 开始清除垃圾对象,我们先只看delete_garbage函数:
/* Break reference cycles by clearing the containers involved. This is
* tricky business as the lists can be changing and we don't know which
* objects may be freed. It is possible I screwed something up here.
*/
static void
delete_garbage(PyGC_Head *collectable, PyGC_Head *old)
{
inquiry clear;
while (!gc_list_is_empty(collectable)) {
PyGC_Head *gc = collectable->gc.gc_next;
PyObject *op = FROM_GC(gc);
if (_PyRuntime.gc.debug & DEBUG_SAVEALL) {
PyList_Append(_PyRuntime.gc.garbage, op);
}
else {
if ((clear = Py_TYPE(op)->tp_clear) != NULL) {
Py_INCREF(op);
clear(op);
Py_DECREF(op);
}
}
if (collectable->gc.gc_next == gc) {
/* object is still alive, move it, it may die later */
gc_list_move(gc, old);
_PyGCHead_SET_REFS(gc, GC_REACHABLE);
}
}
}
遍历unreachable链表中的container对象,调用其类型对象的tp_clear指针指向的函数,我们以list对象为例:
static int
_list_clear(PyListObject *a)
{
Py_ssize_t i;
PyObject **item = a->ob_item;
if (item != NULL) {
/* Because XDECREF can recursively invoke operations on
this list, we make it empty first. */
i = Py_SIZE(a);
Py_SIZE(a) = 0;
a->ob_item = NULL;
a->allocated = 0;
while (--i >= 0) {
Py_XDECREF(item[i]);
}
PyMem_FREE(item);
}
/* Never fails; the return value can be ignored.
Note that there is no guarantee that the list is actually empty
at this point, because XDECREF may have populated it again! */
return 0;
}
_list_clear函数对container对象的每个元素进行引用数减量操作并释放container对象内存。
delete_garbage在对container对象进行clear操作之后,还会检查是否成功,如果该container对象没有从unreachable链表上摘除,表示container对象还不能销毁,需要放回到老一“代”中,并标记GC_REACHABLE。
[6] 统计finalizers链表数量。
[7] 处理finalizers链表的对象。
/* Handle uncollectable garbage (cycles with tp_del slots, and stuff reachable
* only from such cycles).
* If DEBUG_SAVEALL, all objects in finalizers are appended to the module
* garbage list (a Python list), else only the objects in finalizers with
* __del__ methods are appended to garbage. All objects in finalizers are
* merged into the old list regardless.
* Returns 0 if all OK, <0 on error (out of memory to grow the garbage list).
* The finalizers list is made empty on a successful return.
*/
static int
handle_legacy_finalizers(PyGC_Head *finalizers, PyGC_Head *old)
{
PyGC_Head *gc = finalizers->gc.gc_next;
if (_PyRuntime.gc.garbage == NULL) {
_PyRuntime.gc.garbage = PyList_New(0);
if (_PyRuntime.gc.garbage == NULL)
Py_FatalError("gc couldn't create gc.garbage list");
}
for (; gc != finalizers; gc = gc->gc.gc_next) {
PyObject *op = FROM_GC(gc);
if ((_PyRuntime.gc.debug & DEBUG_SAVEALL) || has_legacy_finalizer(op)) {
if (PyList_Append(_PyRuntime.gc.garbage, op) < 0)
return -1;
}
}
gc_list_merge(finalizers, old);
return 0;
}
遍历finalizers链表,将拥有finalizer的实例对象放到一个名为garbage的PyListObject对象中,可以通过gc模块查看。
>>> import gc
>>> gc.garbage
然后把finalizers链表晋升到老一“代”。
火前留名 = =
@yetingsky - -!
大佬大佬
太详细了
实现了__del__方法的对象,会被放入finalizers中,并在最后与old链表合并,这么说的话,这个对象并没有被gc回收,我在另外一篇文章中看到说在pep-0442 解决了这个问题。我去看了没看懂(泪目)但是就目前的源码来看并没有进行回收。
与old合并的话,如果old达到阈值触发gc,仍然是finalizers,这样不是死循环了吗,到底被触发GC回收呢? 望解答,谢谢!
@Panlq 并不是“实现了__del__方法的对象,会被放入finalizers中”,上面提到的是
拥有finalizer的实例对象指的就是实现了 tp_del 函数的对象。
在Python3.4 之前 __del__ 方法确实对应 tp_del 函数,3.4之后引入 tp_finalize 来表示 __del__。

而 pep0442 解决的是在3.4 之前,实现__del__ 的对象不会被回收的问题
[I] ➜ ipython
Python 3.3.7 (default, Jun 12 2020, 17:08:02)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: class A:
...: def __init__(self, name): self.name = name
...: def __del__(self): print('del', id(self))
...:
In [2]: a = A(None)
In [3]: b = A(a)
In [4]: a.name=b
In [5]: del a
In [6]: del b
In [7]: import gc
In [8]: gc.collect()
Out[8]: 21
In [9]: gc.garbage
Out[9]: [<__main__.A at 0x10985e810>, <__main__.A at 0x109860f10>]
3.4 之后:
In [1]: class A:
...: def __init__(self, name): self.name = name
...: def __del__(self): print('del', id(self))
...:
In [2]: a = A(None)
In [3]: b = A(a)
In [4]: a.name=b
In [5]: del a
In [6]: del b
In [7]: import gc
In [8]: gc.collect()
del 4409022448
del 4409023792
Out[8]: 32
In [9]: gc.garbage
Out[9]: []
@Panlq
与old合并的话,如果old达到阈值触发gc,仍然是finalizers,这样不是死循环了吗,到底被触发GC回收呢?
这个我没太明白你的描述,你是指触发gc之后,old代的count会一直超过threshold么? 如果是这个意思的话,那么需要知道的是count不是只表示数量(只有0代是),1代或2代的count表示的是回收次数,而且每次代垃圾回收的时候,count会置为0。
@Junnplus
1.
我自己也尝试了一下,确实是会回收,但在官方文档中声明了 当解释器退出时不会确保为仍然存在的对象调用 del() 方法
PS:之前开发中就是有一次在类中写了
__del__这个方法,在linux parallel 模式下,有时候会有留下一些僵尸进程,后来把__del__去掉后,就没有了。当时理解应该就是__del__没释放内存导致进程占用。
官方对__del__声明: del x 并不直接调用 x.del() --- 前者会将 x 的引用计数减一,而后者仅会在 x 的引用计数变为零时被调用。 根据上面运行结果我理解如下:
当实现__del__方法的对象之间存在循环引用的时候,当调用gc ,subtract 循环引用后,也就会调用对象各自的__del__方法了,所以打印了
> In [8]: gc.collect()
> del 4409022448
> del 4409023792
> Out[8]: 32
2. 拥有finalizer的实例对象指的就是实现了 tp_del 函数的对象。
这个是对的,但看代码注释的,实现了__del__方法的对象也是会被放入到finalizers的吧
/* Handle uncollectable garbage (cycles with tp_del slots, and stuff reachable
- only from such cycles).
- If DEBUG_SAVEALL, all objects in finalizers are appended to the module
- garbage list (a Python list), else only the objects in finalizers with
- del methods are appended to garbage. All objects in finalizers are
- merged into the old list regardless.
- Returns 0 if all OK, <0 on error (out of memory to grow the garbage list).
- The finalizers list is made empty on a successful return. 意思是finalizers会被释放并设置为empty数组? */ static int handle_legacy_finalizers(PyGC_Head *finalizers, PyGC_Head *old){}
- PyGC_Head finalizers; /* objects with, & reachable from, del */
@Panlq
与old合并的话,如果old达到阈值触发gc,仍然是finalizers,这样不是死循环了吗,到底被触发GC回收呢?
这个我没太明白你的描述,你是指触发gc之后,old代的count会一直超过threshold么? 如果是这个意思的话,那么需要知道的是count不是只表示数量(只有0代是),1代或2代的count表示的是回收次数,而且每次代垃圾回收的时候,count会置为0。
抱歉我没表达清楚,我的意思是finalizers这个链表实在什么时候被清除的,就您分析的流程来看,finalizers会一直与old合并。
PS: 交互模式下执行代码跟用IDE运行程序可能有不一样,在交互模式也会占用引用,在退出交互模式的时候也会调用__del__
@Panlq 我不明白你的困惑在哪里 - - 另外,上面摘取代码注释应该好几段混在一起了吧,可读性太差了。。。
当实现__del__方法的对象之间存在循环引用的时候,当调用gc ,subtract 循环引用后,也就会调用对象各自的__del__方法了,所以打印了
subtract_refs 处理循环引用之后,会将不可达对象收集到unreachable, 然后通过move_legacy_finalizers收集finalizers(实现tp_del的对象),move_legacy_finalizer_reachable 将finalizers 的引用对象也放到finalizers里面。finalizers最终会合并到old里面。
finalize_garbage则是执行不可达对象的 __del__( tp_finalize 函数),且保证只执行一次。
实现了__del__方法的对象也是会被放入到finalizers的吧
这个应该是错的。
/* Handle uncollectable garbage (cycles with tp_del slots, and stuff reachable
* only from such cycles).
* If DEBUG_SAVEALL, all objects in finalizers are appended to the module
* garbage list (a Python list), else only the objects in finalizers with
* __del__ methods are appended to garbage. All objects in finalizers are
* merged into the old list regardless.
*/
static void
handle_legacy_finalizers(PyGC_Head *finalizers, PyGC_Head *old)
注释是3.4之前的注释,你看下 has_legacy_finalizer 的实现就知道 - - 可以看下 pep442 的 pr https://github.com/python/cpython/commit/796564c27b8f2e32b9fbc034bbdda75f9507ca43#diff-a641cb0b9fbd1c5ca0c64ce06e2cbcbd
我的意思是finalizers这个链表实在什么时候被清除的
为什么finalizers需要清除?finalizers是一些实现了tp_del的函数,finalizers最后会被放到gc.garbage的列表中。 关于 garbage 的解释:
A list of objects which the collector found to be unreachable but could
not be freed (uncollectable objects). Starting with Python 3.4, this
list should be empty most of the time, except when using instances of
C extension types with a non-NULL ``tp_del`` slot.
可以看出来tp_del函数主要是为了兼容之前的代码。
我们注意到这样一个事实,如果两个对象的引用计数都为1,但是仅仅存在它们之间的循环引用,那么这两个对象都是需要被回收的
这一段不太明白。这只是逻辑层说明这种也是需要被清除么,还是垃圾回收有这个代码操作。我理解只要是跟对象不可达的对象就可以被清除吧。不需要做这个判断。
我们注意到这样一个事实,如果两个对象的引用计数都为1,但是仅仅存在它们之间的循环引用,那么这两个对象都是需要被回收的
这一段不太明白。这只是逻辑层说明这种也是需要被清除么,还是垃圾回收有这个代码操作。我理解只要是跟对象不可达的对象就可以被清除吧。不需要做这个判断。
这是逻辑上的说法,对象之间形成了循环引用,引用计数没办法清楚这些对象,所以需要引入标记清除