pyahocorasick
Memory usage increases after forking the process
@WojciechMula Hi, I have been using pyahocorasick successfully, but I have run into a problem.
Minor page faults occur in the child process after forking, and the resulting copy-on-write increases its memory usage (https://en.wikipedia.org/wiki/Copy-on-write). I called gc.freeze() before forking, but the page faults still occur (https://docs.python.org/3/library/gc.html#gc.freeze).
What should I do?
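For context, the setup described here can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: `fork_with_frozen_heap` and `child_work` are hypothetical names, and a plain dict stands in for the automaton. Note that gc.freeze() only keeps the garbage collector away from the frozen objects; reference-count updates still write into them, so read-only work in the child can still dirty shared pages.

```python
import gc
import os


def fork_with_frozen_heap(child_work):
    """Freeze the GC, fork, run child_work() in the child, return its exit code.

    Hypothetical helper for illustration only (POSIX-only because of os.fork).
    """
    gc.collect()   # drop garbage first so it is not frozen along with live data
    gc.freeze()    # move all tracked objects to the permanent generation
    pid = os.fork()
    if pid == 0:
        # Child process: report success or failure through the exit code.
        status = 1
        try:
            child_work()
            status = 0
        finally:
            os._exit(status)  # skip parent cleanup handlers in the child
    # Parent process: wait for the child and decode its exit code.
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)


if __name__ == "__main__":
    shared = {"keyword": "payload"}  # stand-in for data loaded before forking
    print(fork_with_frozen_heap(lambda: shared["keyword"]))
```

The parent stays frozen after the call; gc.unfreeze() can be called later if the parent should resume collecting those objects.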
I recorded the minor faults with perf and got the following results:
perf record -e minor-faults -g -p PID
In trace_begin:
In trace_end:
There is 386377 records in gen_events table
Statistics about the general events grouped by thread/symbol/dso:
comm number histogram
==========================================
python 340070 ###################
python3 46307 ################
symbol number histogram
==========================================================
automaton_search_iter_next 242330 ##################
automaton_build_output 30382 ###############
do_mktuple 23836 ###############
PyObject_Malloc 17587 ###############
_int_malloc 14089 ##############
trienode_get_next 10282 ##############
PyMember_GetOne 10084 ##############
_PyEval_EvalFrameDefault 7855 #############
lookdict_unicode_nodummy 5679 #############
do_mkvalue 5643 #############
dict_subscript 3694 ############
collect 1668 ###########
visit_decref 1440 ###########
PyObject_GetAttrString 1376 ###########
PyMem_Realloc 1178 ###########
pymalloc_realloc 837 ##########
bytearray_init 830 ##########
PyMem_Malloc 728 ##########
PyList_Append 632 ##########
_PyList_Extend 627 ##########
dict_traverse 626 ##########
tupleiter_next 564 ##########
malloc_consolidate 412 #########
PyBytes_FromStringAndSize 375 #########
List_iterNext 271 #########
stringlib_bytes_join 225 ########
set_add 219 ########
PyTuple_New 213 ########
PyMem_Calloc 211 ########
Object_beginTypeContext 176 ########
_PyFrame_New_NoTrack 171 ########
PyObject_GC_UnTrack 128 ########
dict_get 116 #######
sysmalloc 99 #######
PyObject_GetAttr 81 #######
PyObject_Free 71 #######
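As a complement to the perf statistics, the growth caused by these faults can also be watched from inside the child by reading the Private_Dirty figure that Linux exposes under /proc. The sketch below assumes a Linux kernel that provides /proc/<pid>/smaps_rollup; the function names are hypothetical and not part of pyahocorasick.

```python
def parse_private_dirty_kb(smaps_text):
    """Sum every Private_Dirty entry (in kB) found in smaps-style text."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Private_Dirty:"):
            # Lines look like: "Private_Dirty:       123 kB"
            total += int(line.split()[1])
    return total


def private_dirty_kb(pid="self"):
    """Read the private dirty memory of a process (assumes Linux with smaps_rollup)."""
    with open(f"/proc/{pid}/smaps_rollup") as fh:
        return parse_private_dirty_kb(fh.read())
```

Calling private_dirty_kb() in the child before and after a batch of searches gives a rough measure of how many shared pages were copied in between.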
FYI:
https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172
https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf
@gladtosee To be honest, I wasn't aware of this problem; you are the first to mention it. I need to learn a bit more about this issue. Thanks for these articles.
@WojciechMula After the master process loads the data, each search in the child process increments the reference count of the stored objects, and that write triggers copy-on-write. This happens because Py_BuildValue() increments the reference count before automaton_build_output returns.
#define Py_INCREF(op) (                         \
    _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    ((PyObject *)(op))->ob_refcnt++)
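The same effect can be observed from Python itself. The small sketch below (with a hypothetical helper name) wraps an existing object in a new tuple, much as Py_BuildValue("iO", ...) does, and shows that the object's reference count goes up. That count lives in the object's header, so in a forked child the update is a write to a page that was shared with the parent.

```python
import sys


def refcount_delta_when_wrapped(obj):
    """Return how much obj's refcount changes when it is put into a new tuple."""
    before = sys.getrefcount(obj)
    wrapped = (0, obj)  # analogous to Py_BuildValue("iO", idx, obj)
    after = sys.getrefcount(obj)
    return after - before, wrapped


if __name__ == "__main__":
    # Build the string at run time so it is an ordinary heap object.
    payload = "stored-value-" + str(12345)
    delta, _ = refcount_delta_when_wrapped(payload)
    print(delta)
```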
If I modify the code like this, copy-on-write does not happen:
// copied from the CPython source - https://github.com/python/cpython/blob/v3.7.3/Objects/unicodeobject.c#L2380
PyObject*
_PyUnicode_Copy(PyObject *unicode)
{
    Py_ssize_t length;
    PyObject *copy;

    if (!PyUnicode_Check(unicode)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1)
        return NULL;

    length = PyUnicode_GET_LENGTH(unicode);
    copy = PyUnicode_New(length, PyUnicode_MAX_CHAR_VALUE(unicode));
    if (!copy)
        return NULL;
    assert(PyUnicode_KIND(copy) == PyUnicode_KIND(unicode));

    memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode),
           length * PyUnicode_KIND(unicode));
    // assert(_PyUnicode_CheckConsistency(copy, 1));
    return copy;
}

static int automaton_build_output(PyObject* self, PyObject** result);

    case STORE_ANY:
        if (PyUnicode_Check(node->output.object)) {
            // N: Same as O, except it doesn’t increment the reference count on the object.
            *result = F(Py_BuildValue)("iN", idx, _PyUnicode_Copy(node->output.object));
        }
        else {
            *result = F(Py_BuildValue)("iO", idx, node->output.object);
        }
        return OutputValue;
@gladtosee Could you please provide a patch for this?
@gladtosee I tried to use your code and reinstall, but I get an error: Symbol not found: _PyUnicode_DATA. Could you give more details about how you solved the problem?
@yuanchaofa Would you mind providing a PR or a patch? It would be much easier to review. Thanks!