Memory usage increases after forking the process

Open gladtosee opened this issue 5 years ago • 6 comments

@WojciechMula Hi. I have been using pyahocorasick successfully, but I have run into a problem.

After a fork, minor page faults occur in the child process and its memory usage grows, because shared pages are being copied on write (https://en.wikipedia.org/wiki/Copy-on-write). I called gc.freeze() before forking, but the page faults still occurred (https://docs.python.org/3/library/gc.html#gc.freeze).

What should I do?
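The effect can be reproduced outside Python. Below is a minimal C sketch, assuming Linux or another POSIX system where getrusage() reports ru_minflt. The function name child_faults_after_fork and the page counts are hypothetical and chosen only for illustration. The parent fills a buffer and forks; the child then either reads or increments one long per page, where the increment stands in for a reference-count update.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minor page faults taken so far by the calling process, or -1 on error. */
static long minor_faults(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_minflt;
}

/*
 * Hypothetical helper: fork a child that touches one long in every page of
 * a buffer populated by the parent, and return the number of minor faults
 * the child took while doing so. With do_write == 0 the child only reads;
 * otherwise it increments the value, as a refcount update would.
 * Returns -1 on error.
 */
long child_faults_after_fork(size_t pages, int do_write)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    size_t size = pages * (size_t)page_size;
    char *region = malloc(size);
    if (region == NULL)
        return -1;
    memset(region, 1, size);            /* make every page resident */

    int fds[2];
    if (pipe(fds) != 0) {
        free(region);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        free(region);
        return -1;
    }

    if (pid == 0) {                     /* child */
        close(fds[0]);
        long before = minor_faults();
        volatile long sink = 0;
        for (size_t i = 0; i < pages; i++) {
            volatile long *header =
                (volatile long *)(region + i * (size_t)page_size);
            if (do_write)
                (*header)++;            /* write: the page must be copied */
            else
                sink += *header;        /* read: the page can stay shared */
        }
        long delta = minor_faults() - before;
        if (write(fds[1], &delta, sizeof delta) != (ssize_t)sizeof delta)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                      /* parent */
    long delta = -1;
    if (read(fds[0], &delta, sizeof delta) != (ssize_t)sizeof delta)
        delta = -1;
    close(fds[0]);
    waitpid(pid, NULL, 0);
    free(region);
    return delta;
}
```

Comparing child_faults_after_fork(1024, 0) with child_faults_after_fork(1024, 1) should typically show many more faults for the writing child, since every written page has to be copied. The exact numbers depend on the kernel, the allocator, and huge-page settings.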

I recorded minor faults with perf and got the following results:

perf record -e minor-faults -g -p PID

In trace_begin:
                                                            
In trace_end:                                               
                                                            
There is 386377 records in gen_events table                 
Statistics about the general events grouped by thread/symbol/dso: 
                                                            
                                                            
            comm   number        histogram                  
==========================================                  
          python   340070     ###################           
         python3    46307     ################              
                                                            
                          symbol   number        histogram  
==========================================================  
      automaton_search_iter_next   242330     ################## 
          automaton_build_output    30382     ############### 
                      do_mktuple    23836     ############### 
                 PyObject_Malloc    17587     ###############  
                     _int_malloc    14089     ##############
               trienode_get_next    10282     ##############
                 PyMember_GetOne    10084     ##############
        _PyEval_EvalFrameDefault     7855     ############# 
        lookdict_unicode_nodummy     5679     ############# 
                      do_mkvalue     5643     ############# 
                  dict_subscript     3694     ############  
                         collect     1668     ###########   
                    visit_decref     1440     ###########   
          PyObject_GetAttrString     1376     ###########   
                   PyMem_Realloc     1178     ###########   
                pymalloc_realloc      837     ##########    
                  bytearray_init      830     ##########    
                    PyMem_Malloc      728     ##########    
                   PyList_Append      632     ##########    
                  _PyList_Extend      627     ##########    
                   dict_traverse      626     ##########    
                  tupleiter_next      564     ##########    
              malloc_consolidate      412     #########     
       PyBytes_FromStringAndSize      375     #########     
                   List_iterNext      271     #########     
            stringlib_bytes_join      225     ########      
                         set_add      219     ########      
                     PyTuple_New      213     ########      
                    PyMem_Calloc      211     ########      
         Object_beginTypeContext      176     ########      
            _PyFrame_New_NoTrack      171     ########      
             PyObject_GC_UnTrack      128     ########      
                        dict_get      116     #######       
                       sysmalloc       99     #######       
                PyObject_GetAttr       81     #######       
                   PyObject_Free       71     #######       

gladtosee avatar May 20 '19 16:05 gladtosee

FYI:
https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172
https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf

gladtosee avatar May 20 '19 16:05 gladtosee

@gladtosee To be honest, I wasn't aware of this problem; you are the first to mention it. I need to learn a bit more about this issue. Thanks for the articles.

WojciechMula avatar May 20 '19 17:05 WojciechMula

@WojciechMula After the data is loaded in the master process, the child process increments the reference count of the stored objects, and that write causes copy-on-write. This happens because Py_BuildValue() with the "O" format increments the reference count of the object before automaton_build_output returns.

#define Py_INCREF(op) (                         \
    _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    ((PyObject *)(op))->ob_refcnt++)

If I modify the code like this, copy-on-write does not happen:

//copy from cpython source - https://github.com/python/cpython/blob/v3.7.3/Objects/unicodeobject.c#L2380
PyObject*
_PyUnicode_Copy(PyObject *unicode)
{
    Py_ssize_t length;
    PyObject *copy;
 
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1)
        return NULL;
 
    length = PyUnicode_GET_LENGTH(unicode);
    copy = PyUnicode_New(length, PyUnicode_MAX_CHAR_VALUE(unicode));
    if (!copy)
        return NULL;
    assert(PyUnicode_KIND(copy) == PyUnicode_KIND(unicode));
 
    memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode),
           length * PyUnicode_KIND(unicode));
//    assert(_PyUnicode_CheckConsistency(copy, 1));
    return copy;
}

static int automaton_build_output(PyObject* self, PyObject** result);

case STORE_ANY:
    if(PyUnicode_Check(node->output.object)) {
        //N: Same as O, except it doesn’t increment the reference count on the object.
        *result = F(Py_BuildValue)("iN", idx, _PyUnicode_Copy(node->output.object));
    }
    else {
        *result = F(Py_BuildValue)("iO", idx, node->output.object);
    }
    return OutputValue;
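The difference between the two branches can be seen in a toy model. This is a minimal sketch in plain C, not CPython code: ToyStr, toystr_new, toystr_decref, output_shared, and output_copied are hypothetical names invented for illustration. The sketch assumes only that, as in PyObject, the reference count is stored in the same allocation as the payload, so taking a reference is a write to the object's memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a CPython object: the reference count lives in the
   same allocation as the payload, as ob_refcnt does in PyObject. */
typedef struct {
    long refcnt;
    size_t length;
    char data[];            /* flexible array member holding the payload */
} ToyStr;

/* Allocate a new object with a reference count of 1. */
ToyStr *toystr_new(const char *text)
{
    assert(text != NULL);
    size_t length = strlen(text);
    ToyStr *s = malloc(sizeof *s + length + 1);
    if (s == NULL)
        return NULL;
    s->refcnt = 1;
    s->length = length;
    memcpy(s->data, text, length + 1);
    return s;
}

/* Drop one reference and free the object when none remain. */
void toystr_decref(ToyStr *s)
{
    if (s != NULL && --s->refcnt == 0)
        free(s);
}

/* Analogue of the "O" format: hand out the stored object itself.
   This WRITES to the stored object's header, so in a forked child the
   page holding it would have to be copied. */
ToyStr *output_shared(ToyStr *stored)
{
    stored->refcnt++;
    return stored;
}

/* Analogue of "N" with a fresh copy: the stored object is only READ,
   and the caller owns the new object outright. */
ToyStr *output_copied(const ToyStr *stored)
{
    return toystr_new(stored->data);
}
```

The copy trades one extra allocation and memcpy per returned value for leaving the stored object's memory untouched. Whether that trade is worthwhile depends on how large the stored values are and how often matches are returned.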

gladtosee avatar May 30 '19 10:05 gladtosee

@gladtosee Could you please provide a patch for this?

WojciechMula avatar Oct 28 '19 18:10 WojciechMula

@gladtosee I tried to apply your code and reinstall, but I get an error: Symbol not found: _PyUnicode_DATA. Could you give more details about how you solved the problem?

yuanchaofa avatar Jun 28 '20 12:06 yuanchaofa

@yuanchaofa Would you mind providing a PR or patch? It would be much easier to review. Thanks!

pombredanne avatar Feb 20 '22 11:02 pombredanne