
In Paddle mode, the program hangs after a large number of segmentation calls

Open ShimmeringLight opened this issue 4 years ago • 5 comments

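Description: I am segmenting a dataset of more than 700,000 records with Paddle. When the run reaches record 559987 (551421 on another machine), the program hangs and makes no further progress. Tried: running the work in parallel with joblib; the problem still occurs. Code: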

import jieba
from joblib import Parallel, delayed


def cut(word, stop_words=None, index=-1, remind=1000):
    # Enable Paddle mode once per process if it is not active yet.
    if not jieba.check_paddle_install['is_paddle_installed']:
        jieba.enable_paddle()
    try:
        # Print the record at which the run was observed to stall.
        if index == 559987:
            print('*******' + word)
        # Report progress every `remind` records.
        if index % remind == 0:
            print('Read {} records'.format(index))
        seg_list = jieba.cut(word, use_paddle=False)
        if stop_words is None:
            return ' '.join(seg_list)
        # Drop stop words and any token containing 'x'.
        kept = [w for w in seg_list if w not in stop_words and 'x' not in w]
        return ' '.join(kept).strip(' ')
    except Exception:
        return ''


def main():
    x = Parallel(n_jobs=7, backend="multiprocessing")(
        delayed(cut)(raw_content[i], stop_words, i) for i in range(len(raw_content)))
    # ....... (more code omitted)

ShimmeringLight, Nov 29 '20 11:11

When run single-threaded, it fails with this error:

W1129 12:23:37.086438 21938 init.cc:205] *** Aborted at 1606652617 (unix time) try "date -d @1606652617" if you are using GNU date ***
W1129 12:23:37.087587 21938 init.cc:205] PC: @                0x0 (unknown)
W1129 12:23:37.087682 21938 init.cc:205] *** SIGFPE (@0x7fe71dc82988) received by PID 21938 (TID 0x7fe757347740) from PID 499657096; stack trace: ***
W1129 12:23:37.088780 21938 init.cc:205]     @     0x7fe7576b83c0 (unknown)
W1129 12:23:37.093000 21938 init.cc:205]     @     0x7fe71dc82988 paddle::operators::math::RowwiseAdd<>::operator()()
W1129 12:23:37.096590 21938 init.cc:205]     @     0x7fe71d437a4e paddle::operators::GRUCPUKernel<>::BatchCompute()
W1129 12:23:37.098675 21938 init.cc:205]     @     0x7fe71d4383b0 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm0EINS0_9operators12GRUCPUKernelIfEENSA_IdEEEEclEPKcSF_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W1129 12:23:37.102587 21938 init.cc:205]     @     0x7fe71dc4e51d paddle::framework::OperatorWithKernel::RunImpl()
W1129 12:23:37.105298 21938 init.cc:205]     @     0x7fe71dc4eecb paddle::framework::OperatorWithKernel::RunImpl()
W1129 12:23:37.109793 21938 init.cc:205]     @     0x7fe71dc4947a paddle::framework::OperatorBase::Run()
W1129 12:23:37.112777 21938 init.cc:205]     @     0x7fe71ca9f79e paddle::framework::Executor::RunPreparedContext()
W1129 12:23:37.112872 21938 init.cc:205]     @     0x7fe71c8fedd1 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL24pybind11_init_core_noavxERNS_6moduleEEUlRNS2_9framework8ExecutorEPNS6_22ExecutorPrepareContextEPNS6_5ScopeEbbbE101_vIS8_SA_SC_bbbEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESU_
W1129 12:23:37.114048 21938 init.cc:205]     @     0x7fe71c9518a4 pybind11::cpp_function::dispatcher()
W1129 12:23:37.114703 21938 init.cc:205]     @     0x557d19cbd914 _PyMethodDef_RawFastCallKeywords
W1129 12:23:37.115320 21938 init.cc:205]     @     0x557d19cbda31 _PyCFunction_FastCallKeywords
W1129 12:23:37.115932 21938 init.cc:205]     @     0x557d19d2a39e _PyEval_EvalFrameDefault
W1129 12:23:37.116493 21938 init.cc:205]     @     0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.117017 21938 init.cc:205]     @     0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.117674 21938 init.cc:205]     @     0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.118238 21938 init.cc:205]     @     0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.118762 21938 init.cc:205]     @     0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.119379 21938 init.cc:205]     @     0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.119940 21938 init.cc:205]     @     0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.120467 21938 init.cc:205]     @     0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.121073 21938 init.cc:205]     @     0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.121629 21938 init.cc:205]     @     0x557d19cbce7b _PyFunction_FastCallKeywords
W1129 12:23:37.122239 21938 init.cc:205]     @     0x557d19d29b29 _PyEval_EvalFrameDefault
W1129 12:23:37.122547 21938 init.cc:205]     @     0x557d19cc4fb4 gen_send_ex
W1129 12:23:37.123116 21938 init.cc:205]     @     0x557d19c6c164 _PyList_Extend
W1129 12:23:37.123739 21938 init.cc:205]     @     0x557d19c6c3f2 PySequence_List
W1129 12:23:37.124413 21938 init.cc:205]     @     0x557d19c6c464 PySequence_Fast
W1129 12:23:37.124961 21938 init.cc:205]     @     0x557d19c6c4a9 PyUnicode_Join
W1129 12:23:37.125635 21938 init.cc:205]     @     0x557d19cbd72d _PyMethodDef_RawFastCallKeywords
W1129 12:23:37.126222 21938 init.cc:205]     @     0x557d19cc47af _PyMethodDescr_FastCallKeywords
W1129 12:23:37.126834 21938 init.cc:205]     @     0x557d19d29c7c _PyEval_EvalFrameDefault
[1]  + 21938 floating point exception (core dumped)  nohup python3 -u main.py > log.txt 2>&1

ShimmeringLight, Nov 29 '20 12:11
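A SIGFPE raised inside native code terminates the Python process, so the `try`/`except` in `cut` never sees it. One way to find the record that triggers the crash is to persist the current index and text before every call, so that the marker file still names the offending input after the process dies. The following is only a sketch: `run_with_marker`, `process`, and `marker_path` are hypothetical names, and `str.upper` stands in for the real segmentation call so the example runs without jieba.

```python
import os
import tempfile


def run_with_marker(items, process, marker_path):
    """Apply `process` to each item, recording the item about to be processed.

    If the process is killed by a signal, the marker file still holds the
    index and repr of the last item that was handed to `process`.
    """
    results = []
    for i, item in enumerate(items):
        with open(marker_path, 'w', encoding='utf-8') as f:
            f.write('{}\t{!r}\n'.format(i, item))
            f.flush()
            os.fsync(f.fileno())  # make sure the marker reaches the disk
        results.append(process(item))
    return results


if __name__ == '__main__':
    marker = os.path.join(tempfile.mkdtemp(), 'last_item.txt')
    # str.upper is a stand-in for the real call, e.g. a wrapper around jieba.cut.
    print(run_with_marker(['first', 'second'], str.upper, marker))
    with open(marker, encoding='utf-8') as f:
        print(f.read().strip())
```

Writing and syncing a file on every record is slow, so on a 700,000-record dataset it may be more practical to narrow the range first and enable the marker only around the index where the run stalls.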

Has this been resolved? I am running into the same problem.

xixy, May 10 '21 07:05

Has this been resolved? I am running into the same problem.

Could someone explain what is going on here?

dxu-slc, Mar 30 '22 02:03

The cause is probably empty entries in the data, such as ' ' or '\n'.

dxu-slc, Mar 30 '22 02:03
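If blank input is indeed the trigger, which the thread does not confirm, a guard that skips empty and whitespace-only strings before they reach the segmenter would avoid it. This is a minimal sketch: `is_blank` and `safe_cut` are hypothetical helpers, and `segment` is whatever callable does the real work (for example a wrapper around `jieba.cut`).

```python
def is_blank(text):
    """Return True for None, empty strings, and whitespace-only strings."""
    return text is None or not str(text).strip()


def safe_cut(text, segment):
    """Segment `text` with `segment`, returning '' for blank input.

    `segment` is any callable that returns an iterable of tokens,
    e.g. lambda s: jieba.cut(s, use_paddle=True).
    """
    if is_blank(text):
        return ''  # never hand blank input to the segmenter
    return ' '.join(segment(text))


if __name__ == '__main__':
    samples = ['hello world', ' ', '\n', '', None]
    # str.split stands in for jieba.cut so the sketch runs without jieba.
    print([safe_cut(s, str.split) for s in samples])
```

Returning an empty string keeps the output aligned with the input, so result `i` still corresponds to record `i`.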

The floating point exception looks like a numerical problem somewhere.

shouldsee, Apr 03 '22 15:04
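To see which Python line was executing when the signal arrived, the standard-library `faulthandler` module can dump the Python traceback on SIGFPE (as well as SIGSEGV, SIGABRT, SIGBUS, and SIGILL). The sketch below assumes a hypothetical helper name, `enable_crash_traceback`, and a hypothetical log path. Paddle installs its own signal handler, as the log above shows, so whether this adds information in this particular setup is untested.

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_path):
    """Dump Python tracebacks for fatal signals into `log_path`.

    The returned file object must stay open for as long as the handler
    is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, 'w')
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'crash_traceback.log')
    handle = enable_crash_traceback(path)
    print('faulthandler enabled:', faulthandler.is_enabled())
    # ... run the segmentation here ...
    faulthandler.disable()
    handle.close()
```

The same effect is available without code changes by starting the interpreter with `python3 -X faulthandler main.py`, which writes the traceback to stderr.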