jieba
In Paddle mode, the program hangs after a large number of segmentation calls
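Description: I am using Paddle mode to segment a dataset of more than 700,000 records. The program hangs when it reaches record 559987 (record 551421 on another machine) and never continues.
Attempted: running the job in parallel with joblib; the problem still occurs.
Code: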
import jieba
from joblib import Parallel, delayed


def cut(word, stop_words=None, index=-1, remind=1000):
    # print(jieba.check_paddle_install['is_paddle_installed'])
    # Enable Paddle mode if it has not been enabled in this process yet.
    if not jieba.check_paddle_install['is_paddle_installed']:
        jieba.enable_paddle()
    try:
        if index == 559987:
            # Print the record at which the program hangs.
            print('*******' + word)
        if index % remind == 0:
            print('Read {} records'.format(index))
        seg_list = jieba.cut(word, use_paddle=False)
        if stop_words is None:
            return ' '.join(seg_list)
        # Drop stop words and any token containing 'x'.
        cut_word = ''
        for w in seg_list:
            if w not in stop_words and 'x' not in w:
                cut_word += w
                cut_word += ' '
        return cut_word.strip(' ')
    except Exception:
        return ''


def main():
    x = Parallel(n_jobs=7, backend="multiprocessing")(
        delayed(cut)(raw_content[i], stop_words, i) for i in range(len(raw_content)))
    # ....... (more code omitted)
Error when running single-threaded:
W1129 12:23:37.086438 21938 init.cc:205] *** Aborted at 1606652617 (unix time) try "date -d @1606652617" if you are using GNU date ***
W1129 12:23:37.087587 21938 init.cc:205] PC: @ 0x0 (unknown)
W1129 12:23:37.087682 21938 init.cc:205] *** SIGFPE (@0x7fe71dc82988) received by PID 21938 (TID 0x7fe757347740) from PID 499657096; stack trace: ***
W1129 12:23:37.088780 21938 init.cc:205] @ 0x7fe7576b83c0 (unknown)
W1129 12:23:37.093000 21938 init.cc:205] @ 0x7fe71dc82988 paddle::operators::math::RowwiseAdd<>::operator()()
W1129 12:23:37.096590 21938 init.cc:205] @ 0x7fe71d437a4e paddle::operators::GRUCPUKernel<>::BatchCompute()
W1129 12:23:37.098675 21938 init.cc:205] @ 0x7fe71d4383b0 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm0EINS0_9operators12GRUCPUKernelIfEENSA_IdEEEEclEPKcSF_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W1129 12:23:37.102587 21938 init.cc:205] @ 0x7fe71dc4e51d paddle::framework::OperatorWithKernel::RunImpl()
W1129 12:23:37.105298 21938 init.cc:205] @ 0x7fe71dc4eecb paddle::framework::OperatorWithKernel::RunImpl()
W1129 12:23:37.109793 21938 init.cc:205] @ 0x7fe71dc4947a paddle::framework::OperatorBase::Run()
W1129 12:23:37.112777 21938 init.cc:205] @ 0x7fe71ca9f79e paddle::framework::Executor::RunPreparedContext()
W1129 12:23:37.112872 21938 init.cc:205] @ 0x7fe71c8fedd1 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL24pybind11_init_core_noavxERNS_6moduleEEUlRNS2_9framework8ExecutorEPNS6_22ExecutorPrepareContextEPNS6_5ScopeEbbbE101_vIS8_SA_SC_bbbEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESU_
W1129 12:23:37.114048 21938 init.cc:205] @ 0x7fe71c9518a4 pybind11::cpp_function::dispatcher()
W1129 12:23:37.114703 21938 init.cc:205] @ 0x557d19cbd914 _PyMethodDef_RawFastCallKeywords
W1129 12:23:37.115320 21938 init.cc:205] @ 0x557d19cbda31 _PyCFunction_FastCallKeywords
W1129 12:23:37.115932 21938 init.cc:205] @ 0x557d19d2a39e _PyEval_EvalFrameDefault
W1129 12:23:37.116493 21938 init.cc:205] @ 0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.117017 21938 init.cc:205] @ 0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.117674 21938 init.cc:205] @ 0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.118238 21938 init.cc:205] @ 0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.118762 21938 init.cc:205] @ 0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.119379 21938 init.cc:205] @ 0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.119940 21938 init.cc:205] @ 0x557d19c6c829 _PyEval_EvalCodeWithName
W1129 12:23:37.120467 21938 init.cc:205] @ 0x557d19cbd107 _PyFunction_FastCallKeywords
W1129 12:23:37.121073 21938 init.cc:205] @ 0x557d19d26585 _PyEval_EvalFrameDefault
W1129 12:23:37.121629 21938 init.cc:205] @ 0x557d19cbce7b _PyFunction_FastCallKeywords
W1129 12:23:37.122239 21938 init.cc:205] @ 0x557d19d29b29 _PyEval_EvalFrameDefault
W1129 12:23:37.122547 21938 init.cc:205] @ 0x557d19cc4fb4 gen_send_ex
W1129 12:23:37.123116 21938 init.cc:205] @ 0x557d19c6c164 _PyList_Extend
W1129 12:23:37.123739 21938 init.cc:205] @ 0x557d19c6c3f2 PySequence_List
W1129 12:23:37.124413 21938 init.cc:205] @ 0x557d19c6c464 PySequence_Fast
W1129 12:23:37.124961 21938 init.cc:205] @ 0x557d19c6c4a9 PyUnicode_Join
W1129 12:23:37.125635 21938 init.cc:205] @ 0x557d19cbd72d _PyMethodDef_RawFastCallKeywords
W1129 12:23:37.126222 21938 init.cc:205] @ 0x557d19cc47af _PyMethodDescr_FastCallKeywords
W1129 12:23:37.126834 21938 init.cc:205] @ 0x557d19d29c7c _PyEval_EvalFrameDefault
[1] + 21938 floating point exception (core dumped) nohup python3 -u main.py > log.txt 2>&1
Has this been resolved? I am running into the same problem.
Could someone explain what is going on here?
The cause is probably empty records in the data, such as ' ' or '\n'.
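If blank records really are the trigger, one way to test the idea is to skip them before they reach the segmenter. The following is only a minimal sketch: `safe_cut` and its `segment` parameter are hypothetical names, not part of jieba, and this thread does not confirm that blank input is the actual cause.

```python
def safe_cut(text, segment, stop_words=frozenset()):
    """Segment `text` with the callable `segment`, skipping blank input.

    `segment` is any function that takes a string and returns an iterable
    of tokens (hypothetical parameter, used so the guard is easy to test).
    """
    # Return early for None, '', ' ', '\n' and other whitespace-only input,
    # so the segmenter is never called on an empty record.
    if text is None or not text.strip():
        return ''
    # Also drop whitespace-only tokens and stop words from the output.
    tokens = [t for t in segment(text) if t.strip() and t not in stop_words]
    return ' '.join(tokens)
```

With jieba, this could be called as `safe_cut(text, lambda s: jieba.cut(s, use_paddle=True))`. If the hang disappears with the guard in place, that supports the empty-record explanation.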
floating point exception
This looks like a numerical problem somewhere.
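Because the SIGFPE is raised inside native Paddle code, the `try`/`except` in `cut` cannot catch it; the process is killed instead. To narrow down which record triggers it, a rough diagnostic sketch is to enable Python's standard `faulthandler` module and log each record just before segmenting it. The helper names below are hypothetical, and logging every record is only meant for a debugging run.

```python
import faulthandler
import sys


def enable_crash_traces(stream=sys.stderr):
    """Ask Python to dump its own stack trace on fatal signals (SIGFPE etc.).

    `stream` must be a real file object with a file descriptor.
    """
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


def cut_with_logging(index, text, segment, log=sys.stderr):
    """Log the record, then segment it with the callable `segment`."""
    # Flush immediately so the line survives an abrupt process death;
    # repr() makes blank input such as ' ' or '\n' visible in the log.
    print('segmenting record {}: {!r}'.format(index, text), file=log, flush=True)
    return ' '.join(segment(text))
```

In a single-process run, the last logged line before the crash would then identify the record that was being segmented, and its `repr` shows whether it is blank.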