Clinical-Trial-Parser Training the NER on MacOS results in segfault

On MacOS Mojave and Catalina, training the NER models results in a segfault during the training of epoch 1. I'm using the command as advertised pytext train < src/resources/config/ner.json, from the root of the repo. The training is running in CPU mode only.

I created a dockerfile to run the same operation, and that one doesn't fail.

Below is the backtrace of the core dump, in case it helps (kept only the frames involving other libraries than python itself, there's an extra 81 frames)

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x000000011da7bcf3 libtorch.dylib`void c10::function_ref<void (char**, long long const*, long long)>::callback_fn<void at::native::(anonymous namespace)::cpu_kernel_vec<at::native::(anonymous namespace)::div_kernel(at::Ten
sorIterator&)::$_7::operator()() const::'lambda0'()::operator()() const::'lambda'(float, float), at::native::(anonymous namespace)::div_kernel(at::TensorIterator&)::$_7::operator()() const::'lambda0'()::operator()() const::'lambda'(at::
vec256::(anonymous namespace)::Vec256<float>, at::vec256::(anonymous namespace)::Vec256<float>)>(at::TensorIterator&, at::native::(anonymous namespace)::div_kernel(at::TensorIterator&)::$_7::operator()() const::'lambda0'()::operator()()
 const::'lambda'(float, float), at::native::(anonymous namespace)::div_kernel(at::TensorIterator&)::$_7::operator()() const::'lambda0'()::operator()() const::'lambda'(at::vec256::(anonymous namespace)::Vec256<float>, at::vec256::(anonym
ous namespace)::Vec256<float>))::'lambda'(char**, long long const*, long long)>(long, char**, long long const*, long long) + 499
    frame #1: 0x000000011ce89925 libtorch.dylib`void c10::function_ref<void (char**, long long const*, long long, long long)>::callback_fn<at::TensorIterator::for_each(c10::function_ref<void (char**, long long const*, long long)>)::$_5>
(long, char**, long long const*, long long, long long) + 373
    frame #2: 0x000000011ce80822 libtorch.dylib`at::TensorIterator::serial_for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, at::Range) const + 370
    frame #3: 0x000000011ce805be libtorch.dylib`at::TensorIterator::for_each(c10::function_ref<void (char**, long long const*, long long)>) + 222
    frame #4: 0x000000011da62ec8 libtorch.dylib`at::native::(anonymous namespace)::div_kernel(at::TensorIterator&) + 568
    frame #5: 0x000000011cb77bbc libtorch.dylib`at::native::div(at::Tensor const&, at::Tensor const&) + 124
    frame #6: 0x000000011d096cb0 libtorch.dylib`at::CPUType::(anonymous namespace)::div(at::Tensor const&, at::Tensor const&) + 112
    frame #7: 0x000000011d0b48d5 libtorch.dylib`c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor
const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) + 21
    frame #8: 0x000000011cb400d9 libtorch.dylib`at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(at::Tensor const&, at::Tensor const&) const + 57
    frame #9: 0x000000011cb40023 libtorch.dylib`std::__1::result_of<at::Tensor (ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std
::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>::type c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocat
or<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<at::Tensor c10::Dispatcher::doCallUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTy
peId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&) const::'
lambda'(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>(at::Ten
sor&&) const + 179
    frame #10: 0x000000011cb3ff08 libtorch.dylib`std::__1::result_of<at::Tensor (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at::Tenso
r const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::'lambda'(c10::DispatchTable const&)>(at::Tensor&&) const + 88
    frame #11: 0x000000011f4708bb libtorch.dylib`torch::autograd::VariableType::(anonymous namespace)::div(at::Tensor const&, at::Tensor const&) + 2539
    frame #12: 0x000000011d0b48d5 libtorch.dylib`c10::detail::wrap_kernel_functor_unboxed_<c10::detail::WrapRuntimeKernelFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor
 const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, at::Tensor const&, at::Tensor const&) + 21
    frame #13: 0x0000000118894148 libtorch_python.dylib`at::Tensor c10::KernelFunction::callUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(at::Tensor const&, at::Tensor const&) const + 216
    frame #14: 0x0000000118893ff3 libtorch_python.dylib`std::__1::result_of<at::Tensor (ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::alloc
ator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>::type c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1:
:allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<at::Tensor c10::Dispatcher::doCallUnboxed<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::
TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, at::Tensor const&)
const::'lambda'(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::__1::hash<c10::TensorTypeId>, std::__1::equal_to<c10::TensorTypeId>, std::__1::allocator<std::__1::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>
(at::Tensor&&) const + 179
    frame #15: 0x0000000118893ed8 libtorch_python.dylib`std::__1::result_of<at::Tensor (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<at::Tensor c10::Dispatcher::callUnboxed<at::Tensor, at::Tensor const&, at
::Tensor const&>(c10::OperatorHandle const&, at::Tensor const&, at::Tensor const&) const::'lambda'(c10::DispatchTable const&)>(at::Tensor&&) const + 88
    frame #16: 0x00000001188a5d19 libtorch_python.dylib`at::Tensor::div(at::Tensor const&) const + 329
    frame #17: 0x0000000118778bd8 libtorch_python.dylib`torch::autograd::THPVariable_div(_object*, _object*, _object*) + 568
    frame #18: 0x00000001186e60c9 libtorch_python.dylib`_object* torch::autograd::TypeError_to_NotImplemented_<&(torch::autograd::THPVariable_div(_object*, _object*, _object*))>(_object*, _object*, _object*) + 9
    frame #19: 0x000000010fe4caa0 python`_PyMethodDef_RawFastCallDict + 576

May 11 '20 17:05 guneemwelloeux

Thank you for pointing out the segmentation fault. I'm not sure why we get it. The following should work on macOS:

Install Anaconda3 from https://docs.anaconda.com/anaconda/install/mac-os/
Install PyText: pip install pytext-nlp
Run pytext train < src/resources/config/ner.json and the segmentation fault should go away

May 11 '20 20:05 salkola

I opened an issue with the PyText team.

The segmentation fault may be related to the version mismatch between pytext and onnx/torch.

If Anaconda3 by itself does not help, you may have to upgrade onnx and torch:

conda install onnx -c conda-forge
conda install pytorch torchvision -c pytorch

May 21 '20 23:05 salkola

Same issue on Ubuntu 20.04 LTS, Python 3.8.3

Aug 27 '20 19:08 wbollock

Clinical-Trial-Parser Clinical-Trial-Parser copied to clipboard

Training the NER on MacOS results in segfault

Clinical-Trial-Parser
Clinical-Trial-Parser copied to clipboard