NVTabular
[BUG] nvtabular is sensitive to the order of imports of nvtabular and horovod modules
We are trying to run the nvtabular vs. petastorm demo in the Microsoft Synapse dev environment. During the demo installation, @eordentlich reported the problem: "Seems the notebook(s) is very sensitive to when import nvtabular takes place. There are some side effects of this import that we didn't get to fully test in the single node azure set up we had. The protobuf error is one of them. I think this is a bug in nvtabular, personally, but can be worked around in the notebook by placing the above import carefully. Databricks uses an older version of nvtabular which doesn't seem to have these problems (or maybe we got lucky)." There are some additional import order issues that have popped up here, involving horovod and numba. Users should not have to find error-free import orders in these cases, nor is it obvious that import reordering can resolve the errors.
Steps/Code to reproduce bug We didn't try it outside of the Synapse dev environment. In that environment, open eordentlich/train-pytorch or train-tensorflow and move import nvtabular relative to the horovod or numba imports until errors appear.
Expected behavior Users should not have to find error free import orders in these cases
Environment details (please complete the following information):
- Azure NV-WWFO subscription, nvtabular-synapse-demo Synapse workspace
- Method of NVTabular install: mix of conda and source installation. Part of the installation was performed by MSFT without any feedback
Additional context No additional context about the problem here.
Have you finished any investigation? This is blocking us from moving nvtabular into a production environment.
One instance of this:
import horovod.tensorflow as hvd
from nvtabular.loader.tensorflow import KerasSequenceLoader
from nvtabular import Dataset
train_ds = KerasSequenceLoader(…
triggers the error free(): invalid pointer, but:
from nvtabular.loader.tensorflow import KerasSequenceLoader
import horovod.tensorflow as hvd
from nvtabular import Dataset
train_ds = KerasSequenceLoader(…
does not.
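A minimal sketch for comparing import orders like the two above, assuming each order is run in a fresh interpreter so that a crash in one cannot affect the next. The helper name try_import_order is hypothetical, and whether bare imports are enough to trigger the crash (rather than the later KerasSequenceLoader call) may depend on the environment.

```python
import subprocess
import sys


def try_import_order(modules, timeout=300):
    """Import `modules` in the given order in a fresh interpreter.

    Returns (returncode, stderr). On Linux a negative return code means the
    child was killed by a signal (for example -6 for SIGABRT).
    """
    # Build a tiny script with one import statement per module, in order.
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr
```

For example, calling try_import_order(["horovod.tensorflow", "nvtabular"]) and then try_import_order(["nvtabular", "horovod.tensorflow"]) and comparing the two return codes would show whether only one of the orders fails in a given environment.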
I just tried replicating this on the nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.05 and nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.06 containers, and both example snippets worked for me without seeing the same error =(.
Do you have a container this happens on? I don't currently have access to Azure
Unfortunately there is no way to use containers on Synapse. You should install it from source. Examples of installation scripts are under https://gitlab-master.nvidia.com/eordentlich/criteo-demo-local/-/tree/main/azure-vm/test-installation
I tried out the install script here https://gitlab-master.nvidia.com/eordentlich/criteo-demo-local/-/blob/main/azure-vm/conda-env-setup.sh on my own local dev machine, as well as on top of the merlin-tensorflow:22.06 container - and still couldn't reproduce =(.
@benfred I tried the docker images you mentioned going back to 22.02 and the problematic order is fine in all of these. One difference is the cuda run time/toolkit version which seems to be 11.6 even in 22.02 image, whereas it is 11.4 or below in the environments we've tested in synapse and elsewhere.
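Since the working and failing environments seem to differ mainly in versions, here is a small sketch for recording the Python package versions on each side so they can be diffed. It reads installed metadata only and does not import the packages, so it should not trigger the crash itself. The function name installed_versions is hypothetical, and the package list in the usage note is an assumption about what is relevant here.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible to this interpreter.
            versions[name] = None
    return versions
```

Running installed_versions(["nvtabular", "horovod", "numba", "llvmlite", "tensorflow"]) in both environments gives two dictionaries to compare. Conda-only packages such as cudatoolkit are not Python distributions, so they would not show up this way; conda list covers those.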
We could share the failing environment, but unfortunately there is no way to ssh into the VMs.
@benfred I was able to reproduce by installing the conda env (modified to cudatoolkit=11.6 or 11.4 and nvtabular=1.1.1 - also needed to loosen dask constraint) on top of nvcr.io/nvidia/cuda:11.6.0-devel-ubuntu20.04 (or a corresponding cuda 11.4 image).
@eordentlich - thanks for the tip about the container, I've managed to replicate this now
py-spy shows the python stack trace looking like
Process 808419: /root/miniconda3/envs/criteo_demo/bin/python
Python v3.8.13 (/root/miniconda3/envs/criteo_demo/bin/python3.8)
Thread 0x7F20D12F4740 (active): "MainThread"
__call__ (llvmlite/binding/ffi.py:151)
get_process_triple (llvmlite/binding/targets.py:17)
_create_empty_module (numba/core/codegen.py:1202)
__init__ (numba/core/codegen.py:1169)
init (numba/core/cpu.py:50)
_acquire_compile_lock (numba/core/compiler_lock.py:35)
__init__ (numba/core/base.py:262)
__init__ (numba/core/cpu.py:41)
_toplevel_target_context (numba/core/registry.py:31)
__get__ (functools.py:967)
target_context (numba/core/registry.py:47)
__init__ (numba/core/dispatcher.py:824)
wrapper (numba/core/decorators.py:208)
<module> (nvtabular/ops/column_similarity.py:201)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
exec_module (<frozen importlib._bootstrap_external>:843)
_load_unlocked (<frozen importlib._bootstrap>:671)
_find_and_load_unlocked (<frozen importlib._bootstrap>:975)
_find_and_load (<frozen importlib._bootstrap>:991)
<module> (nvtabular/ops/__init__.py:31)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
exec_module (<frozen importlib._bootstrap_external>:843)
_load_unlocked (<frozen importlib._bootstrap>:671)
_find_and_load_unlocked (<frozen importlib._bootstrap>:975)
_find_and_load (<frozen importlib._bootstrap>:991)
<module> (nvtabular/workflow/node.py:17)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
exec_module (<frozen importlib._bootstrap_external>:843)
_load_unlocked (<frozen importlib._bootstrap>:671)
_find_and_load_unlocked (<frozen importlib._bootstrap>:975)
_find_and_load (<frozen importlib._bootstrap>:991)
<module> (nvtabular/workflow/__init__.py:18)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
exec_module (<frozen importlib._bootstrap_external>:843)
_load_unlocked (<frozen importlib._bootstrap>:671)
_find_and_load_unlocked (<frozen importlib._bootstrap>:975)
_find_and_load (<frozen importlib._bootstrap>:991)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
_handle_fromlist (<frozen importlib._bootstrap>:1042)
<module> (nvtabular/__init__.py:25)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
exec_module (<frozen importlib._bootstrap_external>:843)
_load_unlocked (<frozen importlib._bootstrap>:671)
_find_and_load_unlocked (<frozen importlib._bootstrap>:975)
_find_and_load (<frozen importlib._bootstrap>:991)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
_find_and_load_unlocked (<frozen importlib._bootstrap>:961)
_find_and_load (<frozen importlib._bootstrap>:991)
_call_with_frames_removed (<frozen importlib._bootstrap>:219)
_find_and_load_unlocked (<frozen importlib._bootstrap>:961)
_find_and_load (<frozen importlib._bootstrap>:991)
<module> (<stdin>:1)
It looks like it's segfaulting while numba JIT-compiles this function: https://github.com/NVIDIA-Merlin/NVTabular/blob/32ed5fa123f827dc8b6981a11f1c00c3836a5759/nvtabular/ops/column_similarity.py#L200-L209
This simple code snippet also reproduces the problem for me, and it involves only numba and horovod:
import horovod.tensorflow as hvd
import numba

@numba.njit(parallel=True)
def add_vec(a, b, output):
    for i in numba.prange(len(a)):
        output[i] = a[i] + b[i]
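Where py-spy can't be attached (for example when there is no ssh access to the VM), the standard library's faulthandler may be able to produce a similar Python-level trace. A minimal sketch, assuming the crash ends in a fatal signal such as SIGSEGV or SIGABRT (glibc's free(): invalid pointer normally ends in an abort); the helper name run_with_faulthandler is hypothetical.

```python
import subprocess
import sys


def run_with_faulthandler(code, timeout=300):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Returns (returncode, stderr). If the child dies from a fatal signal,
    stderr should contain the Python traceback dumped by faulthandler.
    """
    result = subprocess.run(
        # -X faulthandler installs the handlers before any user code runs.
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr
```

Passing the reproducer source as the code string, for instance "import horovod.tensorflow\nimport numba\n" followed by the function definition, would run it in isolation and return whatever traceback was written before the process died.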
@wence- came up with a simpler reproducer that only involves llvmlite:
import horovod.tensorflow
import llvmlite.binding
llvmlite.binding.get_process_triple()
@eordentlich, @sohn21c, a reproducer that doesn't involve NVT is available above. @benfred, please post the link to the relevant Slack conversation.
Since this isn't a part of our stack, I'm closing this. @vonodiripsa, were you able to resolve this through llvm and horovod?