
[BUG] nvtabular is sensitive to the order of imports of nvtabular and horovod modules

Open vonodiripsa opened this issue 2 years ago • 12 comments

nvtabular is sensitive to the order of imports of nvtabular and horovod modules

We are trying to run the nvtabular vs. petastorm demo in the Microsoft Synapse dev environment. During the demo installation, @eordentlich reported the problem: "Seems the notebook(s) is very sensitive to when import nvtabular takes place. There are some side effects of this import that we didn't get to fully test in the single node azure set up we had. The protobuf error is one of them. I think this is a bug in nvtabular, personally, but can be worked around in the notebook by placing the above import carefully. Databricks uses an older version of nvtabular which doesn't seem to have these problems (or maybe we got lucky)." "There are some additional import order issues that have popped up here, involving horovod and numba. Users should not have to find error free import orders in these cases, nor is it obvious that import reordering can resolve the errors."

Steps/Code to reproduce bug:

We didn't try it outside of the Synapse dev environment. In that environment, open eordentlich/train-pytorch or train-tensorflow and move the import nvtabular statement relative to the horovod or numba imports until errors appear.

Expected behavior:

Users should not have to find error-free import orders in these cases.

Environment details (please complete the following information):

  • Azure NV-WWFO subscription, nvtabular-synapse-demo Synapse workspace
  • Method of NVTabular install: mix of conda and source installation. Part of the installation was performed by MSFT without any feedback

Additional context:

No additional context about the problem here.

vonodiripsa avatar May 14 '22 19:05 vonodiripsa

Have you finished any investigation? This is blocking us from moving nvtabular into a production environment.

vonodiripsa avatar Jun 03 '22 20:06 vonodiripsa

One instance of this:

import horovod.tensorflow as hvd
from nvtabular.loader.tensorflow import KerasSequenceLoader
from nvtabular import Dataset
train_ds = KerasSequenceLoader(…

triggers the error free(): invalid pointer, but:

from nvtabular.loader.tensorflow import KerasSequenceLoader
import horovod.tensorflow as hvd
from nvtabular import Dataset
train_ds = KerasSequenceLoader(…

does not.
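Because the failing order aborts the interpreter with free(): invalid pointer rather than raising a Python exception, one way to compare orders is to run each one in a fresh subprocess and look at the exit code. The following is only a minimal sketch: the helper name try_import_order is hypothetical, and it exercises the import statements only, so if the crash in a given environment happens later (for example when the loader is constructed), the generated code string would need to be extended.

```python
import subprocess
import sys


def try_import_order(modules, timeout=300):
    """Import `modules` in the given order in a fresh interpreter.

    Returns the child's exit code: 0 means every import succeeded, a
    negative value means the child was killed by a signal (for example
    -6 for SIGABRT on Linux), and any other value means a Python-level
    failure such as ImportError.
    """
    code = "\n".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


# Example usage with the two orders from the snippets above
# (requires horovod and nvtabular to be installed):
#
# for order in (
#     ["horovod.tensorflow", "nvtabular.loader.tensorflow"],
#     ["nvtabular.loader.tensorflow", "horovod.tensorflow"],
# ):
#     print(order, "->", try_import_order(order))
```

Running each order in its own process keeps a crash in one order from taking down the notebook or test runner that is doing the comparison.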

eordentlich avatar Jun 03 '22 21:06 eordentlich

I just tried replicating this on the nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.05 and on the nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.06 containers, and both example snippets worked for me without seeing the same error =(.

Do you have a container this happens on? I don't currently have access to Azure

benfred avatar Jul 07 '22 18:07 benfred

Unfortunately there is no way to use containers on Synapse. You should install it from source. Examples of installation scripts are under https://gitlab-master.nvidia.com/eordentlich/criteo-demo-local/-/tree/main/azure-vm/test-installation

vonodiripsa avatar Jul 07 '22 19:07 vonodiripsa

I tried out the install script here https://gitlab-master.nvidia.com/eordentlich/criteo-demo-local/-/blob/main/azure-vm/conda-env-setup.sh on my own local dev machine, as well as on top of the merlin-tensorflow:22.06 container - and still couldn't reproduce =(.

benfred avatar Jul 09 '22 00:07 benfred

@benfred I tried the docker images you mentioned going back to 22.02 and the problematic order is fine in all of these. One difference is the CUDA runtime/toolkit version, which seems to be 11.6 even in the 22.02 image, whereas it is 11.4 or below in the environments we've tested in Synapse and elsewhere.
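When comparing a working container against a failing environment like this, it can help to dump the installed versions of the packages involved in each one. A small sketch, assuming the pip/conda distribution names listed in PACKAGES (they are assumptions, not taken from the install scripts); note that the CUDA toolkit itself is generally not visible through Python package metadata and has to be checked separately.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to the environment.
PACKAGES = ["nvtabular", "horovod", "numba", "llvmlite", "tensorflow"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```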

eordentlich avatar Jul 09 '22 01:07 eordentlich

And we could share the failing environment, but unfortunately there is no way to ssh to the VMs

vonodiripsa avatar Jul 09 '22 05:07 vonodiripsa

@benfred I was able to reproduce by installing the conda env (modified to cudatoolkit=11.6 or 11.4 and nvtabular=1.1.1; I also needed to loosen the dask constraint) on top of nvcr.io/nvidia/cuda:11.6.0-devel-ubuntu20.04 (or a corresponding cuda 11.4 image).

eordentlich avatar Jul 12 '22 18:07 eordentlich

@eordentlich - thanks for the tip about the container, I've managed to replicate this now

benfred avatar Jul 25 '22 19:07 benfred

py-spy shows the python stack trace looking like

Process 808419: /root/miniconda3/envs/criteo_demo/bin/python
Python v3.8.13 (/root/miniconda3/envs/criteo_demo/bin/python3.8)

Thread 0x7F20D12F4740 (active): "MainThread"
    __call__ (llvmlite/binding/ffi.py:151)
    get_process_triple (llvmlite/binding/targets.py:17)
    _create_empty_module (numba/core/codegen.py:1202)
    __init__ (numba/core/codegen.py:1169)
    init (numba/core/cpu.py:50)
    _acquire_compile_lock (numba/core/compiler_lock.py:35)
    __init__ (numba/core/base.py:262)
    __init__ (numba/core/cpu.py:41)
    _toplevel_target_context (numba/core/registry.py:31)
    __get__ (functools.py:967)
    target_context (numba/core/registry.py:47)
    __init__ (numba/core/dispatcher.py:824)
    wrapper (numba/core/decorators.py:208)
    <module> (nvtabular/ops/column_similarity.py:201)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    exec_module (<frozen importlib._bootstrap_external>:843)
    _load_unlocked (<frozen importlib._bootstrap>:671)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:975)
    _find_and_load (<frozen importlib._bootstrap>:991)
    <module> (nvtabular/ops/__init__.py:31)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    exec_module (<frozen importlib._bootstrap_external>:843)
    _load_unlocked (<frozen importlib._bootstrap>:671)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:975)
    _find_and_load (<frozen importlib._bootstrap>:991)
    <module> (nvtabular/workflow/node.py:17)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    exec_module (<frozen importlib._bootstrap_external>:843)
    _load_unlocked (<frozen importlib._bootstrap>:671)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:975)
    _find_and_load (<frozen importlib._bootstrap>:991)
    <module> (nvtabular/workflow/__init__.py:18)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    exec_module (<frozen importlib._bootstrap_external>:843)
    _load_unlocked (<frozen importlib._bootstrap>:671)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:975)
    _find_and_load (<frozen importlib._bootstrap>:991)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    _handle_fromlist (<frozen importlib._bootstrap>:1042)
    <module> (nvtabular/__init__.py:25)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    exec_module (<frozen importlib._bootstrap_external>:843)
    _load_unlocked (<frozen importlib._bootstrap>:671)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:975)
    _find_and_load (<frozen importlib._bootstrap>:991)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:961)
    _find_and_load (<frozen importlib._bootstrap>:991)
    _call_with_frames_removed (<frozen importlib._bootstrap>:219)
    _find_and_load_unlocked (<frozen importlib._bootstrap>:961)
    _find_and_load (<frozen importlib._bootstrap>:991)
    <module> (<stdin>:1)
It looks like it's segfaulting in numba while JIT compiling this function: https://github.com/NVIDIA-Merlin/NVTabular/blob/32ed5fa123f827dc8b6981a11f1c00c3836a5759/nvtabular/ops/column_similarity.py#L200-L209

This simple code snippet also reproduces the problem for me, and it involves only numba and horovod:

import horovod.tensorflow as hvd
import numba

@numba.njit(parallel=True)
def add_vec(a, b, output):
    for i in numba.prange(len(a)):
        output[i] = a[i] + b[i]

benfred avatar Jul 25 '22 20:07 benfred

@wence- came up with a simpler reproducer that only involves llvmlite:

import horovod.tensorflow
import llvmlite.binding
llvmlite.binding.get_process_triple() 
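
The thread does not establish a root cause, but with a reproducer this small, one possible next step is to see which native shared libraries each import maps into the process. The sketch below is Linux-only (it reads /proc/<pid>/maps), and the helper names mapped_libraries and libraries_added_by are hypothetical.

```python
import os


def mapped_libraries(pid="self"):
    """Return the set of shared-object paths mapped into a process (Linux only)."""
    maps_path = f"/proc/{pid}/maps"
    if not os.path.exists(maps_path):
        return set()  # not Linux, or /proc is unavailable
    libs = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, [pathname]
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                if ".so" in os.path.basename(path):
                    libs.add(path)
    return libs


def libraries_added_by(import_fn):
    """Run `import_fn` and return the shared objects it newly mapped."""
    before = mapped_libraries()
    import_fn()
    return mapped_libraries() - before


# Example usage (requires horovod to be installed):
#
# def load_horovod():
#     import horovod.tensorflow
#
# print(sorted(libraries_added_by(load_horovod)))
```

Comparing the output for horovod.tensorflow and llvmlite.binding in separate processes may show whether the two bring in overlapping native libraries, which is only a starting point for investigation.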

benfred avatar Jul 26 '22 16:07 benfred

@eordentlich, @sohn21c, a reproducer that doesn't involve NVT is available above. @benfred, please post the link to the relevant Slack conversation.

viswa-nvidia avatar Aug 15 '22 16:08 viswa-nvidia

Since this isn't part of our stack, I'm closing this issue. @vonodiripsa, were you able to resolve this through llvm and horovod?

EvenOldridge avatar Oct 17 '22 23:10 EvenOldridge