
[BUG] Segmentation Fault when using tfdlpack.to_dlpack on tf.tensor

Open awthomp opened this issue 5 years ago • 9 comments

I've been experimenting with tfdlpack to connect libraries that use __cuda_array_interface__ to TensorFlow, and I hit a segmentation fault when invoking to_dlpack on a TF tensor. See below for a reproduction:

import cupy as cp
import tfdlpack

# CuPy - GPU Array (like NumPy!)
gpu_arr = cp.random.rand(10_000, 10_000)

# Use CuPy's built in `toDlpack` function to move to a DLPack capsule
dlpack_arr = gpu_arr.toDlpack()

# Use `tfdlpack` to migrate to TensorFlow
tf_tensor = tfdlpack.from_dlpack(dlpack_arr)

# Confirm TF tensor is on GPU
print(tf_tensor.device)

# Use `tfdlpack` to migrate back to CuPy; this yields a segmentation fault
dlpack_capsule = tfdlpack.to_dlpack(tf_tensor)

I'm using 1 GP100 isolated with the CUDA_VISIBLE_DEVICES environment variable.
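
Because a segmentation fault kills the interpreter without a Python traceback, it can help to run a reproduction like the one above in a child process. The following is a minimal sketch using only the standard library; the helper names `run_isolated` and `describe` are hypothetical and not part of tfdlpack. It assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=300):
    """Run `snippet` in a fresh interpreter; return (returncode, stderr).

    Hypothetical helper for illustration. `-X faulthandler` makes the child
    dump a Python traceback to stderr when it receives a fatal signal.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Translate a child's return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode
```

Passing the reproduction script as a string to `run_isolated` and feeding the return code to `describe` should distinguish a clean exit from a crash, and the captured stderr should show which Python line was executing when the fatal signal arrived.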

awthomp avatar Jan 17 '20 17:01 awthomp

Confirmed this is a bug. I replaced cupy with torch and it also crashes.

import torch
from torch.utils import dlpack as th_dlpack
import tfdlpack

gpu_arr = torch.rand(10_000, 10_000).cuda()
print(gpu_arr)

dlpack_arr = th_dlpack.to_dlpack(gpu_arr)

# Use `tfdlpack` to migrate to TensorFlow
tf_tensor = tfdlpack.from_dlpack(dlpack_arr)

# Confirm TF tensor is on GPU
print(tf_tensor.device)

# Use `tfdlpack` to convert the TF tensor back to a DLPack capsule; this yields a segmentation fault
dlpack_capsule = tfdlpack.to_dlpack(tf_tensor)

jermainewang avatar Jan 18 '20 08:01 jermainewang

What's your tensorflow version? I found the code works with tensorflow v2.1.0 but not v2.0.0.
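
When comparing environments like this, a short script that reports installed versions can save a round trip. This is a sketch only: it assumes Python 3.8 or newer (for `importlib.metadata`), and the distribution names in `PACKAGES` are guesses that may not match every install (CuPy wheels, for instance, are often published under CUDA-specific names).

```python
from importlib import metadata

# Assumed distribution names; adjust the tuple to match your own install.
PACKAGES = ("tensorflow", "tensorflow-gpu", "tfdlpack-gpu", "cupy", "torch")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```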

jermainewang avatar Jan 18 '20 09:01 jermainewang

It works well on my machine. I'm using tensorflow 2.1.0

VoVAllen avatar Jan 18 '20 09:01 VoVAllen

What's your tensorflow version? I found the code works with tensorflow v2.1.0 but not v2.0.0.

Interesting. I was on TF 2.1.0 when I submitted the bug report. I've included an Anaconda environment file below so that we're on the same page for software dependencies:

name: tfdlpack
channels:
  - conda-forge
  - nvidia
  - pytorch
  - defaults
  - numba
dependencies:
  - python=3.7
  - numpy
  - cudatoolkit>=9.2,<10.2
  - numba
  - cupy>=6.2.0
  - pytorch
  - pip
  - pip:
      - tfdlpack-gpu

Just save this into a file named tfdlpack_conda.yml. Then run:

conda env create -f tfdlpack_conda.yml
conda activate tfdlpack

My system contains 2 GP100s (Pascal P100) and 1 P2000 to drive graphics. I typically isolate GPU0 (P100) with export CUDA_VISIBLE_DEVICES=0.
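
The same isolation can be done from inside a Python script instead of the shell. A minimal sketch, assuming the variable is set before any CUDA-using framework is imported; note that which physical GPU index 0 refers to depends on the CUDA device ordering (`CUDA_DEVICE_ORDER`), so it may not match the index shown by `nvidia-smi`.

```python
import os

# Must run before importing TensorFlow, CuPy, or PyTorch: the CUDA runtime
# reads this variable when it initializes, so later changes have no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Frameworks imported after this point enumerate only the selected GPU,
# which they will see as device 0.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print("Visible CUDA devices:", visible)
```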

awthomp avatar Jan 18 '20 13:01 awthomp

I'm also receiving the segfault with an NVIDIA T4. Here's a Google Colab notebook that you can run through. Perhaps pip install tfdlpack-gpu isn't pulling in all the expected/necessary dependencies?

https://colab.research.google.com/drive/18Z8bOCJ2Mr-jOD-vIbr6KAO1-KPUy_UM

awthomp avatar Jan 18 '20 16:01 awthomp

Thanks for your example. I'm actually thinking of reorganizing the whole project based on the new TensorFlow custom-op repo, https://github.com/tensorflow/custom-op, since it is the official guide on how to distribute custom ops. However, I'm not sure whether I should base the project on Bazel instead of CMake. I may need more time on this.

VoVAllen avatar Jan 18 '20 16:01 VoVAllen

Thanks, @VoVAllen, and thanks for your hard work on enabling DLPack support in TensorFlow. Don't hesitate to let us know what you need help with.

awthomp avatar Jan 18 '20 16:01 awthomp

@awthomp I've updated the binary release and it now works in Colab. Could you try it in your environment again?

However, there is still a bug in this release. It occurs when you create a capsule from TensorFlow but do not consume it in another framework. I'm still investigating a solution.
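
For context on the unconsumed-capsule case: the thread does not state the root cause, so the following is only a toy model of the general DLPack capsule convention, not tfdlpack's actual implementation. In that convention a fresh capsule is named "dltensor"; a consumer renames it to "used_dltensor" and becomes responsible for calling the deleter, and the capsule's own destructor calls the deleter only if the capsule was never consumed. The class `ModelCapsule` and its methods are hypothetical names used for illustration.

```python
class ModelCapsule:
    """Toy stand-in for a DLPack capsule; not tfdlpack's real implementation."""

    def __init__(self, deleter):
        self.name = "dltensor"  # name of a capsule nobody has consumed yet
        self._deleter = deleter
        self._closed = False

    def consume(self):
        """Consumer side: take ownership and mark the capsule as used."""
        if self.name != "dltensor":
            raise ValueError("capsule was already consumed")
        self.name = "used_dltensor"
        return self._deleter  # the consumer must call this when it is done

    def close(self):
        """Producer side: what the capsule destructor is expected to do."""
        if self._closed:
            return
        self._closed = True
        if self.name == "dltensor":
            # Never consumed, so the capsule itself must release the memory.
            self._deleter()


freed = []

# Case 1: the capsule is created and then dropped without being consumed.
unused = ModelCapsule(lambda: freed.append("unused"))
unused.close()

# Case 2: another framework consumes the capsule and frees the memory later.
used = ModelCapsule(lambda: freed.append("used"))
release = used.consume()
used.close()  # must not free: ownership has moved to the consumer
release()
```

In this model each buffer is released exactly once on both paths. A producer that mishandles the first path would leak or double-free only when the capsule is never consumed, which is the situation described above.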

VoVAllen avatar Jan 19 '20 15:01 VoVAllen

@VoVAllen. Wahoo! Works for me in both Colab on a T4 and on my local machine with a P100. Thanks for the quick fix!

awthomp avatar Jan 19 '20 16:01 awthomp