[BUG] `HDFS list directory failed` when running single node multi-GPU setup

Open lucharo opened this issue 4 years ago • 5 comments

Describe the bug

I am trying to set up a single-node, multi-GPU notebook following the documentation, but I get the following error:

HDFS list directory failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port. Filesystem HDFS=>hdfs.driver.type:LIBHDFS|hdfs.host:adress.company.com|hdfs.kerberos.ticket:/tmp/krb5cc_132855|hdfs.port:8020|hdfs.user:username with dask worker
tcp://127.0.0.1:41777

My environment only allows multi-GPU notebooks via papermill, hence I am testing this with a single GPU.

Steps/Code to reproduce bug

  • Initialise Blazing environment, declare env variables, etc
  • Run the following in a jupyter or ipython session
num_gpus = !nvidia-smi --list-gpus 2>/dev/null | wc -l
num_gpus = int(num_gpus[0])
print(f'Using {num_gpus} GPU(s)')
from blazingsql import BlazingContext
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
bc = BlazingContext(dask_client = client, pool = True, initial_pool_size = num_gpus*32*10**9)
# or bc = BlazingContext(dask_client = client, network_interface = 'eth0', pool = True, initial_pool_size = num_gpus*32*10**9)
# or bc = BlazingContext(dask_client = client, network_interface = 'lo', pool = True, initial_pool_size = num_gpus*32*10**9)
## all above return same error
location = 'hdfs://server/company/user/username/schema.db/table/'

bc.hdfs('server',
        host='adress.company.com',
        port = 8020,
        user='username',
        kerb_ticket="/tmp/krb5cc_132855")
  • Then I get the error posted above
  • Extra: the values of my cluster and client variables:
(LocalCUDACluster(b0f8beea, 'tcp://127.0.0.1:36006', workers=1, threads=1, memory=32.21 GB),
 <Client: 'tcp://127.0.0.1:36006' processes=1 threads=1, memory=32.21 GB>)

Expected behavior

BlazingContext should be ready, as per the documentation.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of BlazingSQL install: conda

Environment details

conda list output below:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
abseil-cpp                20200225.2           he1b5a44_2    conda-forge
alsa-lib                  1.2.3                h516909a_0    conda-forge
arrow-cpp                 1.0.1           py37h2318771_14_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
aws-c-common              0.4.59               h36c2ea0_1    conda-forge
aws-c-event-stream        0.1.6                had2084c_6    conda-forge
aws-checksums             0.1.10               h4e93380_0    conda-forge
aws-sdk-cpp               1.8.63               h9b98462_0    conda-forge
backcall                  0.2.0                    pypi_0    pypi
blazingsql                0.17.0                   pypi_0    pypi
bokeh                     2.2.3            py37hc8dfbb8_0    conda-forge
boost-cpp                 1.72.0               h9d3c048_4    conda-forge
brotli                    1.0.9                h9c3ff4c_4    conda-forge
brotlipy                  0.7.0           py37hb5d75c8_1001    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h36c2ea0_0    conda-forge
ca-certificates           2021.1.19            h06a4308_0    anaconda-main-remote
cairo                     1.16.0            h7979940_1007    conda-forge
certifi                   2020.12.5        py37h89c1867_1    conda-forge
cffi                      1.14.4           py37hc58025e_1    conda-forge
chardet                   4.0.0            py37h89c1867_1    conda-forge
click                     7.1.2              pyh9f0ad1d_0    conda-forge
cloudpickle               1.6.0                      py_0    conda-forge
conda                     4.9.2            py37h89c1867_0    conda-forge
conda-package-handling    1.7.2            py37hb5d75c8_0    conda-forge
cryptography              3.3.1            py37h7f0c10b_1    conda-forge
cudatoolkit               10.1.243             h6bb024c_0    nvidia-remote
cudf                      0.17.0          cuda_10.1_py37_gf56ef850e6_0    rapidsai-remote
cudnn                     7.6.5.32             hc0a50b0_1    conda-forge
cupy                      8.4.0            py37hb9ab7da_1    conda-forge
cutensor                  1.2.2.5              h8b44402_2    conda-forge
cyrus-sasl                2.1.27               h3274739_1    conda-forge
cytoolz                   0.11.0           py37h5e8e339_3    conda-forge
dask                      2021.2.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.2.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.17.0                   py37_0    rapidsai-remote
dask-cudf                 0.17.0          py37_gf56ef850e6_0    rapidsai-remote
decorator                 4.4.2                    pypi_0    pypi
distributed               2021.2.0         py37h89c1867_0    conda-forge
dlpack                    0.3                  he1b5a44_1    conda-forge
fastavro                  1.3.1            py37h5e8e339_0    conda-forge
fastrlock                 0.5              py37hcd2ae1e_2    conda-forge
fontconfig                2.13.1            hba837de_1004    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
fsspec                    0.8.5              pyhd8ed1ab_0    conda-forge
future                    0.18.2           py37h89c1867_3    conda-forge
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
giflib                    5.2.1                h516909a_2    conda-forge
glog                      0.4.0                h49b9bf7_3    conda-forge
google-cloud-cpp          1.16.0               he4a878c_2    conda-forge
google-cloud-cpp-common   0.25.0               he83eced_7    conda-forge
googleapis-cpp            0.10.0               h6b1abdc_4    conda-forge
graphite2                 1.3.14               h23475e2_0    anaconda-main-remote
grpc-cpp                  1.32.0               h7997a97_1    conda-forge
gtest                     1.10.0               h4bd325d_7    conda-forge
harfbuzz                  2.7.4                h5cf4720_0    conda-forge
heapdict                  1.0.1                      py_0    conda-forge
icu                       68.1                 h58526e2_0    conda-forge
idna                      2.10               pyh9f0ad1d_0    conda-forge
ipykernel                 5.4.3                    pypi_0    pypi
ipython                   7.20.0                   pypi_0    pypi
ipython-genutils          0.2.0                    pypi_0    pypi
jedi                      0.18.0                   pypi_0    pypi
jinja2                    2.11.3             pyh44b312d_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
jpype1                    1.2.1            py37h2527ec5_0    conda-forge
jupyter-client            6.1.11                   pypi_0    pypi
jupyter-core              4.7.1                    pypi_0    pypi
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libarchive                3.5.1                h899b81a_0    conda-forge
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
libcrc32c                 1.1.1                he1b5a44_2    conda-forge
libcudf                   0.17.0          cuda10.1_gf56ef850e6_0    rapidsai-remote
libcurl                   7.71.1               hcdd3856_8    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               hcdb4288_3    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            7.5.0               h14aa051_18    conda-forge
libgfortran4              7.5.0               h14aa051_18    conda-forge
libglib                   2.66.6               h1f3bc88_3    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libhwloc                  2.3.0                h5e5b7d1_1    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libntlm                   1.5                  h7b6447c_0    anaconda-main-remote
libopenblas               0.3.12          pthreads_hb3c22a3_1    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libprotobuf               3.13.0.1             h8b12597_0    conda-forge
librmm                    0.17.0          cuda10.1_gc4cc945_0    rapidsai-remote
libsodium                 1.0.18               h516909a_1    conda-forge
libsolv                   0.7.17               h780b84a_0    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libthrift                 0.13.0               hbe8ec66_6    conda-forge
libtiff                   4.2.0                hdc55705_0    conda-forge
libutf8proc               2.6.1                h7f98852_0    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libwebp-base              1.2.0                h7f98852_0    conda-forge
libxcb                    1.14                 h7b6447c_0    anaconda-main-remote
libxml2                   2.9.10               h72842e0_3    conda-forge
llvmlite                  0.35.0           py37h9d7f4d0_1    conda-forge
locket                    0.2.1            py37h06a4308_1    anaconda-main-remote
lz4-c                     1.9.2                he1b5a44_3    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
mamba                     0.7.12           py37h7f483ca_0    conda-forge
markupsafe                1.1.1            py37h5e8e339_3    conda-forge
msgpack-python            1.0.2            py37h2527ec5_1    conda-forge
nccl                      2.8.4.1              h8b44402_0    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
netifaces                 0.10.9          py37h8f50634_1003    conda-forge
numba                     0.52.0           py37hdc94413_0    conda-forge
numpy                     1.19.5           py37haa41c4c_1    conda-forge
nvtx                      0.2.3            py37h5e8e339_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openjdk                   11.0.8               hacce0ff_0    conda-forge
openssl                   1.1.1i               h7f98852_0    conda-forge
orc                       1.6.5                hd3605a7_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandas                    1.1.5            py37hdc94413_0    conda-forge
parquet-cpp               1.5.1                         1    conda-forge
parso                     0.8.1                    pypi_0    pypi
partd                     1.1.0                      py_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pexpect                   4.8.0                    pypi_0    pypi
pickleshare               0.7.5                    pypi_0    pypi
pillow                    8.1.0            py37h4600e1f_2    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
prompt-toolkit            3.0.16                   pypi_0    pypi
protobuf                  3.13.0.1         py37h3340039_1    conda-forge
psutil                    5.8.0            py37h5e8e339_1    conda-forge
ptyprocess                0.7.0                    pypi_0    pypi
pyarrow                   1.0.1           py37hbeecfa9_14_cuda    conda-forge
pycosat                   0.6.3           py37h5e8e339_1006    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.7.4                    pypi_0    pypi
pyhive                    0.6.3              pyhd3deb0d_0    conda-forge
pynvml                    8.0.4                      py_1    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysocks                   1.7.1            py37h89c1867_3    conda-forge
python                    3.7.9           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
pyyaml                    5.4.1            py37h5e8e339_0    conda-forge
pyzmq                     22.0.3                   pypi_0    pypi
re2                       2020.10.01           he1b5a44_0    conda-forge
readline                  8.1                  h27cfd23_0    anaconda-main-remote
reproc                    14.2.1               h36c2ea0_0    conda-forge
reproc-cpp                14.2.1               h58526e2_0    conda-forge
requests                  2.25.1             pyhd3deb0d_0    conda-forge
rmm                       0.17.0          cuda_10.1_py37_gc4cc945_0    rapidsai-remote
ruamel_yaml               0.15.87          py37h7b6447c_1    anaconda-main-remote
sasl                      0.2.1           py37h3340039_1002    conda-forge
setuptools                52.0.0           py37h06a4308_0    anaconda-main-remote
six                       1.15.0             pyh9f0ad1d_0    conda-forge
snappy                    1.1.8                he1b5a44_3    conda-forge
sortedcontainers          2.3.0              pyhd8ed1ab_0    conda-forge
spdlog                    1.7.0                hc9558a2_2    conda-forge
sqlalchemy                1.3.23           py37h5e8e339_0    conda-forge
sqlite                    3.34.0               h74cdb3f_0    conda-forge
tblib                     1.7.0                      py_0    anaconda-main-remote
thrift                    0.13.0           py37h3340039_2    conda-forge
thrift_sasl               0.4.2            py37h8f50634_0    conda-forge
tk                        8.6.10               hed695b0_1    conda-forge
toolz                     0.11.1                     py_0    conda-forge
tornado                   6.1              py37h5e8e339_1    conda-forge
tqdm                      4.56.1             pyhd8ed1ab_0    conda-forge
traitlets                 5.0.5                    pypi_0    pypi
typing_extensions         3.7.4.3                    py_0    conda-forge
ucx                       1.8.1+g6b29558       cuda10.1_0    rapidsai-remote
ucx-proc                  1.0.0                       gpu    rapidsai-remote
ucx-py                    0.17.0          py37_g6b29558_0    rapidsai-remote
urllib3                   1.26.3             pyhd8ed1ab_0    conda-forge
wcwidth                   0.2.5                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xorg-fixesproto           5.0               h14c3975_1002    conda-forge
xorg-inputproto           2.3.2             h14c3975_1002    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.12               h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxfixes            5.0.3             h516909a_1004    conda-forge
xorg-libxi                1.7.10               h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-libxtst              1.2.3             h516909a_1002    conda-forge
xorg-recordproto          1.14.2            h516909a_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yaml                      0.2.5                h516909a_0    conda-forge
zeromq                    4.3.3                h58526e2_3    conda-forge
zict                      2.0.0                      py_0    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.8                hdf46e1d_0    conda-forge

lucharo avatar Feb 18 '21 17:02 lucharo

Does your setup above work without using dask? That is, just creating your BlazingContext like so: bc = BlazingContext()

wmalpica avatar Feb 19 '21 20:02 wmalpica

Hi @williamBlazing, yes, my setup works if I don't use dask_cuda

lucharo avatar Feb 19 '21 20:02 lucharo

So you are saying this works:

bc = BlazingContext()

bc.hdfs('server',
        host='adress.company.com',
        port = 8020,
        user='username',
        kerb_ticket="/tmp/krb5cc_132855")

but this does not:

bc = BlazingContext(dask_client = client, pool = True, initial_pool_size = num_gpus*32*10**9)
bc.hdfs('server',
        host='adress.company.com',
        port = 8020,
        user='username',
        kerb_ticket="/tmp/krb5cc_132855")

And this is on a local computer with one GPU, not via papermill, correct?

That would be rather strange. The difference between the two is that in the second one, it's the dask worker's process that is trying to connect to HDFS, instead of the main Python process. In that case, something about the environment is different for the dask worker.
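One way to check this is to compare the environment of the main process with that of each dask worker. The following is only a minimal sketch: `Client.run` is part of dask.distributed and executes a function inside every worker process, but the helper names (`snapshot_env`, `diff_worker_env`) are hypothetical, and the list of variables is an assumption about what libhdfs and Kerberos usually depend on, not something taken from the BlazingSQL documentation.

```python
import os

# Assumption: variables that libhdfs / Kerberos setups commonly rely on.
# Adjust this tuple for your own environment.
ENV_KEYS = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH", "LD_LIBRARY_PATH", "KRB5CCNAME")


def snapshot_env(keys=ENV_KEYS):
    """Return the selected environment variables of the current process."""
    return {key: os.environ.get(key) for key in keys}


def diff_worker_env(client, keys=ENV_KEYS):
    """Compare the main process environment with each dask worker's.

    `client` is expected to be a dask.distributed.Client. Client.run calls
    the function in every worker process and returns {worker_address: result}.
    The return value maps each worker address to the variables that differ,
    as (main_process_value, worker_value) pairs.
    """
    local = snapshot_env(keys)
    per_worker = client.run(snapshot_env, keys)
    return {
        address: {k: (local[k], remote[k]) for k in keys if local[k] != remote[k]}
        for address, remote in per_worker.items()
    }
```

Calling `diff_worker_env(client)` with the `client` from the reproduction steps should return an empty dict for every worker whose environment matches the main process. Any non-empty entry would be a candidate for the difference described above.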

wmalpica avatar Feb 19 '21 21:02 wmalpica

@williamBlazing That is exactly it!

I've also wondered whether something is off with the dask_client's connection (tcp://127.0.0.1:PORT).

I am not really familiar with dask, so I cannot do much debugging, but please let me know if you have any ideas and I'll try them out as soon as I can.

lucharo avatar Feb 19 '21 21:02 lucharo

Hi @lucharo, thanks for this report! I think I know the reason: the Kerberos ticket file /tmp/krb5cc_132855 needs to be present on all the nodes where dask is running. For instance:

  • If you have a dask worker on machine A, then machine A needs to have /tmp/krb5cc_132855
  • If you have a dask worker on machine B, then machine B needs to have /tmp/krb5cc_132855
  • ... and so on

When you run without dask, you don't have to worry about this, but execution will happen on a single machine only. Let us know if this helps! PS: I'm cc'ing @rommelDB too ;)
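To see whether every worker can actually reach the ticket file, something like the following might help. It is a hedged sketch, not part of BlazingSQL: `Client.run` is a real dask.distributed method that runs a function in each worker process, while `probe_ticket` and `workers_missing_ticket` are hypothetical helper names introduced here for illustration.

```python
import os


def probe_ticket(path):
    """Report whether the Kerberos ticket cache at `path` is usable here."""
    return {
        "exists": os.path.isfile(path),
        "readable": os.access(path, os.R_OK),
    }


def workers_missing_ticket(client, path):
    """Return the addresses of dask workers that cannot read `path`.

    `client` is expected to be a dask.distributed.Client; Client.run returns
    a dict mapping each worker address to the function's result.
    """
    report = client.run(probe_ticket, path)
    return sorted(
        address
        for address, status in report.items()
        if not (status["exists"] and status["readable"])
    )
```

With the `client` from the original report, `workers_missing_ticket(client, "/tmp/krb5cc_132855")` would list the workers that still need the ticket copied to them. An empty list suggests the file itself is reachable everywhere and the cause lies elsewhere.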

aucahuasi avatar Apr 16 '21 23:04 aucahuasi