[BUG] `HDFS list directory failed` when running single node multi-GPU setup
Describe the bug
I am trying to set up a single node, multi GPU notebook following the documentation but I get the following error:
HDFS list directory failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port. Filesystem HDFS=>hdfs.driver.type:LIBHDFS||hdfs.kerberos.ticket:/tmp/krb5cc_132855|hdfs.port:8020|hdfs.user:username with dask worker
My environment only allows multi-GPU notebooks via papermill, hence I am testing this with a single GPU.
Steps/Code to reproduce bug
- Initialise Blazing environment, declare env variables, etc
- Run the following in a jupyter or ipython session
num_gpus = !nvidia-smi --list-gpus 2>/dev/null | wc -l
num_gpus = int(num_gpus[0])
print(f'Using {num_gpus} GPU(s)')
from blazingsql import BlazingContext
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
bc = BlazingContext(dask_client = client, pool = True, initial_pool_size = num_gpus*32*10**9)
# or bc = BlazingContext(dask_client = client, network_interface = 'eth0', pool = True, initial_pool_size = num_gpus*32*10**9)
# or bc = BlazingContext(dask_client = client, network_interface = 'lo', pool = True, initial_pool_size = num_gpus*32*10**9)
## all above return same error
location = 'hdfs://server/company/user/username/schema.db/table/'
port = 8020,
- Then I get the error I posted above
- Extra, the value of my
(LocalCUDACluster(b0f8beea, 'tcp://', workers=1, threads=1, memory=32.21 GB),
<Client: 'tcp://' processes=1 threads=1, memory=32.21 GB>)
Expected behavior
to be ready as per the documentation
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of BlazingSQL install: conda
Environment details
conda list
output below:
Does your setup above work without using dask? That is, just creating your BlazingContext like so:
bc = BlazingContext()
Hi @williamBlazing, yes, my setup works if I don't use dask_cuda
So you are saying this works:
bc = BlazingContext()
port = 8020,
but this does not:
bc = BlazingContext(dask_client = client, pool = True, initial_pool_size = num_gpus*32*10**9)
port = 8020,
And this is on a local computer with one GPU, not via papermill, correct?
That would be rather strange. The difference between the two is that in the second one, it's the dask worker's process that is trying to connect to HDFS, instead of the main Python process. In that case, something about the environment must be different for the dask worker.
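One way to test that theory is to compare a few environment variables between the main process and each dask worker. This is only a sketch: `ENV_KEYS`, `env_snapshot` and `diff_against_workers` are hypothetical names, and the list of variables is an assumption about what libhdfs and Kerberos commonly read, not something confirmed in this thread. It relies on `Client.run`, which executes a function inside every worker process and returns a dict keyed by worker address.

```python
import os

# Variables that commonly influence libhdfs / Kerberos setup.
# ASSUMPTION: this list is illustrative; adjust it to your environment.
ENV_KEYS = ("KRB5CCNAME", "HADOOP_HOME", "JAVA_HOME", "CLASSPATH", "LD_LIBRARY_PATH")


def env_snapshot(keys=ENV_KEYS):
    """Return the selected environment variables of the current process."""
    return {key: os.environ.get(key) for key in keys}


def diff_against_workers(client, keys=ENV_KEYS):
    """Return {worker_address: {key: (driver_value, worker_value)}} for keys that differ."""
    driver = env_snapshot(keys)
    # Client.run executes env_snapshot inside every worker process.
    per_worker = client.run(env_snapshot, keys)
    return {
        address: {
            key: (driver[key], snapshot[key])
            for key in keys
            if driver[key] != snapshot[key]
        }
        for address, snapshot in per_worker.items()
    }
```

With the `client` from the reproduction steps, `diff_against_workers(client)` should return an empty inner dict for every worker whose environment matches the main process; any non-empty entry is a candidate for the difference.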
@williamBlazing That is exactly it!
I've also wondered whether something is off with the dask_client's connection (tcp:
I am not really familiar with dask, so I cannot do much debugging, but please let me know if you have any ideas and I'll try them out as soon as I can.
Hi @lucharo thanks for this report!
I think I know the reason: the Kerberos ticket file /tmp/krb5cc_132855 needs to be present on every node where a dask worker is running. For instance:
If you have a dask worker on machine A, then machine A needs to have /tmp/krb5cc_132855
If you have a dask worker on machine B, then machine B needs to have /tmp/krb5cc_132855
... and so on
When you run without dask you don't have to worry about this, but execution will then be limited to a single machine. Let us know if this helps! PS: I'm cc'ing @rommelDB too ;)
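To check whether each worker can actually see the ticket, something like the following could be run from the notebook. It is a minimal sketch: `ticket_status` and `workers_missing_ticket` are hypothetical helper names, and the path is simply the one from the error message. It only tests that the file exists and is readable from the worker process, not that the ticket is still valid.

```python
import os
import socket

# Ticket cache path taken from the error message in this issue.
TICKET_PATH = "/tmp/krb5cc_132855"


def ticket_status(path=TICKET_PATH):
    """Report whether the ticket cache is visible from the current process."""
    return {
        "host": socket.gethostname(),
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
    }


def workers_missing_ticket(client, path=TICKET_PATH):
    """Return the addresses of workers that cannot read the ticket cache."""
    # Client.run executes ticket_status inside every worker process and
    # returns a dict keyed by worker address.
    statuses = client.run(ticket_status, path)
    return sorted(
        address
        for address, status in statuses.items()
        if not (status["exists"] and status["readable"])
    )
```

An empty list from `workers_missing_ticket(client)` would suggest the file itself is reachable from every worker, in which case the difference is more likely in the worker's environment or user than in the file's location.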