cylon
Cylon container fails
Hello @nirandaperera and Cylon team,
I was testing the Cylon container with Kubernetes on AWS, on a cluster with a multi-node MPI environment.
I tested Cylon on 1 and 2 nodes (each node has 128 cores and 16 GB of memory per core, 2048 GB per node in total). Both runs worked fine when executing a join operation with ~35M rows using the following script: https://github.com/cylondata/cylon/blob/main/summit/scripts/cylon_scaling.py.
The command line that I used:
mpirun -n 256 cylon_scaling.py -s w -n 35000000
I repeated the same setup but this time with 3 or 4 nodes:
mpirun -n 384 cylon_scaling.py -s w -n 35000000
mpirun -n 512 cylon_scaling.py -s w -n 35000000
And I started getting the following error:
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],135][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) fail[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp_endpoint.c:730:mca_btl_tcp_endpoint_start_connect] bind on local address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-worker-1][[60663,1],146][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(258) failed: Bad file descriptor (9)
[cylon-join-worker-1:17259] *** Process received signal ***
[cylon-join-worker-1:17259] Signal: Segmentation fault (11)
[cylon-join-worker-1:17259] Signal code: Address not mapped (1)
[cylon-join-worker-1:17259] Failing at address: (nil)
[cylon-join-worker-1:17259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fe751c04090]
[cylon-join-worker-1:17259] *** End of error message ***
[cylon-join-worker-1:17186] *** Process received signal ***
[cylon-join-worker-1:17186] Signal: Segmentation fault (11)
[cylon-join-worker-1:17186] Signal code: Address not mapped (1)
[cylon-join-worker-1:17186] Failing at address: 0x18
[cylon-join-worker-1:17186] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f50e0825090]
[cylon-join-worker-1:17186] [ 1] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_tcp.so(mca_btl_tcp_endpoint_send+0x609)[0x7f50dbcdbfa9]
[cylon-join-worker-1:17186] [ 2] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x1a1)[0x7f50db6d8bc1]
[cylon-join-worker-1:17186] [ 3] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so(mca_pml_ob1_isend+0x482)[0x7f50db6ca3a2]
[cylon-join-worker-1:17186] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(MPI_Isend+0x12d)[0x7f50dd28893d]
[cylon-join-worker-1:17186] [ 5] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon10MPIChannel13progressSendsEv+0x15a)[0x7f5025e8703a]
[cylon-join-worker-1:17186] [ 6] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon8AllToAll10isCompleteEv+0x233)[0x7f5025e8d103]
[cylon-join-worker-1:17186] [ 7] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon13ArrowAllToAll10isCompleteEv+0x76a)[0x7f5025bb069a]
[cylon-join-worker-1:17186] [ 8] /cylon/build/lib/libcylon.so.0.6.0(+0x4ed43c)[0x7f5025ecc43c]
[cylon-join-worker-1:17186] [ 9] /cylon/build/lib/libcylon.so.0.6.0(+0x4ee323)[0x7f5025ecd323]
[cylon-join-worker-1:17186] [10] /cylon/build/lib/libcylon.so.0.6.0(_ZN5cylon15DistributedJoinERKSt10shared_ptrINS_5TableEES4_RKNS_4join6config10JoinConfigERS2_+0x8a)[0x7f5025ece00a]
[cylon-join-worker-1:17186] [11] /cylon/ENV/lib/python3.8/site-packages/pycylon-0+untagged.1302.g44a27a6-py3.8-linux-x86_64.egg/pycylon/data/table.cpython-38-x86_64-linux-gnu.so(+0x75c02)[0x7f50db67ec02]
[cylon-join-worker-1:17186] [12] /cylon/ENV/bin/python3(PyCFunction_Call+0x59)[0x5f6939]
[cylon-join-worker-1:17186] [13] /cylon/ENV/bin/python3(_PyObject_MakeTpCall+0x296)[0x5f7506]
[cylon-join-worker-1:17186] [14] /cylon/ENV/bin/python3(_PyEval_EvalFrameDefault+0x6259)[0x571019]
[cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)
[cylon-join-launcher:00001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cylon-join-launcher:00001] 24 more processes have sent help message help-mpi-btl-tcp.txt / socket flag fail
[cylon-join-launcher:00001] 93 more processes have sent help message help-mpi-btl-tcp.txt / peer hung up
[cylon-join-launcher:00001] 11 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
Any help here would be appreciated.
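One possible reading of the log above, offered as a hypothesis rather than a confirmed diagnosis: on Linux, `bind()` with port 0 reports "Address already in use" when no free port is left in the ephemeral range. The sketch below estimates an upper bound on the cross-node TCP connections held by one node and compares it with that range. It assumes one pod per node with 128 ranks sharing a single network namespace, a fully connected all-to-all pattern, and the common Linux default range of 32768 to 60999 when `/proc` cannot be read. The function names are illustrative and are not part of Cylon or Open MPI.

```python
def cross_node_connections(total_ranks: int, ranks_per_node: int) -> int:
    """Upper bound on TCP connections from one node's ranks to ranks on other nodes."""
    remote_ranks = max(total_ranks - ranks_per_node, 0)
    return ranks_per_node * remote_ranks


def ephemeral_port_count(path: str = "/proc/sys/net/ipv4/ip_local_port_range",
                         default: tuple = (32768, 60999)) -> int:
    """Size of the ephemeral port range; falls back to an assumed default."""
    try:
        with open(path) as f:
            low, high = map(int, f.read().split())
    except (OSError, ValueError):
        low, high = default  # assumption: common Linux default
    return high - low + 1


if __name__ == "__main__":
    ports = ephemeral_port_count()
    for total in (256, 384, 512):  # rank counts from the runs above
        need = cross_node_connections(total, ranks_per_node=128)
        status = "may exceed" if need > ports else "fits within"
        print(f"{total} ranks: up to {need} connections, {status} {ports} ports")
```

Open MPI generally opens TCP connections lazily, so the real counts are likely lower than this bound. The comparison only indicates whether ephemeral port exhaustion is worth checking on the worker pods.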
This looks like an execution issue. There are also subtleties related to Docker usage and port mapping. You might want to try host networking and make sure no processes are already running or sitting idle on the nodes. For ECS, it has been necessary to specifically map ports to avoid this sort of thing.
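If ports do need to be mapped explicitly, it helps to pin Open MPI's TCP ports to a known range first. The sketch below assembles an `mpirun` command using the Open MPI MCA parameters `btl_tcp_port_min_v4` and `btl_tcp_port_range_v4`. The port values (41000, 512) and the helper name are hypothetical, and these parameters govern the ports the TCP BTL listens on, so they may not change the outgoing `bind` that fails in the log above.

```python
import shlex


def mpirun_with_fixed_tcp_ports(ranks, port_min, port_count, program_args):
    """Build an mpirun argument list that pins the TCP BTL listening port range."""
    return [
        "mpirun", "-n", str(ranks),
        "--mca", "btl_tcp_port_min_v4", str(port_min),      # first port to try
        "--mca", "btl_tcp_port_range_v4", str(port_count),  # number of ports in the range
        *program_args,
    ]


# Hypothetical port range; the program arguments are the ones from the failing run.
cmd = mpirun_with_fixed_tcp_ports(
    384, 41000, 512, ["cylon_scaling.py", "-s", "w", "-n", "35000000"]
)
print(shlex.join(cmd))
```

The same range would then be the one to expose in the container or task port mappings.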
@mstaylor Can you please elaborate on "For ECS, it has been necessary to specifically map ports to avoid this sort of thing"?
It would be great if you have an example of how to do so. Thanks.
@mstaylor, a gentle reminder about the comment above.
@AymenFJA - your issue is here: [cylon-join-workerl address (192.168.99.12:0) failed: Address already in use (98)
For my research experiments, I use UCX/UCC/Redis, which is a bit different. For OpenMPI, you might consider using the following approach: https://github.com/everpeace/kube-openmpi. If you switch to ECS, you can generate a task that includes port mapping. Here's an example from my ECS task mapping:
"family": "cylon-ucc-ucx-redis-ec2-4_26_9100000-8Node-task",
"containerDefinitions": [
  {
    "name": "redisUCSUCX",
    "image": "448324707516.dkr.ecr.us-east-1.amazonaws.com/cylon-ucc-ucx-redis:latest",
    "cpu": 4096,
    "memory": 26624,
    "portMappings": [
      { "name": "redisucsucx-18-tcp", "containerPort": 18, "hostPort": 18, "protocol": "tcp" },
      { "name": "redisucsucx-41768-tcp", "containerPort": 41768, "hostPort": 41768, "protocol": "tcp" },
      ...
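Writing one `portMappings` entry per port by hand gets tedious for a range of ports. A minimal sketch of generating them follows, using only the fields that appear in the task definition above (`name`, `containerPort`, `hostPort`, `protocol`). The helper name is hypothetical, and the naming pattern `<prefix>-<port>-<protocol>` is inferred from the two entries shown.

```python
import json


def port_mappings(prefix, ports, protocol="tcp"):
    """Return ECS-style portMappings entries that map each port to itself."""
    return [
        {
            "name": f"{prefix}-{port}-{protocol}",
            "containerPort": port,
            "hostPort": port,
            "protocol": protocol,
        }
        for port in ports
    ]


# The two ports shown in the example task definition.
mappings = port_mappings("redisucsucx", [18, 41768])
print(json.dumps({"portMappings": mappings}, indent=2))
```

Passing a `range(...)` instead of a list produces one entry per port in a contiguous block.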
The issue is that you are running on pods with addresses already in use (hence the error logged). What does your hosts file look like?
@AymenFJA - did you use our docker image or build an image based on updates in main?
Thanks, @mstaylor, for your response. Can we have a 1-1 meeting to discuss it? It would be great to do that. If you agree, I can ping you on Slack and take it from there.
@AymenFJA - that sounds great.
@mstaylor I pinged you on slack/cylondata.