
UCT/TCP: Use SIOCGIFCONF ioctl when /sys/class/net is missing.

Open civodul opened this issue 5 years ago • 26 comments

What

This change provides alternative code that uses the SIOCGIFCONF ioctl to get the names of the available TCP network interfaces.

Why?

In some cases, such as isolated build environments (as found in GNU Guix), containers, or non-Linux systems, /sys is missing.

How?

By using the old, portable SIOCGIFCONF ioctl.

It may be that SIOCGIFCONF can in fact replace the /sys-based code entirely, since the information returned should be the same. WDYT?

civodul avatar Nov 18 '19 08:11 civodul

Can one of the admins verify this patch?

swx-jenkins3 avatar Nov 18 '19 08:11 swx-jenkins3

Hi @dmitrygx,

Thanks for your feedback. I've amended the patch following your suggestions, except one:

pls, use ucs_netif_ioctl instead

AFAICS, ucs_netif_ioctl is not applicable here because if_name would be NULL. However, I've changed this bit to use ucs_socket_create instead of socket.

Let me know what you think!

civodul avatar Nov 18 '19 10:11 civodul

AFAICS, ucs_netif_ioctl is not applicable here because if_name would be NULL. However, I've changed this bit to use ucs_socket_create instead of socket.

@civodul yes, you're right

dmitrygx avatar Nov 18 '19 10:11 dmitrygx

ok to test

dmitrygx avatar Nov 18 '19 18:11 dmitrygx

Mellanox CI: FAILED on 25 of 25 workers (click for details)

Note: the logs will be deleted after 25-Nov-2019

Agent/Stage Status
_main :x: FAILURE
hpc-arm-cavium-jenkins_W0 :x: FAILURE
hpc-arm-cavium-jenkins_W1 :x: FAILURE
hpc-arm-cavium-jenkins_W2 :x: FAILURE
hpc-arm-cavium-jenkins_W3 :x: FAILURE
hpc-arm-hwi-jenkins_W0 :x: FAILURE
hpc-arm-hwi-jenkins_W1 :x: FAILURE
hpc-arm-hwi-jenkins_W2 :x: FAILURE
hpc-arm-hwi-jenkins_W3 :x: FAILURE
hpc-test-node-gpu_W0 :x: FAILURE
hpc-test-node-gpu_W1 :x: FAILURE
hpc-test-node-gpu_W2 :x: FAILURE
hpc-test-node-gpu_W3 :x: FAILURE
hpc-test-node-legacy_W0 :x: FAILURE
hpc-test-node-legacy_W1 :x: FAILURE
hpc-test-node-legacy_W2 :x: FAILURE
hpc-test-node-legacy_W3 :x: FAILURE
hpc-test-node-new_W0 :x: FAILURE
hpc-test-node-new_W1 :x: FAILURE
hpc-test-node-new_W2 :x: FAILURE
hpc-test-node-new_W3 :x: FAILURE
r-vmb-ppc-jenkins_W0 :x: FAILURE
r-vmb-ppc-jenkins_W1 :x: FAILURE
r-vmb-ppc-jenkins_W2 :x: FAILURE
r-vmb-ppc-jenkins_W3 :x: FAILURE

mellanox-github avatar Nov 18 '19 22:11 mellanox-github

Hi @dmitrygx,

Not sure I understand what the build failures are about. Let me know if you need anything else from me.

civodul avatar Nov 19 '19 10:11 civodul

Hi @dmitrygx,

Not sure I understand what the build failures are about. Let me know if you need anything else from me.

/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:646:5: error: passing argument 2 of ‘ucs_malloc’ makes pointer from integer without a cast [-Werror]
     conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");
     ^
In file included from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/sys/sys.h:19:0,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/base/uct_iface.h:21,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp.h:10,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:11:
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/debug/memtrack.h:102:7: note: expected ‘const char *’ but argument is of type ‘int’
 void *ucs_malloc(size_t size, const char *name);
       ^
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:646:5: error: too many arguments to function ‘ucs_malloc’
     conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");
     ^
In file included from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/sys/sys.h:19:0,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/base/uct_iface.h:21,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp.h:10,
                 from /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/uct/tcp/tcp_iface.c:11:
/scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-cavium-jenkins/worker/0/contrib/../src/ucs/debug/memtrack.h:102:7: note: declared here
 void *ucs_malloc(size_t size, const char *name);

the following has to be changed from

conf.ifc_req = ucs_malloc(1, conf.ifc_len, "ifreq");

to

conf.ifc_req = ucs_calloc(1, conf.ifc_len, "ifreq");

dmitrygx avatar Nov 19 '19 10:11 dmitrygx

Indeed... Done, thanks.

civodul avatar Nov 19 '19 19:11 civodul

Mellanox CI: FAILED on 4 of 25 workers (click for details)

Note: the logs will be deleted after 26-Nov-2019

Agent/Stage Status
_main :x: FAILURE
hpc-test-node-legacy_W0 :x: FAILURE
hpc-test-node-legacy_W2 :x: FAILURE
hpc-test-node-legacy_W3 :x: FAILURE
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W3 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 19 '19 23:11 mellanox-github

Hello! The messages I received from Mellanox's CI system show that the 3 test failures are about:

Fatal: transport error: Endpoint timeout

It's unclear to me how this could relate to this patch. Thoughts?

civodul avatar Nov 20 '19 13:11 civodul

unrelated

bot:mlx:retest

brminich avatar Nov 20 '19 14:11 brminich

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 27-Nov-2019

Agent/Stage Status
_main :x: FAILURE
hpc-test-node-legacy_W2 :x: FAILURE
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W3 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W0 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W3 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 20 '19 19:11 mellanox-github

bot:mlx:retest

dmitrygx avatar Nov 21 '19 06:11 dmitrygx

Mellanox CI: FAILED on 3 of 25 workers (click for details)

Note: the logs will be deleted after 28-Nov-2019

Agent/Stage Status
_main :x: FAILURE
hpc-test-node-legacy_W0 :x: FAILURE
hpc-test-node-legacy_W3 :x: FAILURE
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W3 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 21 '19 10:11 mellanox-github

infra issues bot:mlx:retest

dmitrygx avatar Nov 22 '19 07:11 dmitrygx

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 29-Nov-2019

Agent/Stage Status
_main :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W3 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W0 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W2 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W3 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 22 '19 09:11 mellanox-github

Mellanox CI: UNKNOWN on 17 workers (click for details)

Note: the logs will be deleted after 29-Nov-2019

Agent/Stage Status
_main :question: ABORTED
r-vmb-ppc-jenkins_W0 :question: ABORTED
r-vmb-ppc-jenkins_W3 :question: ABORTED
hpc-arm-cavium-jenkins_W0 :question: UNKNOWN
hpc-arm-cavium-jenkins_W1 :question: UNKNOWN
hpc-arm-cavium-jenkins_W2 :question: UNKNOWN
hpc-arm-cavium-jenkins_W3 :question: UNKNOWN
hpc-test-node-gpu_W0 :question: UNKNOWN
hpc-test-node-gpu_W1 :question: UNKNOWN
hpc-test-node-gpu_W2 :question: UNKNOWN
hpc-test-node-gpu_W3 :question: UNKNOWN
hpc-test-node-new_W0 :question: UNKNOWN
hpc-test-node-new_W1 :question: UNKNOWN
hpc-test-node-new_W2 :question: UNKNOWN
hpc-test-node-new_W3 :question: UNKNOWN
r-vmb-ppc-jenkins_W1 :question: UNKNOWN
r-vmb-ppc-jenkins_W2 :question: UNKNOWN

mellanox-github avatar Nov 22 '19 11:11 mellanox-github

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 29-Nov-2019

Agent/Stage Status
_main :x: FAILURE
hpc-test-node-gpu_W3 :x: FAILURE
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W0 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W2 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W3 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 22 '19 14:11 mellanox-github

Mellanox CI: FAILED on 2 of 25 workers (click for details)

[----------] 1 test from st/test_profile_perf
[ RUN      ] st/test_profile_perf.overhead/0 <1>
[     INFO ] overhead: 51.7127 nsec
[     INFO ] overhead: 51.8367 nsec
[     INFO ] overhead: 51.7635 nsec
[     INFO ] overhead: 51.6434 nsec
[     INFO ] overhead: 51.8108 nsec
[     INFO ] overhead: 51.6247 nsec
[     INFO ] overhead: 51.668 nsec
[     INFO ] overhead: 51.7209 nsec
[     INFO ] overhead: 51.6302 nsec
[     INFO ] overhead: 51.7486 nsec
[     INFO ] overhead: 51.7306 nsec
/scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node-gpu/worker/3/contrib/../test/gtest/ucs/test_profile.cc:393: Failure
Expected: (overhead_nsec) < (EXP_OVERHEAD_NSEC), actual: 51.7306 vs 50
Profiling overhead is too high
[  FAILED  ] st/test_profile_perf.overhead/0, where GetParam() = 1 (28584 ms)
[----------] 1 test from st/test_profile_perf (28584 ms total)

bot:retest

dmitrygx avatar Nov 22 '19 14:11 dmitrygx

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 29-Nov-2019

Agent/Stage Status
_main :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-cavium-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W0 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W1 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W2 :heavy_check_mark: SUCCESS
hpc-arm-hwi-jenkins_W3 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W0 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W1 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W2 :heavy_check_mark: SUCCESS
hpc-test-node-gpu_W3 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W0 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W1 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W2 :heavy_check_mark: SUCCESS
hpc-test-node-legacy_W3 :heavy_check_mark: SUCCESS
hpc-test-node-new_W0 :heavy_check_mark: SUCCESS
hpc-test-node-new_W1 :heavy_check_mark: SUCCESS
hpc-test-node-new_W2 :heavy_check_mark: SUCCESS
hpc-test-node-new_W3 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W0 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W1 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W2 :heavy_check_mark: SUCCESS
r-vmb-ppc-jenkins_W3 :heavy_check_mark: SUCCESS

mellanox-github avatar Nov 22 '19 21:11 mellanox-github

@civodul - I just wanted to check whether you have signed the CLA. What organization do you represent? Thanks!

shamisp avatar Nov 23 '19 16:11 shamisp

@shamisp I haven't signed the CLA; where can I find it? Thanks in advance!

civodul avatar Nov 24 '19 11:11 civodul

@shamisp I haven't signed the CLA; where can I find it? Thanks in advance!

See https://www.openucx.org/license, "Contributor License Agreement"

yosefe avatar Nov 24 '19 11:11 yosefe

Hi @yosefe,

As I understand it, this "Contributor License Agreement" equates to copyright assignment. The problem for me is that it fails to guarantee that my contributions will remain free software: UCX is currently distributed under the 3-clause BSD license, which is fine by me, but nothing in the CLA says that the "Copyright Holders" (capital letters) are committed to keeping code under that license. Is this correct?

Thanks, Ludo'.

civodul avatar Nov 29 '19 13:11 civodul

@civodul Response from our legal counsel: The UCF Contributor License Agreement requires that Contributors license (not transfer) their copyrights to Copyright Holders and recipients of distributed software in order to allow UCF to freely license the specification contributions made by its Contributors, subject to the scope of the definition of Contribution in the agreement.

If you have any additional questions, I can put you in touch with our legal counsel. Thanks!

shamisp avatar Dec 03 '19 17:12 shamisp

adding WIP until CLA is signed

yosefe avatar Dec 12 '19 17:12 yosefe