Timeout when running homa_client_tput experiment on a 2-node CloudLab setup

Open KunalDaga opened this issue 8 months ago • 1 comments

Hello, I'm trying to run cp_basic on a CloudLab instance. It gets through the first few experiments but always gets stuck at the homa_client_tput experiment, where it exceeds the 40s timeout for communicating with the client. The earlier experiments seem to run without any issues:

kdaga7@node0:~/HomaModule/util$ ./cp_basic -n 2
Starting servers for homa_vs_tcp experiment on nodes range(1, 2)
Starting clients for homa_latency experiment on nodes range(0, 1)
Starting measurements
Freezing timetraces via node0
Retrieving data for homa_latency experiment
Starting clients for homa_1msg_tput experiment on nodes range(0, 1)
Starting measurements
Freezing timetraces via node0
Retrieving data for homa_1msg_tput experiment
Starting clients for homa_client_rpc_tput experiment on nodes range(0, 1)
Starting measurements
Freezing timetraces via node0
Retrieving data for homa_client_rpc_tput experiment
Starting clients for homa_client_tput experiment on nodes range(0, 1)
Traceback (most recent call last):
  File "/users/kdaga7/HomaModule/util/./cp_basic", line 96, in <module>
    run_experiment("homa_client_tput", range(0, 1), o)
  File "/users/kdaga7/HomaModule/util/cperf.py", line 710, in run_experiment
    wait_output("% ", nodes, command, 40.0)
  File "/users/kdaga7/HomaModule/util/cperf.py", line 413, in wait_output
    raise Exception("timeout (%.1fs) exceeded for command '%s' on node%d"
Exception: timeout (40.0s) exceeded for command 'client --ports 9 --port-receivers 1 --server-ports 6 --workload 500000 --servers 0,1 --gbps 0.000 --client-max 50 --protocol homa --id 0 --exp homa_client_tput ' on node0

I've attached reports/cperf.log, but I can't seem to find node0.log. Let me know if there are any other files I should provide.

cperf.log

Thanks in advance.

Apr 20 '25 22:04 KunalDaga

I will need to see complete logs in order to debug this. There should be a log directory created by the experiment (you can specify one with the --log-dir option, or else a default one will be chosen). It should contain a bunch of files, including node0.log etc., plus a subdirectory "reports" that should contain the cperf.log file. Zip up the entire log directory and attach it here.

Also, be sure to specify the exact command line you typed, and which cluster you are running on.

What kernel image are you using? Is it one of mine? Please send the results of the "uname -a" command.

What commit of Homa are you using?

Hopefully this will give me more information to track down the problem you are having.
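
For anyone gathering the same information, here is a rough sketch of a helper that bundles it in one step. It is not part of HomaModule; the function names, the system_info.txt file name, and all paths are hypothetical. It assumes the Homa checkout is a git repository and that `uname` and `git` are on the PATH, and it records "unavailable" where a command cannot be run.

```python
import shutil
import subprocess
from pathlib import Path


def run(cmd, cwd=None):
    """Return the command's stdout, or a placeholder if it cannot be run."""
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unavailable"


def collect_diagnostics(log_dir, repo_dir, out_base):
    """Zip an experiment log directory together with system details.

    log_dir  -- the experiment's log directory (hypothetical path)
    repo_dir -- the Homa source checkout (hypothetical path)
    out_base -- archive path without the .zip suffix
    Returns the path of the archive that was written.
    """
    log_dir = Path(log_dir).resolve()
    info = [
        "uname -a: " + run(["uname", "-a"]),
        "homa commit: " + run(["git", "rev-parse", "HEAD"], cwd=repo_dir),
    ]
    # Store the system details next to the logs so they travel together.
    (log_dir / "system_info.txt").write_text("\n".join(info) + "\n")
    # Archive the whole directory, keeping its name as the top-level folder.
    return shutil.make_archive(str(out_base), "zip",
                               root_dir=log_dir.parent,
                               base_dir=log_dir.name)
```

A call such as `collect_diagnostics("path/to/log_dir", "path/to/HomaModule", "homa_logs")` would write homa_logs.zip to the current directory. The exact command line typed and the cluster name still need to be stated in the comment itself.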

Apr 21 '25 05:04 johnousterhout