
[FedScale Core] Fails to run if it's not in simulation mode.


What happened + What you expected to happen

If I set experiment_mode to a value other than "simulation" (for example, "standalone"), FedScale fails to run. The femnist_cluster.yml is:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
ps_ip: 192.168.124.102

# ip address of each worker : number of available GPU processes on that node
# Note that if we collocate the ps and a worker on the same GPU, we need to decrease the number of available processes on that GPU by 1
# E.g., if the master node has 4 available processes, then 1 goes to the ps and the worker should be set to: worker:3
worker_ips:
    - 192.168.124.104:[1]
    - 192.168.124.105:[1]
    - 192.168.124.106:[1]

exp_path: $FEDSCALE_HOME/fedscale/cloud

# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py

aggregator_entry: aggregation/aggregator.py

auth:
    ssh_user: "whr"
    ssh_private_key: ~/.ssh/id_rsa

# commands to run before we can actually run FAR (in order)
setup_commands:
    - source /usr/local/miniconda3/bin/activate fedscale

# ========== Additional job configuration ========== 
# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found

job_conf: 
    - job_name: femnist_cluster                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 2                 # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: resnet18             # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
#    - model_zoo: fedscale-torch-zoo
    - eval_interval: 10                     # How many rounds to run a testing on the testing set
    - rounds: 1000                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - local_steps: 5
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - use_cuda: True
    - save_checkpoint: False
    
    - experiment_mode: standalone
    - overcommitment: 1.0
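
For reference, here is how I understand the driver turns this file into the argparse Namespace that shows up in the aggregator log below: each `- key: value` entry under job_conf seems to become a `--key value` pair for config_parser.py. This is only a minimal sketch with hypothetical helper names, not FedScale's actual driver code:

```python
# Minimal sketch (hypothetical, not FedScale's driver.py): flatten the
# job_conf list of single-key dicts into CLI-style arguments.
import yaml

def job_conf_to_args(yaml_path):
    """Turn each `- key: value` entry in job_conf into a `--key value` pair."""
    with open(yaml_path) as f:
        conf = yaml.safe_load(f)
    args = []
    for entry in conf["job_conf"]:          # e.g. {"experiment_mode": "standalone"}
        for key, value in entry.items():
            args += [f"--{key}", str(value)]
    return args

# job_conf_to_args("femnist_cluster.yml") would include
# ["--experiment_mode", "standalone", ...], which matches
# experiment_mode='standalone' in the Namespace dump in the log below.
```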

The log is:

2023-12-27 14:39:19.056225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:19.152964: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:19.480238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480272: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:19 INFO     [aggregator.py:44] Job args Namespace(adam_epsilon=1e-08, backbone='./resnet50.pth', backend='gloo', batch_size=20, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cfg_file='./utils/rcnn/cfgs/res101.yml', clf_block_size=32, clip_bound=0.9, clip_threshold=3.0, clock_factor=2.4368231046931412, conf_path='~/dataset/', connection_timeout=60, cuda_device=None, cut_off_util=0.05, data_cache='', data_dir='/home/whr/code/FedScale/benchmark/dataset/data/femnist', data_map_file='/home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv', data_set='femnist', decay_factor=0.98, decay_round=10, device_avail_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_behave_trace', device_conf_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_device_capacity', dump_epoch=10000000000.0, embedding_file='glove.840B.300d.txt', engine='pytorch', epsilon=0.9, eval_interval=10, executor_configs='192.168.124.104:[1]=192.168.124.105:[1]=192.168.124.106:[1]', experiment_mode='standalone', exploration_alpha=0.3, exploration_decay=0.98, exploration_factor=0.9, exploration_min=0.3, filter_less=21, filter_more=1000000000000000.0, finetune=False, gamma=0.9, gradient_policy=None, hidden_layers=7, hidden_size=256, input_dim=0, input_shape=[1, 3, 28, 28], job_name='femnist_cluster', labels_path='labels.json', learning_rate=0.05, line_by_line=False, local_steps=5, log_path='/home/whr/code/FedScale/benchmark', loss_decay=0.2, malicious_factor=1000000000000000.0, max_concurrency=10, max_staleness=5, memory_capacity=2000, min_learning_rate=5e-05, mlm=False, mlm_probability=0.15, model='resnet18', model_size=65536, model_zoo='torchcv', n_actions=2, n_states=4, noise_dir=None, noise_factor=0.1, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=62, num_classes=35, num_executors=3, num_loaders=2, num_participants=3, output_dim=0, overcommitment=1.0, overwrite_cache=False, pacer_delta=5, pacer_step=20, proxy_mu=0.1, ps_ip='192.168.124.102', ps_port='29500', qfed_q=1.0, rnn_type='lstm', round_penalty=2.0, round_threshold=30, rounds=1000, sample_mode='random', sample_rate=16000, sample_seed=233, sample_window=5.0, save_checkpoint=True, spec_augment=False, speed_volume_perturb=False, target_delta=0.0001, target_replace_iter=15, task='cv', test_bsz=20, test_manifest='data/test_manifest.csv', test_output_dir='./logs/server', test_ratio=1.0, test_size_file='', this_rank=0, time_stamp='1227_143917', train_manifest='data/train_manifest.csv', train_size_file='', train_uniform=False, use_cuda=True, vocab_tag_size=500, vocab_token_size=10000, wandb_token='', weight_decay=0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.9, yogi_beta2=0.99, yogi_eta=0.003, yogi_tau=1e-08)
(12-27) 14:39:20 INFO     [aggregator.py:164] Initiating control plane communication ...
(12-27) 14:39:20 INFO     [aggregator.py:188] %%%%%%%%%% Opening aggregator server using port [::]:29500 %%%%%%%%%%
(12-27) 14:39:20 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:20 INFO     [aggregator.py:967] Start monitoring events ...
2023-12-27 14:39:31.090474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:31.169358: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:31.478808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478838: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:31 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:31 INFO     [executor.py:77] (EXECUTOR:1) is setting up environ ...
(12-27) 14:39:32 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:32 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:32 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:32 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:32 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:32 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:32 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:32 INFO     [aggregator.py:318] Received executor 1 information, 1/3
(12-27) 14:39:32 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:32 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 2799, 'total_num_samples': 637858}
2023-12-27 14:39:33.925569: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:34.012208: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:34.334770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:34 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:34 INFO     [executor.py:77] (EXECUTOR:2) is setting up environ ...
2023-12-27 14:39:35.087146: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:35.167337: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(12-27) 14:39:35 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:35 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:35 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
2023-12-27 14:39:35.479452: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:35 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:35 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:35 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:35 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:35 INFO     [aggregator.py:318] Received executor 2 information, 2/3
(12-27) 14:39:35 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:35 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 5598, 'total_num_samples': 1275716}
(12-27) 14:39:35 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:35 INFO     [executor.py:77] (EXECUTOR:3) is setting up environ ...
(12-27) 14:39:36 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:36 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:36 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:36 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:36 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:36 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:36 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:36 INFO     [aggregator.py:318] Received executor 3 information, 3/3
(12-27) 14:39:36 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:36 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 8397, 'total_num_samples': 1913574}
(12-27) 14:39:36 INFO     [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO     [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO     [aggregator.py:605] Selected participants to run: []

Apparently, it selects no participants to run, and the program gets stuck here.
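
My best guess at what is going on (a hedged sketch with hypothetical names, not the actual client_manager.py code): participant selection appears to be driven by the client_behave_trace, and only clients whose trace marks them online at the current wall clock are eligible. In simulation mode the virtual clock walks through the trace, but with experiment_mode: standalone the pool of online clients apparently stays empty at wall clock 0, so the selected-participants list is [] and round 1 never starts:

```python
# Hypothetical illustration (not FedScale's code) of a trace-driven
# availability check that would yield
# "0 clients online, 8397 clients offline ... Selected participants to run: []".
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TracedClient:
    client_id: int
    online_intervals: List[Tuple[int, int]]  # (start, end) seconds from the behavior trace

    def is_online(self, wall_clock: int) -> bool:
        return any(start <= wall_clock < end for start, end in self.online_intervals)

def select_participants(clients: List[TracedClient], wall_clock: int,
                        experiment_mode: str, num_participants: int) -> List[int]:
    if experiment_mode == "simulation":
        # simulation: the virtual clock advances through the trace, so clients
        # eventually become online and can be sampled
        online = [c.client_id for c in clients if c.is_online(wall_clock)]
    else:
        # my suspicion: non-simulation modes never mark trace-driven clients
        # online, so the pool is empty at wall clock 0 and stays empty
        online = []
    return online[:num_participants]
```

If that reading is right, the non-simulation path would either need to treat registered executors/clients as always available, or the documentation should state that the device traces only work in simulation mode.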

Versions / Dependencies

FedScale: 7ec441c2afa99510535adebe155d89fa8bb2c637
Python: 3.7.16
OS: Ubuntu 20.04

Reproduction script

I put the aforementioned yml under $WORKDIR, so the start command is `python $WORKDIR/docker/driver.py submit $WORKDIR/femnist_cluster.yml`.

Issue Severity

None
