
Distributed training can't load MNIST or CIFAR data

Open jplnasa5 opened this issue 4 years ago • 1 comments

When I run

sh run_fedavg_distributed_pytorch.sh 2 1 resnet56 homo 100 20 64 0.001 mnist "./../../../data/mnist" adam 0

it shows:

0 2

client1

3.8.8 /usr/bin/python

client1

None False lada data 11111111111 read_data.....
train_files: ['all_data_0_niid_0_keep_10_train_9.json']
file_path: ./../../../data/MNIST/train/all_data_0_niid_0_keep_10_train_9.json
before cdata
INFO:root:Namespace(backend='MPI', batch_size=64, ci=0, client_num_in_total=2, client_num_per_round=1, client_optimizer='adam', comm_round=100, data_dir='./../../../data/mnist', dataset='mnist', epochs=20, frequency_of_the_test=1, gpu_mapping_file='None', gpu_mapping_key='None', gpu_num_per_server=4, gpu_server_num=1, grpc_ipconfig_path='grpc_ipconfig.csv', is_mobile=1, lr=0.001, model='resnet56', partition_alpha=0.5, partition_method='homo', wd=0.0001)
INFO:root:#############process ID = 0, host name = client1########, process ID = 114399, process Name = psutil.Process(pid=114399, name='FedAvg (distributed):0', status='running', started='15:32:46')
INFO:root:process_id = 0, size = 1
INFO:root: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
INFO:root: ################## You do not indicate gpu_util_file, will use CPU training #################
INFO:root:cpu
INFO:root:load_data. dataset_name = mnist

mpirun was unable to find the specified executable file, and therefore did not launch the job. This error was first reported for process rank 1; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command line parameter option (remember that mpirun interprets the first unrecognized command line token as the executable).

Node: client2
Executable: python
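The mpirun message above says the executable `python` could not be found on node client2. One way to check this is to run a small script on each node and see what `python` resolves to there. The following is a minimal sketch, not part of FedML; the helper name `describe_executable` is hypothetical, and it assumes some Python 3 interpreter (for example `python3`) can be started on the node by its own name or full path.

```python
import shutil
import socket
import sys


def describe_executable(name="python"):
    """Return (hostname, resolved path or None) for `name` on this machine's PATH.

    Hypothetical helper for illustration; it only inspects the local PATH.
    """
    return socket.gethostname(), shutil.which(name)


if __name__ == "__main__":
    host, path = describe_executable("python")
    if path is None:
        # The script itself is running under sys.executable, so report that too.
        print(f"{host}: 'python' is not on PATH (this script runs under {sys.executable})")
    else:
        print(f"{host}: 'python' resolves to {path}")
```

If the script reports that `python` is not on PATH on one node while another node resolves it (the log above shows /usr/bin/python on client1), the nodes' environments differ, which would be consistent with the mpirun error.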

Training hangs at cdata = json.load(inf) in data_loader.py.

If I switch to the cifar10 dataset, it hangs at cifar10_train_ds = CIFAR10_truncated(datadir, train=True, download=True, transform=train_transform) in data_loader.py.
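To tell a slow or broken data file apart from a hang caused by something else, the JSON file can be loaded on its own, outside the MPI job. The following is a minimal sketch under that assumption; `timed_json_load` is a hypothetical helper, not a FedML function, and the file path is taken from the log above.

```python
import json
import sys
import time


def timed_json_load(file_path):
    """Load a JSON file and return (data, elapsed_seconds).

    Hypothetical helper: mirrors the json.load(inf) call where the run stalls,
    but runs in a single process without MPI.
    """
    start = time.perf_counter()
    with open(file_path, "r") as inf:
        cdata = json.load(inf)
    return cdata, time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_json.py ./../../../data/MNIST/train/all_data_0_niid_0_keep_10_train_9.json
    data, seconds = timed_json_load(sys.argv[1])
    keys = list(data.keys()) if isinstance(data, dict) else type(data).__name__
    print(f"loaded in {seconds:.2f}s, top level: {keys}")
```

If the standalone load finishes normally, the file itself is probably not the problem, and the stall is more likely related to the launch failure reported by mpirun. This is an inference to verify, not a confirmed diagnosis.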

Can you help me figure out the reason? Thanks!

jplnasa5, Sep 29 '21 07:09

Can you train your model with a GPU? I received the same warning: "You do not indicate gpu_util_file, will use CPU training".

v-thaian, Dec 16 '21 15:12

@jplnasa5 @v-thaian Please check our latest examples at: https://github.com/FedML-AI/FedML/tree/master/python/examples

We've upgraded the library substantially in recent versions. Here is a brief introduction: https://medium.com/@FedML/fedml-ai-platform-releases-the-worlds-federated-learning-open-platform-on-public-cloud-with-an-8024e68a70b6

chaoyanghe, Aug 17 '22 00:08