easyFL
Fails to run with torch.multiprocessing
With the argument num_threads > 1, I got this error:
AttributeError: 'Server' object has no attribute 'delayed_communicate_with'
Can someone help me? Thank you very much!
Sorry, this is a bug that appears when torch.multiprocessing and Python decorators are used together: the 'spawn' start mode is incompatible with the decorators. While searching for a proper solution to this bug, I solved it in my new repo 'FLGo' by re-implementing the functionality of 'delayed_communicate_with' in another decorator that is not called in the subprocess. A quick workaround is to comment out the decorators on 'fedbase.BasicServer.communicate' and 'fedbase.BasicClient.train' if there is no need to simulate system heterogeneity. I will port the same change from FLGo to this repo and fix this bug as soon as possible. Thanks for your issue.
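For what it's worth, the mechanism can be illustrated with a stdlib-only sketch. The `with_latency` decorator and `Server` class below are simplified stand-ins, not the actual easyFL source: multiprocessing's `ForkingPickler` (which torch.multiprocessing builds on) reduces a bound method to `getattr(obj, method.__func__.__name__)`, so a method attached at runtime through a closure-based decorator has a `__name__` that no longer resolves on the class, and the child process fails with exactly this kind of `AttributeError`.

```python
import pickle
from multiprocessing.reduction import ForkingPickler

# Hypothetical reconstruction of the decorator pattern (not easyFL's code):
# a closure-based wrapper attached to the class at runtime.
def with_latency(fn):
    def delayed(*args, **kwargs):  # nested function, only exists at runtime
        return fn(*args, **kwargs)
    return delayed

class Server:
    def communicate_with(self, client_id):
        return client_id

# Attach the wrapped method dynamically, as decorator machinery does.
Server.delayed_communicate_with = with_latency(Server.communicate_with)

s = Server()
assert s.delayed_communicate_with(3) == 3  # fine in the parent process

# 'spawn' ships objects to the child by pickling them. The bound method
# reduces to getattr(server, 'delayed'), and 'delayed' is not an attribute
# of Server, so the round trip fails (typically with an AttributeError,
# mirroring the error reported above).
try:
    payload = ForkingPickler.dumps(s.delayed_communicate_with)
    pickle.loads(payload)
    recovered = True
except Exception as exc:
    recovered = False
    print("pickle round trip failed:", type(exc).__name__)

assert recovered is False
```

This is why commenting out the runtime-attached decorators (or moving their logic to a decorator never invoked in the subprocess, as done in FLGo) avoids the crash.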
Thank you very much for your reply. I've tried your method, plus a second one that comments out all the decorators on fedbase.BasicServer.communicate, fedbase.BasicClient.train, and fedbase.BasicServer.communicate_with, but it still does not work.
- The first method got the same error: 'Server' object has no attribute 'delayed_communicate_with'
- The second got: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
I am going to switch to your new repo FLGo. Thank you again for your projects, they've helped me a lot.
I've commented out the decorators on fedbase.BasicServer.communicate (i.e. @ss.with_dropout and @ss.with_clock), fedbase.BasicClient.train (i.e. only @ss.with_completeness), and fedbase.BasicServer.communicate_with (i.e. @ss.with_latency). After doing this, I ran main.py with --num_threads 6, and it seems to work well in my machine environment. I preserved the decorator @fmodule.with_completeness on fedbase.BasicClient.train. When I further increase num_threads, the error is CUDA OOM. Could you provide more details about the second bug? Thanks!
The running options are --gpu 0 --num_threads 6 --server_with_cpu --logger simple_logger
I've tried your method; it just finishes round 1 and gets stuck after that:
I cannot reproduce your bug, which is a little confusing. The same warning also appears on my machine, and it has no obvious impact on training. I wonder whether it is related to the GPU hardware. What happens if you run the same command on CPU?
I've got the same error with CPU. It cannot start training round 2.