Fail to run with torch.multiprocessing

Open · dnkhanh45 opened this issue 2 years ago • 9 comments

With the argument num_threads > 1, I got this error: AttributeError: 'Server' object has no attribute 'delayed_communicate_with'. Can someone help me? Thank you very much!

dnkhanh45 avatar Jan 05 '23 04:01 dnkhanh45

Sorry, this is a bug that appears when torch.multiprocessing and Python decorators are used together: the 'spawn' start method is incompatible with the decorators. I have searched for a proper solution, and in my new repo 'FLGo' I solved it by implementing the functionality of 'delayed_communicate_with' in another decorator that is not called in the subprocess. A quick workaround is to comment out the decorators on 'fedbase.BasicServer.communicate' and 'fedbase.BasicClient.train' if you don't need to simulate system heterogeneity. I will port the same change from FLGo to this repo and fix this bug as soon as possible. Thanks for your issue.
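
To make the failure mode concrete, here is a minimal sketch with hypothetical names (not the actual easyFL internals): pickle, which 'spawn' relies on to ship bound methods to the workers, looks a bound method up again by its function's __name__, so a decorator that returns a wrapper stored under a different name breaks that lookup in the child process.

```python
# Minimal sketch with hypothetical names (not the actual easyFL code).
# pickle reduces a bound method to getattr(instance, function.__name__),
# so a wrapper whose __name__ differs from the attribute it is stored
# under cannot be looked up again when the child process unpickles it.
import pickle


def with_latency(communicate_with):
    def delayed_communicate_with(self, *args, **kwargs):
        # ... simulated network latency would go here ...
        return communicate_with(self, *args, **kwargs)
    # stored on the class as 'communicate_with', but named 'delayed_communicate_with'
    return delayed_communicate_with


class Server:
    @with_latency
    def communicate_with(self, client_id):
        return f"package for client {client_id}"


if __name__ == "__main__":
    server = Server()
    try:
        # 'spawn' pickles the bound method to ship it to a worker;
        # plain pickle shows the same failure on the way back in.
        pickle.loads(pickle.dumps(server.communicate_with))
    except AttributeError as e:
        print(e)  # 'Server' object has no attribute 'delayed_communicate_with'
```

In this toy example, keeping the names in sync (e.g. with functools.wraps) avoids the lookup failure; FLGo instead moves the latency simulation into a decorator that is never invoked inside the subprocess.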

WwZzz avatar Jan 05 '23 05:01 WwZzz

Thank you very much for your reply. I've tried your method, and also a second one that comments out all the decorators on fedbase.BasicServer.communicate, fedbase.BasicClient.train, and fedbase.BasicServer.communicate_with, but it still does not work:

  • The first method got the same error: AttributeError: 'Server' object has no attribute 'delayed_communicate_with'
  • The second got: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

dnkhanh45 avatar Jan 05 '23 07:01 dnkhanh45

I am going to switch to your new repo FLGo. Thank you again for your projects; they've helped me a lot.

dnkhanh45 avatar Jan 05 '23 07:01 dnkhanh45

I've commented out the decorators on fedbase.BasicServer.communicate (i.e. @ss.with_dropout and @ss.with_clock), fedbase.BasicClient.train (i.e. only @ss.with_completeness), and fedbase.BasicServer.communicate_with (i.e. @ss.with_latency). After doing this, running main.py with --num_threads 6 seems to work well in my environment. I preserved the decorator @fmodule.with_completeness on fedbase.BasicClient.train. When I further increase num_threads, the error is CUDA OOM. Could you provide more details about the second bug? Thanks!
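
For the CUBLAS_STATUS_NOT_INITIALIZED error, one common cause in multi-process training (not necessarily the cause here) is that CUDA gets touched in the parent process before the workers are created with a start method other than 'spawn', or simply that the GPU runs out of memory. A minimal sketch of the spawn-safe pattern with torch.multiprocessing (illustrative only, not the easyFL code path):

```python
# Minimal sketch (illustrative only, not the easyFL code path): create the
# worker pool with the 'spawn' start method and initialize CUDA only inside
# the workers, never in the parent before the pool exists.
import torch
import torch.multiprocessing as mp


def local_train(client_id):
    # CUDA is first touched here, inside the spawned worker.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(10, 2).to(device)
    x = torch.randn(4, 10, device=device)
    return client_id, model(x).sum().item()


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # required for CUDA in subprocesses
    with ctx.Pool(processes=2) as pool:
        results = pool.map(local_train, range(4))
    print(results)
```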

WwZzz avatar Jan 05 '23 08:01 WwZzz

[screenshot of the run]

WwZzz avatar Jan 05 '23 08:01 WwZzz

The running option is --gpu 0 --num_threads 6 --server_with_cpu --logger simple_logger.

WwZzz avatar Jan 05 '23 08:01 WwZzz

I've tried your method, but it just finishes round 1 and gets stuck after that: [screenshot]

dnkhanh45 avatar Jan 06 '23 04:01 dnkhanh45

I cannot reproduce your bug, which is a little confusing. The same warning also appears on my machine, but it has no obvious impact on training. I wonder whether it is related to the GPU hardware. What happens if you run the same command on the CPU?
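
If it hangs again, a quick way to see where each process is blocked is Python's built-in faulthandler; a minimal sketch (not part of easyFL, just a debugging aid to paste near the top of main.py):

```python
# Debugging aid (not easyFL code): if the process is still alive after the
# timeout, dump the traceback of every thread to stderr and re-arm the timer.
# Since 'spawn' re-imports the main module, module-level code like this also
# runs in the worker processes.
import faulthandler

faulthandler.dump_traceback_later(timeout=600, repeat=True)
```

The dumped stacks usually make it clear whether the workers or the server process is the one that is blocked.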

WwZzz avatar Jan 07 '23 02:01 WwZzz

[screenshot] I've got the same error with the CPU: it cannot start training round 2.

dnkhanh45 avatar Jan 11 '23 03:01 dnkhanh45