Error in federated learning example
What happened: Following https://github.com/kubeedge/sedna/tree/main/examples/federated_learning/yolov5_coco128_mistnet, an error occurred while deploying the MistNet federated learning example:
# kubectl logs yolo-v5-train-cnq8h
[2021-11-18 06:26:58,662] aggregation.py(294) [INFO] - /home/data/pretrained/
[2021-11-18 06:26:58,666] aggregation.py(314) [INFO] - address 0.0.0.0, port 7363
[INFO][06:26:58]: Server: mistnet
[INFO][06:26:58]: [Server #7] Started training on 1 clients with 1 per round.
[INFO][06:26:58]: [Server #7] Configuring the server...
[INFO][06:26:58]: Training: 1 rounds or 99.0% accuracy
[INFO][06:26:58]: Trainer: yolov5
[INFO][06:26:59]: Generating new fontManager, this may take some time...
[INFO][06:27:02]:
from n params module arguments
[INFO][06:27:02]: 0 -1 1 3520 yolov5.models.common.Focus [3, 32, 3]
[INFO][06:27:02]: 1 -1 1 18560 yolov5.models.common.Conv [32, 64, 3, 2]
[INFO][06:27:02]: 2 -1 1 18816 yolov5.models.common.C3 [64, 64, 1]
[INFO][06:27:02]: 3 -1 1 73984 yolov5.models.common.Conv [64, 128, 3, 2]
[INFO][06:27:02]: 4 -1 1 156928 yolov5.models.common.C3 [128, 128, 3]
[INFO][06:27:02]: 5 -1 1 295424 yolov5.models.common.Conv [128, 256, 3, 2]
[INFO][06:27:02]: 6 -1 1 625152 yolov5.models.common.C3 [256, 256, 3]
[INFO][06:27:02]: 7 -1 1 1180672 yolov5.models.common.Conv [256, 512, 3, 2]
[INFO][06:27:02]: 8 -1 1 656896 yolov5.models.common.SPP [512, 512, [5, 9, 13]]
[INFO][06:27:02]: 9 -1 1 1182720 yolov5.models.common.C3 [512, 512, 1, False]
[INFO][06:27:02]: 10 -1 1 131584 yolov5.models.common.Conv [512, 256, 1, 1]
[INFO][06:27:02]: 11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
[INFO][06:27:02]: 12 [-1, 6] 1 0 yolov5.models.common.Concat [1]
[INFO][06:27:02]: 13 -1 1 361984 yolov5.models.common.C3 [512, 256, 1, False]
[INFO][06:27:02]: 14 -1 1 33024 yolov5.models.common.Conv [256, 128, 1, 1]
[INFO][06:27:02]: 15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
[INFO][06:27:02]: 16 [-1, 4] 1 0 yolov5.models.common.Concat [1]
[INFO][06:27:02]: 17 -1 1 90880 yolov5.models.common.C3 [256, 128, 1, False]
[INFO][06:27:02]: 18 -1 1 147712 yolov5.models.common.Conv [128, 128, 3, 2]
[INFO][06:27:02]: 19 [-1, 14] 1 0 yolov5.models.common.Concat [1]
[INFO][06:27:02]: 20 -1 1 296448 yolov5.models.common.C3 [256, 256, 1, False]
[INFO][06:27:02]: 21 -1 1 590336 yolov5.models.common.Conv [256, 256, 3, 2]
[INFO][06:27:02]: 22 [-1, 10] 1 0 yolov5.models.common.Concat [1]
[INFO][06:27:02]: 23 -1 1 1182720 yolov5.models.common.C3 [512, 512, 1, False]
[INFO][06:27:02]: 24 [17, 20, 23] 1 229245 yolov5.models.yolo.Detect [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[INFO][06:27:04]: Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.1 GFLOPs
[INFO][06:27:04]:
[INFO][06:27:04]: Algorithm: mistnet
[INFO][06:27:04]: [Server #7] Loading a pre-trained model.
[INFO][06:27:04]: [Server #7] Loading a model from ./models/pretrained/yolov5.pth.
Traceback (most recent call last):
File "aggregate.py", line 37, in <module>
run_server()
File "aggregate.py", line 33, in run_server
server.start()
File "/home/lib/sedna/service/server/aggregation.py", line 324, in start
self.server.run()
File "/home/plato/plato/servers/base.py", line 87, in run
self.configure()
File "/home/plato/plato/servers/fedavg.py", line 72, in configure
self.load_trainer()
File "/home/plato/plato/servers/mistnet.py", line 30, in load_trainer
self.trainer.load_model()
File "/home/plato/plato/trainers/basic.py", line 86, in load_model
self.model.load_state_dict(torch.load(model_path))
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 594, in load
with _open_file_like(f, 'rb') as opened_file:
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './models/pretrained/yolov5.pth'
I have created the /model and /pretrained directories on each node at the locations specified in the tutorial.
I found the relevant code at /home/plato/plato/config.py:123:
# Pretrained models
Config.params['model_dir'] = "./models/pretrained/"
Config.params['pretrained_model_dir'] = "./models/pretrained/"
I don't know why this happens. Do I need to change it to the correct path of the pre-trained model and repackage the image?
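For reference, ./models/pretrained/ in config.py is a relative path, so it is resolved against the working directory of the aggregator process inside the container, not against the host directories created on the node. Below is a minimal sketch of a pre-flight check one could add before starting the server; the PRETRAINED_MODEL_DIR environment variable here is a hypothetical override, not an existing Sedna or Plato setting:

import os

# Hypothetical override; not an existing Sedna/Plato setting.
model_dir = os.environ.get("PRETRAINED_MODEL_DIR", "./models/pretrained/")
model_path = os.path.abspath(os.path.join(model_dir, "yolov5.pth"))

# Relative paths resolve against the current working directory, which
# inside the container may not match the host mounts (/model, /pretrained).
if not os.path.isfile(model_path):
    raise FileNotFoundError(
        f"pre-trained model not found at {model_path}; "
        f"cwd is {os.getcwd()} -- check the pod's volume mounts"
    )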
The Docker image information:
kubeedge/sedna-example-federated-learning-mistnet-yolo-client v0.4.0 70fcd2fc71e2 2 months ago 4.95GB
kubeedge/sedna-example-federated-learning-mistnet-yolo-aggregator v0.4.0 fd0a0512f024 2 months ago 4.95GB
Environment:
Sedna Version
$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
# kubeedge/sedna-gm:v0.4.3
$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
# kubeedge/sedna-lc:v0.4.3
Kubernetes Version
$ kubectl version
# Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.12", GitCommit:"4bf2e32bb2b9fdeea19ff7cdc1fb51fb295ec407", GitTreeState:"clean", BuildDate:"2021-10-27T17:07:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
KubeEdge Version
$ cloudcore --version
# KubeEdge v1.8.2
$ edgecore --version
# KubeEdge v1.8.2
CloudSide Environment:
Hardware configuration
$ lscpu
# Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
Stepping: 7
CPU MHz: 2299.795
CPU max MHz: 2500.0000
CPU min MHz: 1200.0000
BogoMIPS: 3999.64
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
OS
$ cat /etc/os-release
# NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
Kernel
$ uname -a
# Linux node01 5.4.0-84-generic #94~18.04.1-Ubuntu SMP Thu Aug 26 23:17:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
PTAL @jaypume @XinYao1994
@skrlin @JoeyHwong-gk It is very hard to understand why your image was produced 2 months ago. Did you make sure that you have successfully updated the image?
I think the reason is that @skrlin used v0.4.0, which has a bug. I suggest you try the latest version (i.e. v0.4.3).
@XinYao1994 can you help update the version in the federated learning example yaml?
@JoeyHwong-gk I didn't update the image; I just pulled the v0.4.0 image from the repository according to the tutorial.
@llhuii @skrlin @jaypume We plan to add a tutorial demo soon. Hope that can help. :) The federated learning example yaml will be updated before we release that demo.
@llhuii OK, thank you very much for your answer
@XinYao1994 OK, thank you very much for your answer
I also ran into this problem, using v0.4.3. The log is as follows:
[INFO][02:29:18]: New cache created: data/COCO/coco128/labels/train2017.cache
[INFO][02:29:18]: No clients are launched (server:disable_clients = true)
[INFO][02:29:18]: Starting a server at address 0.0.0.0 and port 7363.
[INFO][02:29:32]: 192.168.0.71 [23/Nov/2021:02:29:32 +0000] "GET /socket.io/?transport=polling&EIO=4&t=1637634572.613998 HTTP/1.1" 200 292 "-" "Python/3.6 aiohttp/3.8.0"
[INFO][02:29:32]: 192.168.0.71 [23/Nov/2021:02:29:32 +0000] "GET /socket.io/?transport=polling&EIO=4&t=1637634572.612923 HTTP/1.1" 200 292 "-" "Python/3.6 aiohttp/3.8.0"
[INFO][02:29:32]: [Server #6] A new client just connected.
[INFO][02:29:32]: [Server #6] A new client just connected.
[INFO][02:29:32]: [Server #6] New client with id #2 arrived.
[INFO][02:29:32]: [Server #6] Starting training.
[INFO][02:29:32]:
[Server #6] Starting round 1/1.
[INFO][02:29:32]: [Server #6] Selecting client #2 for training.
[INFO][02:29:32]: [Server #6] Sending the current model to client #2.
[INFO][02:29:32]: [Server #6] New client with id #1 arrived.
[INFO][02:29:37]: [Server #6] Sent 27.96 MB of payload data to client #2.
[INFO][02:31:31]: [Server #6] Received 400.11 MB of payload data from client #2.
[INFO][02:31:31]: [Server #6] All 1 client reports received. Processing.
[ERROR][02:31:31]: Task exception was never retrieved
future: <Task finished coro=<AsyncServer._handle_event_internal() done, defined at /usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py:502> exception=AttributeError("'list' object has no attribute 'num_train_examples'",)>
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py", line 504, in _handle_event_internal
r = await server._trigger_event(data[0], namespace, sid, *data[1:])
File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_server.py", line 547, in _trigger_event
event, *args)
File "/usr/local/lib/python3.6/site-packages/socketio/asyncio_namespace.py", line 37, in trigger_event
ret = await handler(*args)
File "/home/plato/plato/servers/base.py", line 59, in on_client_payload_done
data['obkey'])
File "/home/plato/plato/servers/base.py", line 446, in client_payload_done
await self.process_reports()
File "/home/plato/plato/servers/mistnet.py", line 40, in process_reports
sampler = all_inclusive.Sampler(feature_dataset)
File "/home/plato/plato/samplers/all_inclusive.py", line 18, in __init__
self.all_inclusive = range(dataset.num_train_examples())
AttributeError: 'list' object has no attribute 'num_train_examples'
@XinYao1994 please help take a look.
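For context, the traceback shows process_reports passing the raw list of extracted features straight to all_inclusive.Sampler, which expects a dataset object exposing a num_train_examples() method. Below is a minimal sketch of the kind of wrapper that would satisfy that interface; the FeatureDataset name is illustrative, not the actual class used in the fix:

class FeatureDataset:
    """Wraps a plain list of (feature, target) pairs so it exposes the
    interface expected by plato.samplers.all_inclusive.Sampler."""

    def __init__(self, features):
        self.features = features

    def num_train_examples(self):
        # The sampler builds range(dataset.num_train_examples()).
        return len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index]

# e.g. sampler = all_inclusive.Sampler(FeatureDataset(feature_list))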
@Poorunga @llhuii Please make sure you have used the latest version, because it has been fixed here.