secretflow: program does not complete when running federated learning with sf.tune

Issue Type

Bug

Source

source

Secretflow Version

latest

OS Platform and Distribution

wsl2

Python version

3.8.13

Bazel version

No response

GCC/Compiler version

No response

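What happened and What you expected to happen.

Inside the secretflow Docker container, I tried to use secretflow.tune to tune hyperparameters for the sample program "Federated learning with the PyTorch backend", but the program never terminates. Running the sample on its own finishes within a few minutes, and running a simple math expression through secretflow.tune also succeeds.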

Reproduction code to reproduce the issue.

import secretflow as sf
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address='local')

from secretflow.ml.nn.utils import BaseModule, TorchModel
from secretflow.ml.nn.fl.utils import metric_wrapper, optim_wrapper
from secretflow.ml.nn import FLModel
from torchmetrics import Accuracy, Precision
from secretflow.security.aggregation import SecureAggregator
from secretflow.utils.simulation.datasets import load_mnist
from torch import nn, optim
from torch.nn import functional as F
from secretflow import tune

class ConvNet(BaseModule):
	def __init__(self):
		super(ConvNet, self).__init__()
		self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
		self.fc_in_dim = 192
		self.fc = nn.Linear(self.fc_in_dim, 10)
	def forward(self, x):
		x = F.relu(F.max_pool2d(self.conv1(x), 3))
		x = x.view(-1, self.fc_in_dim)
		x = self.fc(x)
		return F.softmax(x, dim=1)
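
# PYU handles used in `parts` below (missing from the pasted snippet)
alice, bob = sf.PYU('alice'), sf.PYU('bob')
(train_data, train_label), (test_data, test_label) = load_mnist(
        parts={alice: 0.4, bob: 0.6},
        normalized_x=True,
        categorical_y=True,
        is_torch=True,
)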

def trainable(config):
	alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')
	loss_fn = nn.CrossEntropyLoss
	optim_fn = optim_wrapper(optim.Adam, lr=config["lr"])
	model_def = TorchModel(
                model_fn=ConvNet,
                loss_fn=loss_fn,
                optim_fn=optim_fn,
                metrics=[
                        metric_wrapper(Accuracy, task="multiclass", num_classes=10, average='micro'),
                        metric_wrapper(Precision, task="multiclass", num_classes=10, average='micro'),
                ],
        )
	device_list = [alice, bob]
	server = charlie
	aggregator = SecureAggregator(server, [alice, bob])
	fl_model = FLModel(
                server=server,
                device_list=device_list,
                model=model_def,
                aggregator=aggregator,
                strategy='fed_avg_w',
                backend="torch",
        )
	history = fl_model.fit(
                train_data,
                train_label,
                validation_data=(test_data, test_label),
                epochs=5,
                batch_size=64,
                aggregate_freq=1,
        )
	acc = history["global_history"]['multiclassaccuracy'][4]
	return {"mean_accuracy": acc}

tuner = tune.Tuner(
        trainable,
        cluster_resources=[
                {'alice': 1, 'CPU': 16},
                {'bob': 1, 'CPU': 16},
                {'charlie': 1, 'CPU': 12},
        ],
        param_space={"lr": tune.grid_search([1e-2, 2e-2])},
)
tuner.fit()

Mar 04 '24 10:03 shnnosuke34725

OK, received. We will look into it internally.

Mar 05 '24 03:03 Chrisdehe

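@shnnosuke34725 Hi, sorry for the wait. The root cause of the run never terminating is that the specified resource configuration is insufficient, so tune cannot allocate suitable resources. In the cluster_resources parameter of tune.Tuner you specified three dicts, which means that each experiment (that is, each trainable) will start three kinds of remote ray workers, and each worker will use the resources in one of these three dicts. Specifying it this way is correct, but during FL training the alice, bob, and charlie labels are each consumed more than once, whereas you specified 'alice': 1, so the resources cannot be allocated. In each dict, the value of the party label should match the value of the CPU label; with that change the resources should be allocated normally.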
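As a rough sketch of the advice above, a small helper can build the cluster_resources list so that each party label carries the same count as CPU. The helper name build_cluster_resources and the CPU numbers are illustrative assumptions and are not part of secretflow; only the shape of the dicts follows the original script.

```python
from typing import Dict, List


def build_cluster_resources(cpus_per_party: Dict[str, int]) -> List[Dict[str, int]]:
    """Return one resource dict per party, with the party label equal to its CPU count."""
    resources = []
    for party, cpus in cpus_per_party.items():
        if cpus < 1:
            raise ValueError(f"{party} needs at least 1 CPU, got {cpus}")
        # Party label and 'CPU' get the same value, as suggested in the reply above.
        resources.append({party: cpus, "CPU": cpus})
    return resources


# Hypothetical CPU budget, reusing the numbers from the original script.
cluster_resources = build_cluster_resources({"alice": 16, "bob": 16, "charlie": 12})
print(cluster_resources)
```

The resulting list would then be passed as cluster_resources to tune.Tuner in place of the dicts that used 'alice': 1, 'bob': 1, and 'charlie': 1.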

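Other possible problems:

In the script provided, the total number of CPUs is 16+16+12=44. If you do not specify the number of CPUs in sf.init, make sure your machine really has that many CPUs; otherwise requesting more resources than are available may also cause problems.
The trainable experiment is ultimately dispatched by ray for remote execution. Data-loading functions such as load_mnist should be called inside trainable. Calling them outside involves a large amount of copying, and ray has an upper limit on how much it can copy; this script most likely exceeds it.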
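The CPU arithmetic in the first point can be checked before launching the tuner. The following is a minimal sketch using only the standard library; total_requested_cpus and fits_on_machine are hypothetical helper names, and os.cpu_count() is only an approximation of what ray will actually detect inside a container or WSL2.

```python
import os
from typing import Dict, List, Optional


def total_requested_cpus(cluster_resources: List[Dict[str, int]]) -> int:
    """Sum the 'CPU' entries across all resource dicts of one experiment."""
    return sum(r.get("CPU", 0) for r in cluster_resources)


def fits_on_machine(
    cluster_resources: List[Dict[str, int]], available: Optional[int] = None
) -> bool:
    """Check whether one experiment's CPU request fits into the available CPUs."""
    if available is None:
        # Fall back to the logical CPU count reported by the OS.
        available = os.cpu_count() or 1
    return total_requested_cpus(cluster_resources) <= available


# Resource dicts copied from the original script.
requested = [
    {"alice": 1, "CPU": 16},
    {"bob": 1, "CPU": 16},
    {"charlie": 1, "CPU": 12},
]
print(total_requested_cpus(requested))  # 44
print(fits_on_machine(requested, available=8))  # False
```

If the check fails, either lower the CPU values in cluster_resources or pass an explicit CPU count to sf.init, as mentioned above.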

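Also, if this is only for testing and research, consider calling sf.init(debug_mode=True) and tuning with debug mode plus sf tune. This avoids much of the complicated resource specification: you only need to specify the number of CPUs for each experiment.

Thank you for reporting this issue. If you have a suitable use case, feel free to contact us with your questions and requirements.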

Mar 05 '24 07:03 fy222fy

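For a more in-depth discussion, please add SecretFlow technical support (secretflow02), and include your company/name and the issue number in the note.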

Mar 05 '24 07:03 Chrisdehe

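Also, if you are not using the resource allocation feature and are only testing, cluster_resources can be left out; the default is a parallelism of one.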

Mar 05 '24 10:03 fy222fy

It works now, thank you very much!!

Mar 06 '24 03:03 shnnosuke34725
