FATE
Error when using a custom dataset and a custom model
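I want to run horizontal federated training of a BERT model for a text classification task, using my own dataset and my own model. Following the tutorial, I subclassed Dataset, wrote the model as an nn.Module, and then trained through this interface:

    trainer = FedAVGTrainer(epochs=2, batch_size=64, shuffle=True, data_loader_worker=8, pin_memory=False)
    trainer.local_mode()
    trainer.train(train_set=dataset, optimizer=optimizer, loss=loss)

This trains successfully.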
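However, when I follow the tutorial "Submit a Homo-NN Task with Custom Model" to simulate federated learning on a standalone deployment, the job fails. The call to pipeline.fit() raises:

    ValueError: Job is failed, please check out job 202311140705437026950 by fate board or fate_flow cli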
I placed the file containing the Dataset subclass in /data/projects/fate/fate/python/federatedml/nn/dataset and the file defining the model class in /data/projects/fate/fate/python/federatedml/nn/model_zoo, and then read the data like this:

    fate_project_path = os.path.abspath('./')
    data_0 = {"name": 'toutiao_guess', "namespace": "experiment"}
    data_1 = {"name": "toutiao_host", "namespace": "experiment"}

    data_path_0 = fate_project_path + '/toutiao_cat_data.txt'
    data_path_1 = fate_project_path + '/toutiao_cat_data.txt'
    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=data_0)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=data_1)
    dataset_param = DatasetParam(dataset_name='bert_dataset')
Setting up the model:

    import torch as t
    from torch import nn
    from pipeline import fate_torch_hook
    t = fate_torch_hook(t)
    model = t.nn.Sequential(
        t.nn.CustModel(module_name='bert_model', class_name='BertClassifier')
    )
Configuring the training parameters and training:

    nn_component = HomoNN(name='nn_0',
                          model=model,  # model
                          loss=t.nn.CrossEntropyLoss(),  # loss
                          optimizer=t.optim.Adam(model.parameters(), lr=LR),  # optimizer
                          dataset=dataset_param,  # dataset
                          trainer=TrainerParam(trainer_name='fedavg_trainer', epochs=2, batch_size=64, validation_freqs=1),
                          torch_seed=100  # random seed
                          )
    pipeline.add_component(reader_0)
    pipeline.add_component(nn_component, data=Data(train_data=reader_0.output.data))
    pipeline.add_component(Evaluation(name='eval_0', eval_type='multi'), data=Data(data=nn_component.output.data))
Could you share the error message from the job?
| ERROR | main:
File "fate_bert_namespace.py", line 206, in <module>
    pipeline.fit()
    │        └ <function PipeLine.fit at 0x7f92003f6550>
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/backend/pipeline.py", line 585, in fit
    self._fit_status = self._job_invoker.monitor_job_status(self._train_job_id,
    │    │             │    │            │                  │    └ '202311140705437026950'
    │    │             │    │            │                  └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
    │    │             │    │            └ <function JobInvoker.monitor_job_status at 0x7f92003ec940>
    │    │             │    └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f920464e550>
    │    │             └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
    │    └ None
    └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/utils/invoker/job_submitter.py", line 94, in monitor_job_status
    raise ValueError(f"Job is failed, please check out job {job_id} by fate board or fate_flow cli")
ValueError: Job is failed, please check out job 202311140705437026950 by fate board or fate_flow cli
Traceback (most recent call last):
File "fate_bert_namespace.py", line 206, in <module>
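The ValueError above only carries the job ID, so the real cause has to be looked up by that ID. A minimal sketch of pulling the ID out of the message, assuming the message keeps the wording shown in this traceback (the helper name `extract_job_id` is made up for illustration and is not part of fate_client):

```python
import re

# Matches the wording of the ValueError shown in the traceback above.
JOB_ID_PATTERN = re.compile(r"check out job (\d+)")


def extract_job_id(message):
    """Return the job ID found in a 'Job is failed' message, or None."""
    match = JOB_ID_PATTERN.search(message)
    return match.group(1) if match else None
```

One way to use it is to wrap pipeline.fit() in a try/except ValueError block and pass str(exc) to the helper, then look that job up in the logs or in FATEBoard.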
The specific error for job 202311140705437026950 can be found under fateflow/logs or in FATEBoard.
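For anyone who prefers the command line to FATEBoard, here is a rough sketch of collecting the error lines for one job. It assumes the logs for a job sit in a directory named after the job ID under the log root (for example a fateflow/logs directory), which may differ between deployments; `find_error_lines` is a hypothetical helper, not a FATE API.

```python
import os


def find_error_lines(log_root, job_id, marker="ERROR"):
    """Collect (path, line_number, text) for every line containing `marker`
    in any file under <log_root>/<job_id>. The directory layout is an assumption."""
    job_dir = os.path.join(log_root, job_id)
    hits = []
    # os.walk yields nothing if job_dir does not exist, so the result is then empty.
    for dirpath, _dirnames, filenames in os.walk(job_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", errors="replace") as handle:
                for number, text in enumerate(handle, start=1):
                    if marker in text:
                        hits.append((path, number, text.rstrip("\n")))
    return hits
```

Calling it with the log root and the job ID from the error message returns the matching lines together with the file each one came from, which helps narrow down which party and component failed.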
I tracked down the earlier error: the path to the pretrained BERT model was wrong.
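A wrong model path like this can be caught before submitting the job. The sketch below is only an illustration: `check_pretrained_dir` is a hypothetical helper, and the default file list assumes a Hugging Face style checkpoint directory that contains a config.json. A relative path is flagged because it is resolved against the working directory of whichever process loads the model, which may not be the directory the script was started from.

```python
import os


def check_pretrained_dir(path, required=("config.json",)):
    """Return a list of problems found with a local pretrained-model directory.
    An empty list means no problem was detected. `required` is an assumption."""
    problems = []
    if not os.path.isabs(path):
        problems.append(f"path is relative: {path!r}")
    if not os.path.isdir(path):
        problems.append(f"not a directory: {path!r}")
        return problems
    for name in required:
        if not os.path.isfile(os.path.join(path, name)):
            problems.append(f"missing file: {name}")
    return problems
```

Running the check with the same path string that the model class uses, on the machine where the job executes, gives a quick answer before a full training job is launched.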
After fixing that, the log shows that training completes the configured number of epochs, but then a socket error appears:
[ERROR] [2023-11-29 01:23:41,868] [202311290103155381610] [467339:140703877306176] - [task_executor.run] [line:266]: HTTPConnectionPool(host='xxx.xxx.xxx.xxx', port=9380): Read timed out. (read timeout=30.0)
Traceback (most recent call last):
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
    response.begin()
  File "/data/projects/fate/env/python/miniconda/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/data/projects/fate/env/python/miniconda/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/data/projects/fate/env/python/miniconda/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
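The log reports a read timeout, meaning the request was sent but no response arrived within 30 seconds. A plain TCP check like the sketch below does not fix that, but it may help separate "the service on that host and port is unreachable" from "the service is reachable but slow to answer". The helper name `can_connect` and the 5 second default are assumptions for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

If this returns True for the host and port from the log line while the job still times out, the delay is more likely on the server side, for example a request that takes longer than the 30 second read timeout, than in the network path.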
May I ask how you solved that earlier problem? I have run into the same issue now, and it is really hard to figure out!