Error when using a custom dataset and model

Open LGDchampion opened this issue 1 year ago • 7 comments

LGDchampion avatar Nov 15 '23 07:11 LGDchampion

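I want to run horizontal federated training for a text classification task with a BERT model, using my own dataset and my own model. Following the tutorial, I subclassed the Dataset class and wrote the model as an nn.Module. Training in local mode with the following calls succeeds:

    trainer = FedAVGTrainer(epochs=2, batch_size=64, shuffle=True, data_loader_worker=8, pin_memory=False)
    trainer.local_mode()
    trainer.train(train_set=dataset, optimizer=optimizer, loss=loss)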

LGDchampion avatar Nov 15 '23 07:11 LGDchampion

However, when I follow the tutorial "Submit a Homo-NN Task with Custom Model" to simulate federated learning on a standalone deployment, it fails. The call to pipeline.fit() raises:

    ValueError: Job is failed, please check out job 202311140705437026950 by fate board or fate_flow cli

I put the file containing the Dataset subclass in /data/projects/fate/fate/python/federatedml/nn/dataset and the file defining the model class in /data/projects/fate/fate/python/federatedml/nn/model_zoo, then read the data like this:

    fate_project_path = os.path.abspath('./')
    data_0 = {"name": 'toutiao_guess', "namespace": "experiment"}
    data_1 = {"name": "toutiao_host", "namespace": "experiment"}

    data_path_0 = fate_project_path + '/toutiao_cat_data.txt'
    data_path_1 = fate_project_path + '/toutiao_cat_data.txt'
    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=data_0)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=data_1)
    dataset_param = DatasetParam(dataset_name='bert_dataset')

Setting up the model:

    import torch as t
    from torch import nn
    from pipeline import fate_torch_hook
    t = fate_torch_hook(t)
    model = t.nn.Sequential(
        t.nn.CustModel(module_name='bert_model', class_name='BertClassifier')
    )

Configuring the training parameters and training:

    nn_component = HomoNN(name='nn_0',
                          model=model,  # model
                          loss=t.nn.CrossEntropyLoss(),  # loss
                          optimizer=t.optim.Adam(model.parameters(), lr=LR),  # optimizer
                          dataset=dataset_param,  # dataset
                          trainer=TrainerParam(trainer_name='fedavg_trainer', epochs=2, batch_size=64, validation_freqs=1),
                          torch_seed=100  # random seed
                          )

    pipeline.add_component(reader_0)
    pipeline.add_component(nn_component, data=Data(train_data=reader_0.output.data))
    pipeline.add_component(Evaluation(name='eval_0', eval_type='multi'), data=Data(data=nn_component.output.data))

LGDchampion avatar Nov 15 '23 07:11 LGDchampion
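Because the custom model is referenced only by module_name and class_name, one cheap check before submitting the job is whether the file in model_zoo really defines that class. The sketch below is only an illustration built on Python's standard ast module. The helper name defines_class is hypothetical and not part of FATE, and it assumes that module_name maps to a file named <module_name>.py in the given directory.

```python
import ast
from pathlib import Path


def defines_class(directory, module_name, class_name):
    """Return True if <directory>/<module_name>.py defines a top-level class class_name.

    Hypothetical helper, not part of FATE. It parses the file without importing it,
    so it cannot catch errors that only appear at import time.
    """
    source_file = Path(directory) / f"{module_name}.py"
    if not source_file.is_file():
        return False
    tree = ast.parse(source_file.read_text(encoding="utf-8"))
    # Only top-level class definitions are considered.
    return any(
        isinstance(node, ast.ClassDef) and node.name == class_name
        for node in tree.body
    )
```

With the paths from this issue, the call would look like defines_class('/data/projects/fate/fate/python/federatedml/nn/model_zoo', 'bert_model', 'BertClassifier'). A True result only shows that the name exists; it says nothing about whether the class can be constructed.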

Could you share the error message from the job?

talkingwallace avatar Nov 16 '23 02:11 talkingwallace

    | ERROR | __main__:<module>:206 - An error has been caught in function '<module>', process 'MainProcess' (962511), thread 'MainThread' (140265679853376):
    Traceback (most recent call last):

      File "fate_bert_namespace.py", line 206, in <module>
        pipeline.fit()
        │        └ <function PipeLine.fit at 0x7f92003f6550>
        └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>

      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/backend/pipeline.py", line 585, in fit
        self._fit_status = self._job_invoker.monitor_job_status(self._train_job_id,
        │    │             │    │            │                  │    └ '202311140705437026950'
        │    │             │    │            │                  └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
        │    │             │    │            └ <function JobInvoker.monitor_job_status at 0x7f92003ec940>
        │    │             │    └ <pipeline.utils.invoker.job_submitter.JobInvoker object at 0x7f920464e550>
        │    │             └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
        │    └ None
        └ <pipeline.backend.pipeline.PipeLine object at 0x7f920464e580>
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/utils/invoker/job_submitter.py", line 94, in monitor_job_status
        raise ValueError(f"Job is failed, please check out job {job_id} by fate board or fate_flow cli")

    ValueError: Job is failed, please check out job 202311140705437026950 by fate board or fate_flow cli
    Traceback (most recent call last):
      File "fate_bert_namespace.py", line 206, in <module>
        pipeline.fit()
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/loguru/_logger.py", line 1251, in catch_wrapper
        return function(*args, **kwargs)
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/backend/pipeline.py", line 585, in fit
        self._fit_status = self._job_invoker.monitor_job_status(self._train_job_id,
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/fate_client-1.11.2-py3.8.egg/pipeline/utils/invoker/job_submitter.py", line 94, in monitor_job_status
        raise ValueError(f"Job is failed, please check out job {job_id} by fate board or fate_flow cli")
    ValueError: Job is failed, please check out job 202311140705437026950 by fate board or fate_flow cli

LGDchampion avatar Nov 16 '23 06:11 LGDchampion

The specific error for job 202311140705437026950 can be found under fateflow/logs or in FATE Board.

talkingwallace avatar Nov 17 '23 03:11 talkingwallace
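The job's log files can also be searched from a script rather than read by hand. The following is a minimal sketch, assuming the logs for a job sit in one directory tree as plain-text files ending in .log and that failing lines carry an [ERROR] marker. Both the file extension and the helper name find_error_lines are assumptions for illustration, not part of FATE.

```python
from pathlib import Path


def find_error_lines(log_dir, marker="[ERROR]"):
    """Yield (file path, line number, text) for every log line containing marker.

    Hypothetical helper, not part of FATE. log_dir is whatever directory holds
    the job's logs; the '*.log' pattern is an assumption about file naming.
    """
    for log_file in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log has stray bytes.
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if marker in line:
                    yield log_file, line_number, line.rstrip("\n")
```

Printing each hit with its file and line number gives a short list of places to start reading, which is usually quicker than opening every log for each role and component.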

I tracked down the earlier error: the path to the pretrained BERT model was wrong. After fixing it, the log shows that training completes the configured number of epochs, but a socket error follows:

    [ERROR] [2023-11-29 01:23:41,868] [202311290103155381610] [467339:140703877306176] - [task_executor.run] [line:266]: HTTPConnectionPool(host='xxx.xxx.xxx.xxx', port=9380): Read timed out. (read timeout=30.0)
    Traceback (most recent call last):
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 445, in _make_request
        six.raise_from(e, None)
      File "<string>", line 3, in raise_from
      File "/data/projects/fate/env/python/venv/lib/python3.8/site-packages/urllib3/connectionpool.py", line 440, in _make_request
        httplib_response = conn.getresponse()

        response.begin()
      File "/data/projects/fate/env/python/miniconda/lib/python3.8/http/client.py", line 316, in begin
        version, status, reason = self._read_status()
      File "/data/projects/fate/env/python/miniconda/lib/python3.8/http/client.py", line 277, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
      File "/data/projects/fate/env/python/miniconda/lib/python3.8/socket.py", line 669, in readinto
        return self._sock.recv_into(b)
    socket.timeout: timed out

LGDchampion avatar Nov 29 '23 01:11 LGDchampion
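Since the first failure came from a wrong path to the pretrained BERT model and only surfaced after the job was submitted, it may help to validate the path before calling pipeline.fit(). This is a minimal sketch, assuming the model is stored as a local directory containing a config.json file. That layout is the common Hugging Face convention, which this thread does not confirm, and check_pretrained_dir is a hypothetical helper name.

```python
from pathlib import Path


def check_pretrained_dir(path, required_files=("config.json",)):
    """Return the resolved model directory, or raise FileNotFoundError early.

    Hypothetical helper. required_files is an assumption about the directory
    layout and should be adjusted to whatever the model loader expects.
    """
    model_dir = Path(path).expanduser().resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"pretrained model directory not found: {model_dir}")
    missing = [name for name in required_files if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    return model_dir
```

Passing the returned absolute path into the model definition avoids depending on the working directory, which may differ between the script that submits the job and the process that runs it.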

May I ask how you solved that earlier problem? I am running into the same issue now, and it is really hard to figure out!

huhuiabc avatar May 16 '24 10:05 huhuiabc