FATE
1.7.0 issue during debug: `RuntimeError: table not exist:` after adding breakpoints in IDE
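Describe the bug
On 1.6.0, I debugged a single component (for example hetero_lr) in an IDE (PyCharm or VS Code) by looking up the relevant information in the schedule log and then running task_executor.py with the task_id, party and other arguments specified. That worked fine on 1.6.0.
On 1.7.0 the same approach usually works as well. Running from the IDE, or debugging without breakpoints, causes no problems, and the task eventually succeeds.
However, once a breakpoint is set, that run still completes without problems, but the second run of task_executor.py fails with `RuntimeError: table not exist:`.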
To Reproduce
Steps to reproduce the behavior:
- Use FATE 1.7.0.
- Run the job once: `flow job submit -c hetero_lr_normal_conf.json -d hetero_lr_normal_dsl.json`
- From fate_flow_schedule.log, get the arguments that task_executor.py was run with for hetero_lr.
- Run task_executor.py from the IDE with those arguments.
- Up to this point the run should succeed.
- Set a breakpoint in hetero_lr_guest.py.
- Debug in the IDE.
- The first debug run should also work.
- Debug again: `RuntimeError: table not exist:` is raised.
Traceback (most recent call last):
File "/root/standalone_fate_install_1.7.0_release/fateflow/python/fate_flow/worker/task_executor.py", line 195, in _run_
cpn_output = run_object.run(cpn_input)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/model_base.py", line 209, in run
method(cpn_input)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/model_base.py", line 247, in _run
this_data_output = func(*params)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_base.py", line 170, in fit
self.fit_binary(data_instances, validate_data)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_base.py", line 249, in fit_binary
cipher=self.cipher_tool[batch_idx])
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_host.py", line 77, in forward
self.fixedpoint_encoder)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/secureprotol/spdz/secure_matrix/secure_matrix.py", line 143, in from_source
cipher=cipher)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/secureprotol/spdz/tensor/fixedpoint_table.py", line 384, in from_source
share = spdz.communicator.get_share(tensor_name=tensor_name, party=source)[0]
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/secureprotol/spdz/communicator/federation.py", line 59, in get_share
return self._share_variable.get_parties(party, suffix=(tensor_name,))
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/federation/transfer_variable.py", line 241, in get_parties
name=name, tag=tag, parties=parties, gc=self._get_gc
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/federation/standalone/_federation.py", line 49, in get
rtn = self._federation.get(name=name, tag=tag, parties=parties)
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/_standalone.py", line 588, in get
session=self._session, name=r[0], namespace=r[1], need_cleanup=True
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/_standalone.py", line 699, in _load_table
raise RuntimeError(f"table not exist: name={name}, namespace={namespace}")
RuntimeError: table not exist: name=9a3784b8-7e7d-11ec-aa3a-00505682f538, namespace=202201261546330932510_hetero_sshe_lr_0_0_guest_9999
finish hetero_sshe_lr_0 202201261546330932510_hetero_sshe_lr_0 0 on host 10000 with failed
Process finished with exit code 0
Additional context
My guess is that this is caused by the worker and session mechanisms introduced in fate_flow 1.7.
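Debugging is somewhat intrusive and may have affected some state. We will try to follow up and find the cause when we have time. If you find anything, please share it here. Thanks.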
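update:
I found a temporary workaround: when the error occurs, delete all files under the data/${job_id}_${component_name} directory, for example data/202202090957301709460_hetero_sshe_lr_0_0. If necessary, also delete data/202202090957301709460_hetero_sshe_lr_0_0_host_10000 and data/202202090957301709460_hetero_sshe_lr_0_0_guest_9999.
Explanation:
With a normal launch, a direct run of task_executor.py, or a debug run without breakpoints, the related files under the data directory are deleted automatically once the program finishes successfully.
However, if the program is aborted midway or debugged with a breakpoint, the related files under the data directory are not fully cleaned up, as the screenshot shows.
When task_executor.py is then run again, the guest fails with `RuntimeError: table not exist: name=9d27f475-8969-11ec-aa3a-00505682f538, namespace=202202090957301709460_hetero_sshe_lr_0_0_host_10000`. Deleting the related files manually resolves it.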
Connected to pydev debugger (build 201.8743.20)
Traceback (most recent call last):
File "/root/standalone_fate_install_1.7.0_release/fateflow/python/fate_flow/worker/task_executor.py", line 195, in _run_
cpn_output = run_object.run(cpn_input)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/model_base.py", line 209, in run
method(cpn_input)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/model_base.py", line 247, in _run
this_data_output = func(*params)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_base.py", line 170, in fit
self.fit_binary(data_instances, validate_data)
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_base.py", line 249, in fit_binary
cipher=self.cipher_tool[batch_idx])
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/linear_model/logistic_regression/hetero_sshe_logistic_regression/hetero_lr_guest.py", line 81, in forward
z=None)[0]
File "/root/standalone_fate_install_1.7.0_release/fate/python/federatedml/secureprotol/spdz/secure_matrix/secure_matrix.py", line 109, in share_encrypted_matrix
suffix=(var_name,) + current_suffix)
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/federation/transfer_variable.py", line 313, in get
rtn = self.get_parties(parties=src_parties[idx], suffix=suffix)[0]
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/federation/transfer_variable.py", line 241, in get_parties
name=name, tag=tag, parties=parties, gc=self._get_gc
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/federation/standalone/_federation.py", line 49, in get
rtn = self._federation.get(name=name, tag=tag, parties=parties)
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/_standalone.py", line 588, in get
session=self._session, name=r[0], namespace=r[1], need_cleanup=True
File "/root/standalone_fate_install_1.7.0_release/fate/python/fate_arch/_standalone.py", line 699, in _load_table
raise RuntimeError(f"table not exist: name={name}, namespace={namespace}")
RuntimeError: table not exist: name=9d27f475-8969-11ec-aa3a-00505682f538, namespace=202202090957301709460_hetero_sshe_lr_0_0_host_10000
finish hetero_sshe_lr_0 202202090957301709460_hetero_sshe_lr_0 0 on guest 9999 with failed
Process finished with exit code 0
Environment
standalone, debugging in an IDE
Could the problem be in gc or need_cleanup?
btw, I don't know how to fix this in the code.
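The manual cleanup described in the update above could be scripted roughly as follows. This is only a sketch: `find_leftover_dirs`, `remove_leftover_dirs` and the `dry_run` flag are hypothetical helpers, not part of FATE, and the location of the data directory depends on the installation. The only thing taken from the report is that leftover directories are named either exactly `${job_id}_${component_name}...` or with a `_host_<id>` / `_guest_<id>` suffix.

```python
import shutil
from pathlib import Path


def find_leftover_dirs(data_root, namespace_prefix):
    """List directories under data_root that belong to one task.

    A directory matches if its name equals namespace_prefix or starts with
    namespace_prefix + "_" (for example the _host_10000 / _guest_9999 variants).
    """
    root = Path(data_root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_dir()
        and (p.name == namespace_prefix or p.name.startswith(namespace_prefix + "_"))
    )


def remove_leftover_dirs(data_root, namespace_prefix, dry_run=True):
    """Delete the matching directories; with dry_run=True only report them."""
    names = []
    for path in find_leftover_dirs(data_root, namespace_prefix):
        if not dry_run:
            shutil.rmtree(path)  # removes the directory and everything in it
        names.append(path.name)
    return names
```

Calling `remove_leftover_dirs(data_root, "202202090957301709460_hetero_sshe_lr_0_0")` first with the default `dry_run=True` prints nothing and deletes nothing; it only returns the names that would be removed, so the list can be checked before rerunning with `dry_run=False`.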
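Yes, it is most likely related to need_cleanup. It relies on a callback from Python's object garbage collection, but that mechanism is not reliable, and debugging may block it. What makes it worse is that federation also involves multiple processes: when the first process reaches the breakpoint, all the other processes are suspended as well. Eliminating the problem completely would probably require redesigning the standalone mechanism.
As a temporary measure, you can turn need_cleanup off. The side effect is that local data is no longer cleaned up automatically, so you have to clean it up manually from time to time.
This issue will stay open; perhaps we can find a clean solution at some point in the future.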
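To illustrate why cleanup tied to garbage collection is fragile, here is a minimal, self-contained sketch. It does not reproduce FATE's internals: `TempTable`, `store` and both functions are hypothetical stand-ins, and `weakref.finalize` is used only as one common way to hook cleanup onto object collection. The point is that the cleanup runs only once the last reference disappears, so anything that keeps the object alive (for instance a frame held while paused at a breakpoint) postpones it, whereas an explicit `try`/`finally` does not depend on the collector.

```python
import gc
import weakref
from contextlib import contextmanager


class TempTable:
    """Hypothetical temporary table whose cleanup is tied to garbage collection."""

    def __init__(self, name, store):
        self.name = name
        store.add(name)
        # The callback fires when this object is collected, not before.
        weakref.finalize(self, store.discard, name)


def demo_gc_cleanup():
    """Return (still_present_while_referenced, removed_after_last_ref_dropped)."""
    store = set()
    table = TempTable("t1", store)
    extra_ref = table          # stands in for a reference held elsewhere
    del table
    gc.collect()
    still_present = "t1" in store   # cleanup has not run: extra_ref keeps it alive
    del extra_ref
    gc.collect()
    removed = "t1" not in store     # now the finalizer has run
    return still_present, removed


@contextmanager
def explicit_temp_table(name, store):
    """Alternative sketch: deterministic cleanup that does not rely on the collector."""
    store.add(name)
    try:
        yield name
    finally:
        store.discard(name)
```

`demo_gc_cleanup()` shows the delayed cleanup, while `with explicit_temp_table("t2", store): ...` removes the entry as soon as the block exits, even if the body raises.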
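Thanks for the late-night reply!
BTW, I had already tried turning need_cleanup off, and it does not solve the problem: the run then fails with `encrypted_number was encrypted against a different key!`. My guess is that it picks up encrypted content from the previous run and cannot decrypt it. So for now I am leaving the code unchanged and deleting the related files manually.
This bug only shows up when debugging and I cannot fix it, so I will leave this issue alone.
By the way, is there any documentation or slide deck on storage, Table and federation in FATE?
Thanks anyway.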
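By the way, is there any documentation or slide deck on storage, Table and federation in FATE?
I don't have any dedicated material either; treat the code as the reference.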
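Hello, I ran into this problem on the 1.6 Docker version as well. However, 1.6 does not seem to have a data folder, and you also said that 1.6 works fine. Is there any setting I need to change? Many thanks.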