ZHIPENG MIAO issues

Results: 7 issues of ZHIPENG MIAO

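For a horizontal (homo) LR model trained by three parties A, B, and C (A is the GUEST; B and C are both HOSTs), do B and C both need to be online when A initiates an online prediction?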

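A holds the complete model and supplies the complete feature set. Can the online prediction be completed on A's side alone, without B and C participating? If this is possible, how should it be configured? Thanks.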
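To make the question concrete: in a horizontal setting every party shares the same feature space, so the trained LR model is an ordinary weight vector plus an intercept. The sketch below shows what a purely local prediction on A would compute under that assumption. It is not FATE's or FATE Serving's API; `predict_homo_lr` and the weight and feature names are hypothetical and exist only for illustration.

```python
import math
from typing import Mapping, Tuple


def predict_homo_lr(
    weights: Mapping[str, float],
    intercept: float,
    features: Mapping[str, float],
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Score one sample locally with a plain logistic regression model.

    Assumption: the exported model reduces to per-feature weights and an
    intercept, and the caller supplies every feature the model uses.
    """
    missing = set(weights) - set(features)
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")

    # Linear score z = b + sum_i w_i * x_i
    z = intercept + sum(w * features[name] for name, w in weights.items())

    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        prob = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        prob = e / (1.0 + e)

    return prob, int(prob >= threshold)


# Hypothetical weights and input, for illustration only.
model_weights = {"x0": 0.8, "x1": -0.3}
prob, label = predict_homo_lr(model_weights, 0.1, {"x0": 1.2, "x1": 0.5})
```

Whether the serving component can actually be configured to run this way, without contacting B and C, is the open question of the issue.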

[Help] <Running out of RAM during full-model fine-tuning with deepspeed>

### Is there an existing issue for this?

- [X] I have searched the existing issues

### Current Behavior

The environment has five RTX 3090 GPUs and over 300 GB of RAM. I use deepspeed with zero3 and offload both the optimizer and the model parameters to RAM.

![image](https://github.com/THUDM/ChatGLM-6B/assets/45615979/65b394af-638c-4552-ae0c-2b8a57dc0abf)

When full-model fine-tuning with deepspeed reaches step 1000 and writes a checkpoint, the file write fails. On investigation, the cause turned out to be insufficient RAM. Could the authors say how much RAM the server had when they ran deepspeed?

![image](https://github.com/THUDM/ChatGLM-6B/assets/45615979/080b0781-5964-4489-82e7-ab14631aee06) ![image](https://github.com/THUDM/ChatGLM-6B/assets/45615979/c8c4e7f4-031d-4747-a7ca-445d76c4cea2)

### Expected Behavior

I expect the training to run to completion...

[BUG] PytorchStreamWriter failed writing file data/3: file write failed

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Current Behavior

[2023-06-02 00:34:14,470] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./output/deepspeed/checkpoint-1428/global_step1428/zero_pp_rank_0_mp_rank_00_model_states.pt [2023-06-02...

[Bug] OSError: could not get source code

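### Is there an existing issue for this?

- [X] I have searched the existing issues

### Current Behavior

![image](https://github.com/THUDM/ChatGLM-6B/assets/45615979/1db2ff06-24b2-4f25-bb0b-c7c7d59f2a71)

After running sh ds_train_finetune.sh, this problem occurs with high probability and training cannot start. After rerunning it several times (about five or six), the problem disappears and training starts.

Once training has finished, running sh ds_train_finetune.sh again is very likely to hit the same problem, and training again cannot start. After about five or six more attempts the problem disappears and training can run again.

The following problem behaves the same way and also goes away after several reruns: ModuleNotFoundError: No module named 'transformers_modules.model.configuration_chatglm'...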

homo-xgb fails with ‘pika.exceptions.StreamLostError: Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')’

**Describe the bug** When running homo-xgb on a two-party dataset of 400,000 rows, the arbiter reports ‘pika.exceptions.StreamLostError: Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')’ **To Reproduce** Steps to reproduce the behavior: 1. conf file `{ "job_parameters":{ "common":{ "job_type":"train", "model_version":"202303311649128678570", "pulsar_run":{}, "auto_retries":0, "computing_engine":"SPARK", "model_id":"arbiter-1640046801#guest-1639995892#host-1639998474#model",...

How can I add a new sort operator to eggroll?

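I want to add a method to FATE similar to sortBy in Spark, but with eggroll as the computing engine this is probably not well supported. 1. Are there plans to add more operators to eggroll? 2. If I want to add an operator to eggroll myself, what are the steps? Thank you very much!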
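For reference, a distributed sortBy is commonly built as a sample sort: sample keys from every partition, pick range boundaries, route each record to the partition that owns its key range, then sort each partition locally. The sketch below runs that scheme over in-memory Python lists standing in for partitions. It does not use eggroll's API; `sort_by` and its parameters are hypothetical names chosen for illustration.

```python
import bisect
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def sort_by(
    partitions: Sequence[Sequence[T]],
    key: Callable[[T], float],
    num_out: int,
    sample_per_partition: int = 20,
    seed: int = 0,
) -> List[List[T]]:
    """Range-partition records by key, then sort each output partition.

    Concatenating the returned partitions in order yields a globally
    sorted sequence.
    """
    if num_out < 1:
        raise ValueError("num_out must be at least 1")
    rng = random.Random(seed)

    # Step 1: sample keys from every input partition.
    sample: List[float] = []
    for part in partitions:
        k = min(sample_per_partition, len(part))
        sample.extend(key(x) for x in rng.sample(list(part), k))
    sample.sort()

    # Step 2: choose num_out - 1 boundaries that split the sample evenly.
    bounds: List[float] = []
    if sample and num_out > 1:
        bounds = [sample[(i * len(sample)) // num_out] for i in range(1, num_out)]

    # Step 3: "shuffle" each record to the partition owning its key range.
    out: List[List[T]] = [[] for _ in range(num_out)]
    for part in partitions:
        for x in part:
            out[bisect.bisect_right(bounds, key(x))].append(x)

    # Step 4: sort each output partition locally.
    for bucket in out:
        bucket.sort(key=key)
    return out


parts = [[5, 1, 9], [3, 7], [8, 2, 6, 4]]
sorted_parts = sort_by(parts, key=lambda v: v, num_out=3)
```

In an engine that exposes only map and reduce style primitives, steps 1 and 4 map onto per-partition operations, while step 3 needs some way to repartition by key, which is the part an added operator would have to provide.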

Where can I find “data sharding and reconstruction support both offline and online modes”?

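In README.md, the section on MPC-based training says: “After the secure training and inference work is completed, the model (or prediction results) will be output by the computing parties in encrypted form. Result parties can collect the encrypted results, decrypt them using the tools in PFM, and deliver the plaintext results to users (currently, data sharding and reconstruction support both offline and online modes).” In which demo can the "online data sharding and reconstruction" be seen? https://github.com/PaddlePaddle/PaddleFL/blob/master/README_cn.md#c-%E7%BB%93%E6%9E%9C%E9%87%8D%E6%9E%84 ![image](https://user-images.githubusercontent.com/45615979/156875458-3556ad89-73b0-4eb3-9651-c5e0d75fef70.png)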
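For readers unfamiliar with the terms, "sharding" here means splitting a plaintext value into secret shares and "reconstruction" means recombining the shares into plaintext. A minimal sketch using plain additive secret sharing over a 64-bit ring follows. It does not use the PFM tools and may differ from the scheme PaddleFL actually implements; the ring size, the fixed-point scale, and all function names are assumptions made for illustration.

```python
import secrets
from typing import List

MODULUS = 2 ** 64  # ring size (assumption for this sketch)
SCALE = 2 ** 16    # fixed-point scaling factor (assumption for this sketch)


def encode_fixed(x: float) -> int:
    """Map a float to a fixed-point element of the ring."""
    return round(x * SCALE) % MODULUS


def decode_fixed(v: int) -> float:
    """Map a ring element back to a float, treating the upper half as negative."""
    signed = v - MODULUS if v >= MODULUS // 2 else v
    return signed / SCALE


def make_shares(x: float, n_parties: int = 3) -> List[int]:
    """Shard x into n_parties additive shares that sum to x modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((encode_fixed(x) - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: List[int]) -> float:
    """Recombine all shares into the plaintext value."""
    return decode_fixed(sum(shares) % MODULUS)


shares = make_shares(3.25)       # one share per computing party
plaintext = reconstruct(shares)  # done by the result party
```

Any strict subset of the shares is uniformly random and reveals nothing about the value. In this sketch, "offline" versus "online" would only change when and where `make_shares` and `reconstruct` are called, not the arithmetic.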