Results 126 comments of ldwang

> I ran into a similar issue with single-node training. Do you have any insight into how to fix it?
>
> AssertionError: failed to get register_center_actor: t2Lrtj_register_center in [{'name': 't2LrtjWorkerDict_0:0', 'namespace': '47191b06-6a6f-407c-96b2-afc6319e5bc5'}]...

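Thanks for using the model and for the feedback.

1. From the usage you showed, you are loading the model from Hugging Face, so please confirm that the model and configuration files have all been downloaded completely. The file reported as missing does exist on Hugging Face, so I suspect the download was incomplete. ![image](https://github.com/user-attachments/assets/a8c94b16-a3fe-4be8-8031-01f5dd6a3112)
2. The model needs to run on CUDA; we have not tried it on CPU. For CPU use, we generally go with other inference frameworks such as llama.cpp.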
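One way to check whether a local download is complete is to compare the directory against the file list shown on the model's Hugging Face page. The sketch below is only an illustration: the helper name `missing_files` and the example file names are assumptions, not part of the model's code, and the expected list has to be copied by hand from the repository page.

```python
from pathlib import Path


def missing_files(model_dir, expected):
    """Return the expected file names that are absent or empty in model_dir.

    `expected` is a list of file names taken by hand from the model page;
    nothing here queries Hugging Face itself.
    """
    root = Path(model_dir)
    missing = []
    for name in expected:
        path = root / name
        # A zero-byte file usually means the download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

For example, `missing_files("A", ["config.json", "predict.py"])` returns the subset of those two (hypothetical) names that still need to be downloaded into directory `A`.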

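Assuming the model was downloaded to directory A, the error saying that `from predict import predict` cannot be found can be worked around by running `PYTHONPATH=A python main.py`.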
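Setting `PYTHONPATH` works because Python adds that directory to its module search path, which lets the `predict` module stored next to the model weights be imported. The same effect can be had from inside a script; the following is a minimal sketch, where the helper name `import_predict` is hypothetical and the only things taken from the comment above are the module name `predict` and the function name `predict`.

```python
import importlib
import sys
from pathlib import Path


def import_predict(model_dir):
    """Put model_dir on sys.path, then import `predict` from predict.py there."""
    model_dir = str(Path(model_dir).resolve())
    if model_dir not in sys.path:
        # Prepend so the model's own predict.py wins over same-named modules.
        sys.path.insert(0, model_dir)
    importlib.invalidate_caches()
    module = importlib.import_module("predict")
    return module.predict
```

Calling `predict = import_predict("A")` at the top of `main.py` should then behave like launching the script with `PYTHONPATH=A`, provided no other module named `predict` has already been imported.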

@TING2938 Is there a tool available for converting checkpoints that include optimizer states while changing the tensor parallelism (TP) and pipeline parallelism (PP) configurations? Thank you.