ldwang comments

Results 126 comments of ldwang

Sentence deduplication output

mark

Bug: Default Adapter assumes type of metadata column in source data

Same problem here.

Issue with Multi-node training

> I ran into similar issue with single node training. Do you have any insight to fix?
>
> AssertionError: failed to get register_center_actor: t2Lrtj_register_center in [{'name': 't2LrtjWorkerDict_0:0', 'namespace': '47191b06-6a6f-407c-96b2-afc6319e5bc5'}]...

Are you sure the code you provided can actually run successfully?

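Thanks for using it and for the feedback.

1. From the usage you showed, you are loading the model from Hugging Face, so please confirm that the model and configuration files have all finished downloading. The file reported as missing does exist on Hugging Face, so we suspect the download was incomplete. ![image](https://github.com/user-attachments/assets/a8c94b16-a3fe-4be8-8031-01f5dd6a3112)
2. The model needs to run on CUDA; we have not tried it on CPU. For CPU use we generally rely on other inference frameworks such as llama.cpp.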
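One way to check the first point (whether the download is complete) is to compare the local model directory against the file list shown on the model's Hugging Face page. The sketch below is only an illustration: the helper name `find_missing_files` and the example file names are assumptions, not part of the project, and the real list of required files should be copied from the model page.

```python
from pathlib import Path


def find_missing_files(model_dir, expected_files):
    """Return the expected files that are absent or empty in model_dir.

    `expected_files` is supplied by the caller (hypothetical here); copy the
    real names from the model's file listing.
    """
    root = Path(model_dir)
    missing = []
    for name in expected_files:
        path = root / name
        # A file that is missing or has zero bytes usually means the
        # download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical directory and file names, for illustration only.
    print(find_missing_files("./model_dir", ["config.json", "predict.py"]))
```

An empty list means every expected file is present and non-empty; it does not verify checksums, so a truncated but non-empty file would still pass.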

Are you sure the code you provided can actually run successfully?

Assuming the model was downloaded to directory A, for the error saying that "from predict import predict" cannot be found, you can try PYTHONPATH=A python main.py
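Setting `PYTHONPATH` works because Python adds that directory to its module search path at startup. If changing the environment is inconvenient, roughly the same effect can be had from inside the script before the import runs. This is a minimal sketch; the helper name `add_model_dir_to_path` is an assumption for illustration, and `A` stands for the model download directory as above.

```python
import sys
from pathlib import Path


def add_model_dir_to_path(model_dir):
    """Prepend model_dir to sys.path so modules stored in it can be imported.

    Returns the absolute path that was added (or was already present).
    """
    resolved = str(Path(model_dir).resolve())
    if resolved not in sys.path:
        # Insert at the front so this directory is searched first,
        # similar to how PYTHONPATH entries precede site-packages.
        sys.path.insert(0, resolved)
    return resolved
```

In `main.py` this would be called as `add_model_dir_to_path("A")` before the line `from predict import predict`, assuming `predict.py` sits directly inside that directory.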

add megatron_mcore saver/loader

@TING2938 Is there a tool available for converting checkpoints that include optimizer states while changing the tensor parallelism (TP) and pipeline parallelism (PP) configurations? Thank you.