Yilun Huang

Results 39 comments of Yilun Huang

Hi @kyo-tom, thanks for using Data-Juicer! Currently, Data-Juicer supports starting distributed processing **on the head node**, but it does **not support submitting data to the remote cluster** for distributed...

Hi @kyo-tom, sorry for the late reply. For now, Data-Juicer implements a customized data source for reading JSON in a streaming fashion, which requires ray==2.47.1. With this version, an error...

Closing because this PR was included in PR #550.

Closing because this PR was included in PR #550.

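> > Different operators in Data-Juicer depend on different models. The operators used in this single-machine pipeline differ from those used in the ray-mode demo, so the models required to run the ray demo were never downloaded. One option is to replace the corresponding part of the ray demo with the operators shown in the figure and run it; another is to copy the operators from the ray demo into the single-machine demo, run it there, and then try switching to ray mode. > > Thanks a lot! > > I have one more question: I modified the ExtractKeywordMapper operator, replacing the default OpenAI interface with a locally downloaded Hugging Face model (Qwen2.5-32B-Instruct-AWQ). Running it in single-machine mode produces the following error (the input is a single jsonl file of about 400 MB)...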

Hi, this info message was not emitted by Data-Juicer, and I can't find any useful information about it on the Internet. I think it is output by pyscenedetect or ffmpeg. I recommend you...

> Also, llava-pretrain and sbu558k both exist in the alignment data; I wonder what the difference between them is. According to the Cambrian paper, in my opinion, the Cambrian-Alignment consists...