Does data-juicer plan to support streaming or micro-batch computation?

Open xiedeyantu opened this issue 5 months ago • 8 comments

Search before continuing

  • [x] I have searched the Data-Juicer issues and found no similar feature requests.

Description

For example: kafka->filter1->filter2->mapper1->files or kafka

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

  • [ ] Yes I'd like to help by submitting a PR!

xiedeyantu avatar Jun 10 '25 07:06 xiedeyantu

@HYLcool Is there any plan for this?

xiedeyantu avatar Jun 10 '25 13:06 xiedeyantu

Could you describe your scenario more specifically?

cyruszhang avatar Jun 10 '25 16:06 cyruszhang

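@cyruszhang For example: kafka->filter1->filter2->mapper1->files or kafka. Data is fed into Kafka continuously, so Kafka has to be consumed continuously; the operators keep processing without stopping, and the results are finally written to files (possibly several) or sent on to Kafka again.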
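A minimal sketch of this pipeline shape in plain Python, to make the request concrete. It is not Data-Juicer code: `micro_batches`, `run_pipeline`, `filter1`, `filter2`, and `mapper1` are hypothetical names, the Kafka source is replaced by an in-memory iterator, and the sink is a Python list.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

Sample = dict


def micro_batches(stream: Iterable[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Group a (possibly unbounded) stream into lists of at most batch_size samples."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


def run_pipeline(
    stream: Iterable[Sample],
    filters: List[Callable[[Sample], bool]],
    mappers: List[Callable[[Sample], Sample]],
    sink: Callable[[List[Sample]], None],
    batch_size: int = 2,
) -> None:
    """Apply filters, then mappers, to each micro-batch and pass survivors to the sink."""
    for batch in micro_batches(stream, batch_size):
        for keep in filters:
            batch = [s for s in batch if keep(s)]
        for transform in mappers:
            batch = [transform(s) for s in batch]
        if batch:
            sink(batch)


# Hypothetical operators, standing in for real filter and mapper ops.
def filter1(sample: Sample) -> bool:
    return len(sample["text"]) > 0


def filter2(sample: Sample) -> bool:
    return "spam" not in sample["text"]


def mapper1(sample: Sample) -> Sample:
    return {**sample, "text": sample["text"].strip().lower()}


# In-memory stand-in for a Kafka topic.
source = iter([
    {"text": " Hello "},
    {"text": ""},
    {"text": "spam offer"},
    {"text": "World"},
    {"text": " Data-Juicer "},
])
outputs: List[List[Sample]] = []  # stand-in for files or an output topic
run_pipeline(source, [filter1, filter2], [mapper1], outputs.append, batch_size=2)
```

With a real unbounded source, `islice` would block until a full batch arrives, so a production version would likely also flush on a timer.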

xiedeyantu avatar Jun 10 '25 22:06 xiedeyantu

@xiedeyantu We are currently developing support for similar streaming/micro-batch scenarios in another open-source library focused on reinforcement learning, Trinity-RFT. The implementation is expected to utilize dedicated Ray Actors that serve as proxies for unified data buffer read/write services (providing sequential/random Readers and Queue/DB-based Writers).

After that, we may consider generalizing this feature and implementing it back into the data-juicer library.

Of course, we welcome any other suggestions. If you'd like to help implement this functionality as a contributor :), we'll continue to provide assistance and discussion. Feel free to comment/discuss @cyruszhang @HYLcool @pan-x-c
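As a rough illustration of the buffer-proxy idea described above (one component owns the buffer and serves both writers and readers), here is a sketch using only the Python standard library. It is not Trinity-RFT or Data-Juicer code: `BufferProxy` and its methods are hypothetical, and a thread-safe `queue.Queue` stands in for a dedicated Ray Actor.

```python
import queue
import threading
from typing import Any, Iterable, List


class BufferProxy:
    """Stand-in for a dedicated actor that owns the buffer and serves reads and writes."""

    def __init__(self, maxsize: int = 0) -> None:
        # maxsize=0 means unbounded; a bounded queue would apply backpressure to writers.
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=maxsize)

    def write(self, items: Iterable[Any]) -> None:
        """Queue-based writer: append items in arrival order."""
        for item in items:
            self._queue.put(item)

    def read(self, batch_size: int, timeout: float = 0.1) -> List[Any]:
        """Sequential reader: return up to batch_size items, fewer if the wait times out."""
        batch: List[Any] = []
        while len(batch) < batch_size:
            try:
                batch.append(self._queue.get(timeout=timeout))
            except queue.Empty:
                break  # flush a partial micro-batch instead of waiting forever
        return batch


buffer = BufferProxy()
producer = threading.Thread(target=buffer.write, args=(range(5),))
producer.start()
producer.join()

first = buffer.read(batch_size=3)   # full micro-batch
second = buffer.read(batch_size=3)  # partial micro-batch after the timeout
```

The timeout in `read` is what turns a continuous stream into micro-batches: a reader gets a full batch when data is plentiful and a partial one when the stream is slow.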

yxdyc avatar Jun 11 '25 07:06 yxdyc

@yxdyc Thank you very much for the detailed reply; it is very helpful. A follow-up question: since the Trinity-RFT project is quite complex, will you extract the part that supports streaming or micro-batching and integrate it into data-juicer, or will data-juicer get streaming or micro-batching support in some other way?

xiedeyantu avatar Jun 11 '25 07:06 xiedeyantu

Once we complete the related implementation in Trinity-RFT (expected within the next week), we will identify the specific code files/lines for your reference. We plan to extract these components and integrate them into Data-Juicer in the coming weeks. Alternatively, you are welcome to implement the Kafka streaming feature before our future integration.

yxdyc avatar Jun 14 '25 08:06 yxdyc

@yxdyc Great job! Looking forward to your reply!

xiedeyantu avatar Jun 14 '25 09:06 xiedeyantu

Is the above update already scheduled?

Cccccc0630 avatar Jul 01 '25 07:07 Cccccc0630