data-juicer Add Operator-Level Parallel Data Processing with Ray Actors

Add Operator-Level Parallel Data Processing with Ray Actors

Open Cccccc0630 opened this issue 3 months ago • 1 comments

This PR introduces significant performance optimizations for parallel data processing using Ray Actors and multi-threading. The following changes have been implemented:

Dynamic Resource Allocation:
- Multiple Actors per operator (OP) are now dynamically created based on resource requirements (CPU/GPU).
- An OP that uses CUDA loads its model onto the available resources when its Actors are created.
Parallel Data Processing:
- A data distribution thread is responsible for distributing batches of data to the first operator's actors, supporting data processing for multiple streams.
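
As a rough illustration of the two ideas above, here is a minimal standard-library sketch, not the code in this PR. Plain threads and queues stand in for Ray Actors so the example runs without a cluster. `OpSpec`, `actors_for`, `distribute`, `run`, and the `max_actors` cap are hypothetical names chosen for this example, and round-robin dispatch is an assumed policy; the PR may distribute batches differently.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class OpSpec:
    """Hypothetical per-operator resource requirement (not the PR's real class)."""
    name: str
    num_cpus: float = 1.0
    num_gpus: float = 0.0


def actors_for(op, free_cpus, free_gpus, max_actors=8):
    """Number of actors of `op` that fit into the free resources.

    Assumption: always create at least one actor, and never more than `max_actors`.
    """
    limits = [max_actors]
    if op.num_cpus > 0:
        limits.append(int(free_cpus // op.num_cpus))
    if op.num_gpus > 0:
        limits.append(int(free_gpus // op.num_gpus))
    return max(1, min(limits))


_STOP = object()  # sentinel telling a worker that no more batches will arrive


def worker(fn, inbox, outbox):
    """Stand-in for one actor: apply `fn` to every sample of each batch."""
    while True:
        batch = inbox.get()
        if batch is _STOP:
            break
        outbox.put([fn(x) for x in batch])


def distribute(batches, inboxes):
    """Body of the distribution thread: hand batches to the actors round-robin."""
    for i, batch in enumerate(batches):
        inboxes[i % len(inboxes)].put(batch)
    for inbox in inboxes:
        inbox.put(_STOP)


def run(op, fn, batches, free_cpus, free_gpus=0.0):
    """Create the workers for `op`, feed them from one thread, collect results."""
    n = actors_for(op, free_cpus, free_gpus)
    inboxes = [queue.Queue() for _ in range(n)]
    outbox = queue.Queue()
    workers = [threading.Thread(target=worker, args=(fn, q, outbox)) for q in inboxes]
    feeder = threading.Thread(target=distribute, args=(batches, inboxes))
    for t in workers + [feeder]:
        t.start()
    for t in workers + [feeder]:
        t.join()
    results = []
    while not outbox.empty():  # safe: every producer thread has finished
        results.extend(outbox.get())
    return n, results
```

Because the workers finish in no particular order, the collected results are not guaranteed to follow the input order. A real pipeline would also chain each operator's output into the next operator's actors, which this sketch leaves out.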

Benefits:

Improved Performance: By utilizing multi-threading and Ray Actors for parallel data processing, the system can handle large volumes of data more efficiently.
Scalability: The dynamic creation of actors based on resource availability allows the system to scale according to the workload.

Future Work (Potential Follow-up PRs):

Support parallel batch processing and implement processing_batched for OPs in the pr_demo.yaml process.
Achieve multi-Actor parallelism on GPUs to improve GPU utilization and SM% (Streaming Multiprocessor percentage).

Experiment:

The figure below shows the end-to-end efficiency improvement for a small number of videos. With larger amounts of data, the advantage of parallel processing should become more apparent.

[Figure: juicer drawio2]

Aug 19 '25 06:08 Cccccc0630
