
Add Operator-Level Parallel Data Processing with Ray Actors

Open Cccccc0630 opened this issue 3 months ago • 1 comment

This PR introduces significant performance optimizations for parallel data processing using Ray Actors and multi-threading. The following changes have been implemented:

  1. Dynamic Resource Allocation:

    • Multiple Actors per operator (OP) are now dynamically created based on resource requirements (CPU/GPU).
    • An OP that uses CUDA loads its model onto the GPU resources allocated to it when each of its Actors is created.
  2. Parallel Data Processing:

    • A dedicated data distribution thread sends batches of data to the first operator's Actors, which allows multiple data streams to be processed concurrently.
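
The distribution step described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `StubActor`, `distribute`, and `run_pipeline` are hypothetical names, plain Python threads stand in for Ray Actors so the example runs without a cluster, and the round-robin policy is an assumption (the PR does not state how batches are assigned to Actors).

```python
import queue
import threading


class StubActor:
    """Hypothetical stand-in for one Ray Actor of the first OP."""

    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()
        self.processed = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Consume batches until the sentinel (None) arrives.
        while True:
            batch = self.inbox.get()
            if batch is None:
                break
            # Placeholder work: a real OP would filter or map the samples.
            self.processed.append([x * 2 for x in batch])

    def join(self):
        self._thread.join()


def make_batches(samples, batch_size):
    """Split the input into consecutive batches of at most batch_size."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]


def distribute(samples, actors, batch_size):
    """Body of the distribution thread: round-robin batches (assumed policy)."""
    for i, batch in enumerate(make_batches(samples, batch_size)):
        actors[i % len(actors)].inbox.put(batch)
    # Tell every actor that no more data is coming.
    for actor in actors:
        actor.inbox.put(None)


def run_pipeline(samples, num_actors=2, batch_size=3):
    actors = [StubActor(f"op0-actor{i}") for i in range(num_actors)]
    distributor = threading.Thread(
        target=distribute, args=(samples, actors, batch_size)
    )
    distributor.start()
    distributor.join()
    for actor in actors:
        actor.join()
    return {actor.name: actor.processed for actor in actors}
```

Calling `run_pipeline(list(range(10)))` splits ten samples into four batches and alternates them between the two stand-in actors. With real Ray Actors, the `inbox.put` call would presumably become a remote method call on the Actor handle.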

Benefits:

  • Improved Performance: By utilizing multi-threading and Ray Actors for parallel data processing, the system can handle large volumes of data more efficiently.
  • Scalability: The dynamic creation of actors based on resource availability allows the system to scale according to the workload.
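
As a rough illustration of how the number of Actors per OP might follow from resource requirements, the sketch below divides the free CPUs and GPUs by what one Actor of the OP needs. `OpRequirement`, `actors_for_op`, and the `max_actors` cap are hypothetical names introduced here; the actual allocation logic in this PR may differ.

```python
from dataclasses import dataclass


@dataclass
class OpRequirement:
    """Hypothetical per-Actor resource needs of one OP."""
    name: str
    num_cpus: float = 1.0
    num_gpus: float = 0.0


def actors_for_op(op, free_cpus, free_gpus, max_actors=None):
    """Return how many Actors of `op` fit into the free resources."""
    limits = []
    if op.num_cpus > 0:
        limits.append(int(free_cpus // op.num_cpus))
    if op.num_gpus > 0:
        limits.append(int(free_gpus // op.num_gpus))
    # An OP that requests nothing gets no Actors in this sketch.
    count = min(limits) if limits else 0
    if max_actors is not None:
        count = min(count, max_actors)
    return max(count, 0)


cpu_op = OpRequirement("cpu_op", num_cpus=2)
gpu_op = OpRequirement("gpu_op", num_cpus=1, num_gpus=1)

print(actors_for_op(cpu_op, free_cpus=8, free_gpus=0))  # 4
print(actors_for_op(gpu_op, free_cpus=8, free_gpus=2))  # 2
```

In a Ray deployment, the free totals could be read from `ray.available_resources()`, and each Actor could be created with `.options(num_cpus=..., num_gpus=...)` so that the scheduler enforces the same limits.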

Future Work (Potential Follow-up PRs):

  • Support parallel batch processing and implement processing_batched for OPs in the pr_demo.yaml process.
  • Achieve multi-Actor parallelism on GPUs to improve GPU utilization and SM% (Streaming Multiprocessor percentage).

Experiment:

The figure below shows the end-to-end efficiency gain on a small number of videos. The advantage of parallel processing should become more pronounced as the amount of data increases.

[Figure: juicer drawio2]

Cccccc0630 • Aug 19 '25 06:08