data-juicer
Add Operator-Level Parallel Data Processing with Ray Actors
This PR introduces significant performance optimizations for parallel data processing using Ray Actors and multi-threading. The following changes have been implemented:
- Dynamic Resource Allocation:
  - Multiple Actors per operator (OP) are now created dynamically based on the OP's resource requirements (CPU/GPU).
  - An OP that uses CUDA loads its model onto the available resources when its Actors are created.
- Parallel Data Processing:
  - A data distribution thread distributes batches of data to the first operator's Actors, supporting data processing for multiple streams.
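
The two changes above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: plain threads stand in for Ray Actors, all names (`plan_actor_count`, `distribute`, `run_pipeline`) are hypothetical, and both the sizing rule and the round-robin dispatch policy are assumptions made for illustration.

```python
import math
import queue
import threading

_STOP = object()  # sentinel: tells a worker that no more batches will arrive


def plan_actor_count(available_cpus, available_gpus, cpus_per_actor, gpus_per_actor=0.0):
    """Hypothetical sizing rule: how many actors fit into the available resources."""
    if cpus_per_actor <= 0 and gpus_per_actor <= 0:
        raise ValueError("an actor must require some CPU or GPU")
    limits = []
    if cpus_per_actor > 0:
        limits.append(math.floor(available_cpus / cpus_per_actor))
    if gpus_per_actor > 0:
        limits.append(math.floor(available_gpus / gpus_per_actor))
    return max(0, min(limits))


def worker(inbox, results, op):
    """Stand-in for one Actor of the first OP: apply `op` to every sample of a batch."""
    while True:
        item = inbox.get()
        if item is _STOP:
            break
        index, batch = item
        results.put((index, [op(sample) for sample in batch]))


def distribute(batches, inboxes):
    """Body of the distribution thread: hand batches to workers round-robin (assumed policy)."""
    for index, batch in enumerate(batches):
        inboxes[index % len(inboxes)].put((index, batch))
    for inbox in inboxes:
        inbox.put(_STOP)


def run_pipeline(batches, op, num_actors):
    """Run `op` over all batches with `num_actors` workers; return results in input order."""
    if num_actors < 1:
        raise ValueError("need at least one actor")
    inboxes = [queue.Queue() for _ in range(num_actors)]
    results = queue.Queue()
    workers = [threading.Thread(target=worker, args=(inbox, results, op)) for inbox in inboxes]
    distributor = threading.Thread(target=distribute, args=(batches, inboxes))
    for thread in workers:
        thread.start()
    distributor.start()
    distributor.join()
    for thread in workers:
        thread.join()
    # All threads have finished, so the results queue can be drained safely.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return [output for _, output in sorted(collected, key=lambda pair: pair[0])]
```

For example, `run_pipeline(batches, op, plan_actor_count(8, 0, 2))` would process the batches with four workers. In a Ray-based version, each worker would presumably be an Actor created with its own CPU/GPU requirements rather than a thread.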
Benefits:
- Improved Performance: By utilizing multi-threading and Ray Actors for parallel data processing, the system can handle large volumes of data more efficiently.
- Scalability: The dynamic creation of actors based on resource availability allows the system to scale according to the workload.
Future Work (Potential Follow-up PRs):
- Support parallel batch processing and implement processing_batched for OPs in the pr_demo.yaml process.
- Achieve multi-Actor parallelism on GPUs to improve GPU utilization and SM% (Streaming Multiprocessor percentage).
Experiment:
The end-to-end efficiency improvement for a small number of videos is shown in the figure below. With more data, the advantages of parallel processing should become more pronounced.