data-juicer
[BUG]: inappropriate arguments for `map_batches` in ray mode
Currently, running Data-Juicer on multiple nodes in "ray" mode, which uses map_batches to process datasets, can cause subtle problems.
The map_batches method takes two arguments, num_gpus and concurrency, which are actually cluster-level arguments. However, they are currently calculated automatically from the hardware information of a single machine. As a result, resources may be poorly utilized when running on multiple nodes for OPs whose _accelerator is "cuda".
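To illustrate the mismatch, here is a minimal sketch in plain Python. It is not Data-Juicer's actual code: the helper `concurrency_from_gpus`, the 4-node cluster layout, and the one-GPU-per-worker figure are all assumptions made up for this example. It only shows how a value derived from one machine differs from one derived from the whole cluster.

```python
# Hypothetical sketch: the names below are illustrative, not Data-Juicer's API.

def concurrency_from_gpus(gpu_count: int, gpus_per_worker: float) -> int:
    """Return how many workers fit on `gpu_count` GPUs (at least 1)."""
    if gpus_per_worker <= 0:
        raise ValueError("gpus_per_worker must be positive")
    return max(1, int(gpu_count // gpus_per_worker))


# Assumed example cluster: 4 nodes with 8 GPUs each.
gpus_per_node = [8, 8, 8, 8]
gpus_per_worker = 1.0  # assumed GPU share requested by each worker

# Sizing from a single machine only (the behaviour described in this issue).
single_node = concurrency_from_gpus(gpus_per_node[0], gpus_per_worker)

# Sizing from the cluster total. With Ray, this total could be read from
# ray.cluster_resources().get("GPU", 0) instead of a hard-coded list.
cluster_wide = concurrency_from_gpus(sum(gpus_per_node), gpus_per_worker)

print(single_node, cluster_wide)
```

Under these assumptions the script prints `8 32`: a concurrency computed from one node would leave the GPUs of the other three nodes idle for a "cuda" OP.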