When I run fineweb-edu classifier scoring, how can I overlap tokenization with model inference?
I find that while tokenization is running, GPU utilization is always zero.
Thanks for raising this @HuaYZhao. In our current setup I don't think there's an easy way to pipeline or overlap the tokenization with the inference, but we are looking internally into other approaches for pipelining this better and improving throughput.
There might be a way to pipeline by spinning up both CPU-based (dask worker) and GPU-based (dask-cuda) workers with different resource annotations, and manually annotating the tokenization and inference tasks with the matching resource, but it isn't straightforward to do in the current setup (see the sketch below).
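For anyone who wants to experiment before official support lands, here is a rough sketch of that idea. It is untested and not part of Curator: the resource names (`TOKENIZE`, `INFER`), the scheduler address, the paths, and the `tokenize_partition`/`infer_partition` helpers are all hypothetical placeholders.

```python
# Start one scheduler, CPU-only workers tagged for tokenization, and
# dask-cuda workers tagged for inference (resource names are made up):
#
#   dask scheduler
#   dask worker tcp://scheduler-host:8786 --nworkers 8 --resources "TOKENIZE=1"
#   dask cuda worker tcp://scheduler-host:8786 --resources "INFER=1"

import dask
import dask.dataframe as dd
from dask.distributed import Client

def tokenize_partition(df):
    # Hypothetical stand-in: a real version would run an HF tokenizer here.
    df["tokens"] = df["text"].str.split()
    return df

def infer_partition(df):
    # Hypothetical stand-in: a real version would batch tokens through the model.
    df["score"] = 0.0
    return df

client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address

ddf = dd.read_parquet("input/")  # placeholder path with a "text" column

with dask.annotate(resources={"TOKENIZE": 1}):  # pin to CPU-only workers
    tokenized = ddf.map_partitions(tokenize_partition)

with dask.annotate(resources={"INFER": 1}):     # pin to dask-cuda workers
    scored = tokenized.map_partitions(infer_partition)

scored.to_parquet("scored/")
```

One caveat that contributes to the "not straightforward" part: dask's graph optimization (e.g. task fusion) can drop annotations on dataframe collections, so the resource pinning may not survive without extra configuration.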
Will there be an update on this in the future or soon?
We plan to enable this in the mid to long term (in a few months). It is not planned for the short term.
We will achieve this with Ray! Tokenization will be entirely on the CPU and model inference entirely on the GPU. Multiple batches of data will flow through the pipeline at a time, so when one batch finishes tokenization on the CPU, it can immediately be handed off to inference on the GPU.
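For illustration, here is a minimal sketch of that pattern using Ray Data's streaming execution. This is not the actual PR implementation; the model id, paths, batch sizes, and stage classes are assumptions for the example.

```python
import ray
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "HuggingFaceFW/fineweb-edu-classifier"  # placeholder model id

class Tokenize:
    """CPU stage: tokenize raw text into fixed-length input arrays."""
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def __call__(self, batch):
        enc = self.tokenizer(
            [str(t) for t in batch["text"]],
            truncation=True,
            padding="max_length",
            max_length=512,
            return_tensors="np",
        )
        batch["input_ids"] = enc["input_ids"]
        batch["attention_mask"] = enc["attention_mask"]
        return batch

class Score:
    """GPU stage: run the classifier on already-tokenized batches."""
    def __init__(self):
        self.model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
        self.model.eval().to("cuda")

    def __call__(self, batch):
        with torch.no_grad():
            logits = self.model(
                input_ids=torch.as_tensor(batch["input_ids"], device="cuda"),
                attention_mask=torch.as_tensor(batch["attention_mask"], device="cuda"),
            ).logits
        batch["score"] = logits.squeeze(-1).float().cpu().numpy()
        return batch

ds = ray.data.read_parquet("input/")  # placeholder path; expects a "text" column

# Ray Data's streaming executor runs both stages concurrently: while the GPU
# actor scores one batch, the CPU actors are already tokenizing the next ones.
ds = ds.map_batches(Tokenize, batch_size=256, concurrency=8)           # CPU-only
ds = ds.map_batches(Score, batch_size=256, concurrency=1, num_gpus=1)  # GPU

ds.write_parquet("scored/")  # placeholder output path
```

Because the two `map_batches` stages run as separate actor pools, the CPU tokenizers stay busy while the GPU is scoring, which is exactly what keeps GPU utilization from sitting at zero during tokenization.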
This will be closed by: https://github.com/NVIDIA-NeMo/Curator/pull/753.