Curator Add CPU Dask DataFrame support for `DistributedDataClassifier`

Currently, running this notebook with a CPU Dask DataFrame fails with `TypeError: batch_text_or_text_pairs has to be a list or a tuple (got <class 'pandas.core.series.Series'>)`.
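The message appears to come from the tokenizer's input validation, which accepts a `list` or `tuple` of strings but not a pandas `Series`. Below is a minimal sketch of that behavior. `check_batch` is a hypothetical stand-in written for illustration, not the tokenizer's actual code; it only mimics the type check and shows that converting the column with `.tolist()` satisfies it.

```python
import pandas as pd


def check_batch(batch_text_or_text_pairs):
    """Hypothetical stand-in that mimics the list/tuple type check."""
    if not isinstance(batch_text_or_text_pairs, (list, tuple)):
        raise TypeError(
            "batch_text_or_text_pairs has to be a list or a tuple "
            f"(got {type(batch_text_or_text_pairs)})"
        )
    return len(batch_text_or_text_pairs)


series = pd.Series(["first document", "second document"])

# Passing the Series directly is rejected by the type check.
try:
    check_batch(series)
    series_accepted = True
except TypeError as err:
    series_accepted = False
    error_message = str(err)

# Converting to a plain Python list passes the check.
n_checked = check_batch(series.tolist())
```

This only explains the symptom; whether a `.tolist()` conversion at the right place is enough for the full classifier pipeline is not something this sketch establishes.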

To reproduce, use the linked notebook, add these imports

import pandas as pd
import dask.dataframe as dd

and replace

df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))

with

input_dataset = DocumentDataset(dd.from_pandas(pd.DataFrame({"text": text}), npartitions=1))

I will start scoping this bug, as it is also related to https://github.com/NVIDIA/NeMo-Curator/issues/79.

cc @ayushdg @ryantwolf @VibhuJawa

Aug 08 '24 21:08 sarahyurick

https://github.com/rapidsai/crossfit/pull/76 adds support for CPU Dask DataFrames, as long as you're working on a machine with GPUs available...

For a machine without GPUs available, we can't use CrossFit. I think we can still do a non-CrossFit implementation similar to what we used to have, though. I will continue working on this and see how it goes.

Aug 13 '24 20:08 sarahyurick

> For a machine without GPUs available, we can't use CrossFit. I think we can still do a non-CrossFit implementation similar to what we used to have, though. I will continue working on this and see how it goes.

I am not sure this is a great use of our time right now, because I don't think we should spend time exploring deep learning models on CPU.

Aug 13 '24 21:08 VibhuJawa

> I am not sure this is a great use of our time right now, because I don't think we should spend time exploring deep learning models on CPU.

OK, I can definitely put this on the back burner for now.

Aug 13 '24 21:08 sarahyurick

Won't fix; closing this bug.

Jan 22 '25 19:01 sithape2025