Add CPU Dask DataFrame support for `DistributedDataClassifier`
Currently, running this notebook with a CPU Dask DataFrame fails with `TypeError: batch_text_or_text_pairs has to be a list or a tuple (got <class 'pandas.core.series.Series'>)`.
To reproduce, use the linked notebook, add

```python
import pandas as pd
import dask.dataframe as dd
```

and replace

```python
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
```

with

```python
input_dataset = DocumentDataset(dd.from_pandas(pd.DataFrame({"text": text}), npartitions=1))
```
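
The error message suggests that the tokenizer is handed a pandas Series where it expects a plain list or tuple. As a rough sketch of the kind of coercion that might sidestep this, here is a small helper; `as_text_list` is a hypothetical name for illustration, not part of NeMo Curator or CrossFit, and where it would be called in the classifier is not determined here.

```python
from typing import Any, List


def as_text_list(column: Any) -> List[str]:
    """Hypothetical helper: coerce a text column into a plain Python list.

    Assumption: the tokenizer only accepts list or tuple input, as the
    TypeError in this issue indicates.
    """
    if isinstance(column, (list, tuple)):
        # Already an accepted type; return a list copy for a uniform result.
        return list(column)
    if hasattr(column, "tolist"):
        # pandas.Series exposes tolist(), which returns a Python list.
        return column.tolist()
    # Fall back to generic iteration for any other iterable.
    return list(column)
```

With something like this, a partition's `df["text"]` could be passed through `as_text_list` before tokenization. This only addresses the input type; it does not make the model itself run on a machine without GPUs.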
I will start scoping this bug, as it is also related to https://github.com/NVIDIA/NeMo-Curator/issues/79.
cc @ayushdg @ryantwolf @VibhuJawa
https://github.com/rapidsai/crossfit/pull/76 adds support for CPU Dask DataFrames, as long as you're working on a machine with GPUs available...
For a machine without GPUs available, we can't use CrossFit. I think we can still do a non-CrossFit implementation similar to what we used to have, though. I will continue working on this and see how it goes.
I am not sure this is a great use of our time right now, because I don't think we should spend time exploring deep learning models on CPU.
Ok, can definitely put this on the backburner for now.
Won't Fix, closing this Bug.