PII Modifier should work with `DocumentDataset` on cudf

Open praateekmahajan opened this issue 11 months ago • 0 comments

Is your feature request related to a problem? Please describe.

(Not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset filtering, e.g. in #417.)

I noticed an oddity in the PII examples / scripts / docs: PII doesn't work when we do `DocDataset.read_*(backend="cudf")`, given that:

  1. We call `text.tolist()` here
  2. `cudf.Series` doesn't support `tolist()` (here)

All of the examples / scripts / docs read the dataset using dask (pandas) but pass `device='gpu'` to the Modifier.

Describe the solution you'd like

The code works with `DocumentDataset('cudf')`. I think we might just need `to_pyarrow().tolist()` when the series is a cudf type.
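A minimal sketch of what such a conversion could look like, not the actual NeMo-Curator code. The helper name `series_to_list` is hypothetical. It assumes cuDF's Arrow conversion is spelled `cudf.Series.to_arrow()` (rather than `to_pyarrow()`), followed by pyarrow's `to_pylist()`. The branch is duck-typed on `to_arrow` so the snippet runs without cudf installed:

```python
from typing import Any, List


def series_to_list(series: Any) -> List[Any]:
    """Return the values of a pandas-like or cuDF-like series as a Python list.

    Hypothetical helper for illustration only.
    """
    # cudf.Series does not support tolist(), so route through Arrow instead.
    # Assumption: a cuDF series exposes to_arrow(), which returns a
    # pyarrow.Array that provides to_pylist().
    if hasattr(series, "to_arrow"):
        return series.to_arrow().to_pylist()
    # pandas.Series (the default dask backend) supports tolist() directly.
    return series.tolist()
```

Inside the modifier, the existing `text.tolist()` call would then become something like `series_to_list(text)`, leaving the pandas path unchanged.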

praateekmahajan, Dec 10 '24 16:12