NeMo-Curator
PII Modifier should work with `DocumentDataset` on cuDF
Is your feature request related to a problem? Please describe.
(Not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset filtering, e.g. in #417.)
I noticed an oddity in the PII examples, scripts, and docs: PII modification doesn't work when the dataset is read with `DocumentDataset.read_*(backend="cudf")`.
All of the examples, scripts, and docs read the dataset using Dask (pandas), yet pass `device='gpu'` to the Modifier.
Describe the solution you'd like
The PII code should work with a cuDF-backed `DocumentDataset`.
I think we might just need `to_pyarrow().tolist()` when the series is a cuDF type.
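A rough sketch of that idea, not the actual NeMo-Curator code: `series_to_pylist` is a hypothetical helper name. It assumes the cuDF series exposes `to_arrow()` (the method name as I understand cuDF's API) and that the resulting Arrow array offers `to_pylist()`. Detecting cuDF by module name is also just an assumption, made so the helper does not have to import cudf.

```python
from typing import Any, List


def series_to_pylist(series: Any) -> List[Any]:
    """Return the values of a pandas- or cuDF-like series as a Python list.

    Hypothetical helper for illustration; not part of NeMo-Curator.
    """
    # Detect cuDF objects by their top-level module name so that this
    # function does not need cudf installed (assumption for this sketch).
    top_level_module = type(series).__module__.split(".")[0]
    if top_level_module == "cudf":
        # cuDF avoids implicit device-to-host copies, so convert through
        # Arrow explicitly and then build the Python list on the host.
        return series.to_arrow().to_pylist()
    # pandas series support tolist() directly.
    return series.tolist()
```

The modifier could then call something like this on the incoming batch before handing the text to the deidentifier, so the same code path serves both backends.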