find_pii_and_deidentify example fails
I'm trying to run the PII example (examples/find_pii_and_deidentify.py).
```bash
# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu
# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py
```
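For context, the core of that example (paraphrased from memory, so the exact imports and parameter names below may differ from the script) boils down to building a tiny pandas DataFrame, wrapping it in a DocumentDataset, and running the PII modifier over it before writing JSONL:

```python
# Rough paraphrase of the example, not the actual script; PiiModifier
# arguments are taken from the NeMo-Curator docs as I remember them.
import dask.dataframe as dd
import pandas as pd

from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.pii_modifier import PiiModifier

dataframe = pd.DataFrame(
    {"text": ["Sarah and Ryan went out to play", "Jensen is the CEO of NVIDIA"]}
)
dataset = DocumentDataset(dd.from_pandas(dataframe, npartitions=1))

modifier = PiiModifier(
    language="en",
    supported_entities=["PERSON", "EMAIL_ADDRESS"],
    anonymize_action="replace",
    device="cpu",
)
modified_dataset = Modify(modifier)(dataset)
modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
```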
On CPU, I get memory warnings and eventual worker deaths without producing output:
```
2024-05-28 14:41:18,511 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:19 INFO:Loaded recognizer: EmailRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: PhoneRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: SpacyRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: UsSsnRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: CreditCardRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: IpRecognizer
2024-05-28 14:41:19 WARNING:model_to_presidio_entity_mapping is missing from configuration, using default
2024-05-28 14:41:19 WARNING:low_score_entity_names is missing from configuration, using default
2024-05-28 14:41:22,407 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.68 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:23,165 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 4.27 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:24,134 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:33953 (pid=14243) exceeded 95% memory budget. Restarting...
2024-05-28 14:41:24,471 - distributed.scheduler - ERROR - Task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) marked as failed because 4 workers died while trying to run it
2024-05-28 14:41:24,472 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33953' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('frompandas-f7a591031e0ada9d2c8cba1c8468dd66', 0)} (stimulus_id='handle-worker-cleanup-1716907284.4715889')
Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 48, in console_script
    modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_collection.py", line 2380, in to_json
    return to_json(self, filename, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/dataframe/io/json.py", line 96, in to_json
    return list(dask_compute(*parts, **compute_kwargs))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:33953. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2024-05-28 14:41:24,778 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:24,959 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
```
There's a longer trace, but it's just more worker restarts before the cluster shuts down.
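The per-worker limit in those logs is only ~5.25 GiB, so one thing I plan to try is giving the workers a bigger budget by starting the cluster myself and attaching the example to it (assuming get_client can connect to an existing scheduler; if not, I'd tweak the example to reuse this client). Minimal sketch using only dask.distributed:

```python
# Hypothetical workaround sketch, not NeMo-Curator API: start a local CPU
# cluster whose workers get more memory than the ~5.25 GiB limit seen above.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="16GB")
client = Client(cluster)
print(cluster.scheduler_address)  # address to hand to the example, if supported
```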
In GPU mode, it runs for a while before failing with a PyTorch error:
```
python examples/find_pii_and_deidentify.py --device gpu
Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 30, in console_script
    _ = get_client(**parse_client_args(arguments))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 150, in get_client
    return start_dask_gpu_local_cluster(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 75, in start_dask_gpu_local_cluster
    _set_torch_to_use_rmm()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 175, in _set_torch_to_use_rmm
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/torch/cuda/memory.py", line 905, in change_current_allocator
    torch._C._cuda_changeCurrentAllocator(allocator.allocator())
AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'
```
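The GPU failure happens before any PII work starts, while NeMo-Curator tries to install RMM as the torch CUDA allocator. As far as I know, torch.cuda.memory.change_current_allocator is part of the pluggable allocator API added in PyTorch 2.0, and the missing torch._C._cuda_changeCurrentAllocator suggests the compiled extension of the installed torch doesn't expose it (an older or mismatched build, or one without CUDA allocator support). A quick diagnostic I'm running (plain PyTorch, nothing NeMo-Curator specific):

```python
import torch

print(torch.__version__)
print(torch.version.cuda)  # None would indicate a CPU-only build
# The RMM hook calls torch.cuda.memory.change_current_allocator, which in turn
# needs the private torch._C._cuda_changeCurrentAllocator binding.
print(hasattr(torch.cuda.memory, "change_current_allocator"))
print(hasattr(torch._C, "_cuda_changeCurrentAllocator"))
```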