
find_pii_and_deidentify example fails

Open · randerzander opened this issue 1 year ago · 1 comment

I'm trying to run the PII example at examples/find_pii_and_deidentify.py.

# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu

# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py
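For context, here is roughly what the example does, pieced together from the tracebacks below. The import paths and PiiModifier arguments are my guesses, not a verbatim copy of the script:

```python
# Rough sketch of examples/find_pii_and_deidentify.py, reconstructed from the
# tracebacks below; import paths and PiiModifier arguments are guesses.
import dask.dataframe as dd
import pandas as pd

from nemo_curator import Modify
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.utils.distributed_utils import get_client

# In the real script the client args come from argparse:
#   _ = get_client(**parse_client_args(arguments))
client = get_client()  # I believe this defaults to a local CPU Dask cluster

# A tiny single-partition dataset (the "frompandas" task in the logs below).
dataframe = pd.DataFrame(
    {"text": ["My name is John Smith", "Call me at 212-555-0123"]}
)
dataset = DocumentDataset(dd.from_pandas(dataframe, npartitions=1))

# Redact PII in the "text" column (the "modify_document" task in the logs).
modifier = PiiModifier(language="en", anonymize_action="replace")  # guessed args
modified_dataset = Modify(modifier)(dataset)

# Writing the output is what triggers the compute, and the failure.
modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
```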

On CPU, I get memory warnings and eventual worker deaths without producing output:

2024-05-28 14:41:18,511 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:19 INFO:Loaded recognizer: EmailRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: PhoneRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: SpacyRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: UsSsnRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: CreditCardRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: IpRecognizer
2024-05-28 14:41:19 WARNING:model_to_presidio_entity_mapping is missing from configuration, using default
2024-05-28 14:41:19 WARNING:low_score_entity_names is missing from configuration, using default
2024-05-28 14:41:22,407 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.68 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:23,165 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 4.27 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:24,134 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:33953 (pid=14243) exceeded 95% memory budget. Restarting...
2024-05-28 14:41:24,471 - distributed.scheduler - ERROR - Task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) marked as failed because 4 workers died while trying to run it
2024-05-28 14:41:24,472 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33953' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('frompandas-f7a591031e0ada9d2c8cba1c8468dd66', 0)} (stimulus_id='handle-worker-cleanup-1716907284.4715889')
Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 48, in console_script
    modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_collection.py", line 2380, in to_json
    return to_json(self, filename, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/dataframe/io/json.py", line 96, in to_json
    return list(dask_compute(*parts, **compute_kwargs))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2232, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:33953. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2024-05-28 14:41:24,778 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:24,959 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.

There's a longer trace, but it's just more restarting workers before the cluster shuts down.
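My read is that the default per-worker limit of 5.25 GiB just isn't enough for the Presidio/spaCy models plus the partition. One experiment I plan to try (my own workaround, not something the example exposes) is creating the Dask client myself, with fewer workers and a larger memory limit, before building the pipeline:

```python
# Workaround sketch (mine, not part of the example): give each Dask worker
# more headroom than the ~5.25 GiB limit the logs show being exceeded.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,             # fewer worker processes, so each gets more host RAM
    threads_per_worker=1,
    memory_limit="16GiB",    # per-worker budget; the default here was 5.25 GiB
)
client = Client(cluster)     # the most recently created client becomes Dask's default
print(client.dashboard_link)
```

I haven't verified yet whether that gets the example all the way through, though.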

In GPU mode, it runs for some time before failing with a PyTorch error:

python examples/find_pii_and_deidentify.py --device gpu

Traceback (most recent call last):
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
    console_script()
  File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 30, in console_script
    _ = get_client(**parse_client_args(arguments))
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 150, in get_client
    return start_dask_gpu_local_cluster(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 75, in start_dask_gpu_local_cluster
    _set_torch_to_use_rmm()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 175, in _set_torch_to_use_rmm
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/torch/cuda/memory.py", line 905, in change_current_allocator
    torch._C._cuda_changeCurrentAllocator(allocator.allocator())
AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'
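The Python-side wrapper clearly exists in this build (it's the frame at torch/cuda/memory.py:905), but the C binding it calls does not, which I'd guess means the rapids env resolved a CPU-only (or otherwise mismatched) torch wheel rather than a CUDA build. A quick check I'm running in that env:

```python
# Quick diagnostic (mine, not from the example): confirm which torch build the
# rapids env resolves to and whether the CUDA allocator binding is present.
import torch

print(torch.__version__)
print(torch.version.cuda)            # None indicates a CPU-only build
print(torch.cuda.is_available())
print(hasattr(torch._C, "_cuda_changeCurrentAllocator"))  # False reproduces the AttributeError
```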

randerzander avatar May 28 '24 14:05 randerzander