
`LookupError` not caught during Encoding handling

Open ggcr opened this issue 11 months ago • 4 comments

Describe the bug

In the Data curation for DAPT tutorial (tutorials/dapt-curation), attempting to decode a file with an encoding that is not supported by the system (e.g., Vietnamese VISCII in this case) raises a LookupError, which the exception handling does not currently catch. The program fails unexpectedly, so the whole repo goes unparsed.
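
The failure mode can be illustrated outside the tutorial with a minimal sketch. It assumes only standard Python behavior: an unregistered codec name raises LookupError, which is not a subclass of UnicodeDecodeError, so a handler for the latter does not catch it. The helper name decode_error_type and the sample bytes are hypothetical and not taken from the tutorial code.

```python
def decode_error_type(data: bytes, encoding: str) -> str:
    """Return the name of the exception raised by bytes.decode, or 'ok'."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        # Known codec, but the bytes are invalid for it.
        return "UnicodeDecodeError"
    except LookupError:
        # Codec name is not registered in this Python installation.
        return "LookupError"
    return "ok"


# Arbitrary sample inputs, not the file from the reproduction repo.
print(decode_error_type(b"hello", "VISCII"))
print(decode_error_type(b"\xff", "utf-8"))
print(decode_error_type(b"hello", "utf-8"))
```

Because the two exception types sit on separate branches of the exception hierarchy, each one has to be named explicitly in the except clause.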

Steps/Code to reproduce bug

I have created a repo that contains only the file that triggers this error, available at ggcr/nvidia-nemo-error-report. To reproduce, I follow these steps:

  1. Clone NeMo-Curator:
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/
  2. Add the GitHub repo, which contains a standalone file made to reproduce this issue, to the list of repos to curate:
$ echo '"ggcr/nvidia-nemo-error-report"' >> tutorials/dapt-curation/code/sources/github_repos.jsonl
  3. Run the tutorial:
$ cd tutorials/dapt-curation/code
$ python3 main.py --n-workers 2

In my case, this run produces the following output:

Args:  Namespace(device='cpu', files_per_partition=2, n_workers=2, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1)
Download directory:  /private/tmp/NeMo-Curator/tutorials/dapt-curation/code/data/raw/wikipedia
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/HVM'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Parallel%20computing'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Number%20Assignment%20Module'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Separation%20of%20concerns'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Operand%20forwarding'...
...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Memory%20rank'...
Traceback (most recent call last):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 257, in <module>
    main()
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 240, in main
    text_files, code_files = download_sources(100, 100, 100)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 73, in download_sources
    github_dir = download_github_sources(
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/downloaders.py", line 168, in download_github_sources
    dataset.persist()
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/datasets/doc_dataset.py", line 38, in persist
    return DocumentDataset(self.df.persist())
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask_expr/_collection.py", line 447, in persist
    return DaskMethodsMixin.persist(out, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 345, in persist
    (result,) = persist(self, traverse=False, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 999, in persist
    results = schedule(dsk, keys, **kwargs)
  File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/download/doc_builder.py", line 127, in _download_and_extract_single_partition
    for item in iterator.iterate(downloaded_file):
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 335, in iterate
    parsed = self.parse_file(zip_ref, file_info)
  File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 285, in parse_file
    content = content.decode(encoding)
LookupError: unknown encoding: VISCII

Proposed solution

In the current implementation of parse_file, the exception handling only catches UnicodeDecodeError.

https://github.com/NVIDIA/NeMo-Curator/blob/7272ca04c2ec2255203c430412798071444e8bb4/tutorials/dapt-curation/code/docbuilder.py#L275-L288

This can be updated to also catch LookupError.

except (UnicodeDecodeError, LookupError):
    return None
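
As a rough sketch of how the broadened handler behaves, the decoding step could be isolated in a small helper like the one below. The name safe_decode and its signature are hypothetical; the real parse_file in docbuilder.py is structured differently, and only the except clause mirrors the proposal above.

```python
from typing import Optional


def safe_decode(content: bytes, encoding: Optional[str]) -> Optional[str]:
    """Decode bytes, returning None when the file should be skipped.

    Hypothetical helper for illustration; not part of NeMo-Curator.
    """
    if not encoding:
        # No encoding was supplied, so there is nothing to decode with.
        return None
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: bytes are invalid for a known codec.
        # LookupError: the codec name itself is unknown (e.g. VISCII).
        return None


# An unknown codec now skips the file instead of aborting the run.
print(safe_decode(b"abc", "VISCII"))
```

Returning None for both cases keeps the caller's logic unchanged: a file that cannot be decoded is skipped, and the rest of the archive is still processed.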

Environment overview

  • Environment location: Local (MacBook M3)
  • Method of NeMo-Curator install: conda create new env with python 3.10 and pip install

Environment details

  • OS version: macOS 14.3
  • Dask version: dask 2024.12.0
  • Python version: Python 3.10.15

ggcr, Dec 06 '24 06:12