NeMo-Curator
`LookupError` not caught during Encoding handling
Describe the bug
In the "Data curation for DAPT" tutorial (tutorials/dapt-curation), attempting to decode a file with an encoding that is not supported by the system (here, Vietnamese VISCII) raises a LookupError. The existing exception handling does not catch it, so the program fails unexpectedly and the whole repo is left unparsed.
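The two failure modes can be told apart in isolation. The following is a minimal sketch, assuming a standard CPython build with no VISCII codec (as the traceback below reports); the helper name `classify_decode_failure` is hypothetical and not part of NeMo-Curator:

```python
def classify_decode_failure(data: bytes, encoding: str) -> str:
    """Report which exception, if any, bytes.decode raises for this input."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        # The codec exists, but the bytes are not valid in it.
        return "UnicodeDecodeError"
    except LookupError:
        # No codec is registered under this name in this Python build.
        return "LookupError"
    return "ok"


# The two exception types are unrelated, so a handler written for
# UnicodeDecodeError alone lets LookupError propagate.
unrelated = not issubclass(LookupError, UnicodeDecodeError)

print(classify_decode_failure(b"abc", "VISCII"))
print(classify_decode_failure(b"\xff", "utf-8"))
```

Because `LookupError` does not derive from `UnicodeDecodeError`, an `except UnicodeDecodeError` clause on its own cannot handle an unknown encoding name.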
Steps/Code to reproduce bug
I have created a repo that contains only the file that triggers this error, available at ggcr/nvidia-nemo-error-report. To reproduce, follow these steps:
- Clone NeMo-Curator
$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/
- Add the GitHub repo containing the standalone reproduction file to the list of repos to curate:
$ echo '"ggcr/nvidia-nemo-error-report"' >> tutorials/dapt-curation/code/sources/github_repos.jsonl
- Run the tutorial:
$ cd tutorials/dapt-curation/code
$ python3 main.py --n-workers 2
In my case, the run produces the following output:
Args: Namespace(device='cpu', files_per_partition=2, n_workers=2, num_files=None, nvlink_only=False, protocol='tcp', rmm_pool_size=None, scheduler_address=None, scheduler_file=None, threads_per_worker=1)
Download directory: /private/tmp/NeMo-Curator/tutorials/dapt-curation/code/data/raw/wikipedia
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/HVM'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Parallel%20computing'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Number%20Assignment%20Module'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Separation%20of%20concerns'...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Operand%20forwarding'...
...
Downloading txt URLs data from 'https://en.wikipedia.org/wiki/Memory%20rank'...
Traceback (most recent call last):
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 257, in <module>
main()
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 240, in main
text_files, code_files = download_sources(100, 100, 100)
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 73, in download_sources
github_dir = download_github_sources(
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/downloaders.py", line 168, in download_github_sources
dataset.persist()
File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/datasets/doc_dataset.py", line 38, in persist
return DocumentDataset(self.df.persist())
File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask_expr/_collection.py", line 447, in persist
return DaskMethodsMixin.persist(out, **kwargs)
File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 345, in persist
(result,) = persist(self, traverse=False, **kwargs)
File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 999, in persist
results = schedule(dsk, keys, **kwargs)
File "/Users/ggcr/miniforge3/envs/nemo/lib/python3.10/site-packages/nemo_curator/download/doc_builder.py", line 127, in _download_and_extract_single_partition
for item in iterator.iterate(downloaded_file):
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 335, in iterate
parsed = self.parse_file(zip_ref, file_info)
File "/private/tmp/NeMo-Curator/tutorials/dapt-curation/code/docbuilder.py", line 285, in parse_file
content = content.decode(encoding)
LookupError: unknown encoding: VISCII
Proposed solution
In the current implementation of parse_file, the exception handling only catches UnicodeDecodeError.
https://github.com/NVIDIA/NeMo-Curator/blob/7272ca04c2ec2255203c430412798071444e8bb4/tutorials/dapt-curation/code/docbuilder.py#L275-L288
This can be updated to also catch LookupError.
except (UnicodeDecodeError, LookupError):
    return None
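As a rough illustration of the proposed change, here is a self-contained sketch. It is not the actual parse_file code: the function `decode_or_skip` and the sample inputs are hypothetical, and only the `content.decode(encoding)` call and the `except` clause are taken from this report.

```python
from typing import Optional


def decode_or_skip(content: bytes, encoding: str) -> Optional[str]:
    """Decode raw file bytes, returning None when the file should be skipped."""
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in this encoding.
        # LookupError: Python has no codec registered under this name.
        return None


# Hypothetical inputs: one decodable file and one reported as VISCII.
samples = [(b"print('hi')", "utf-8"), (b"xin ch\xe0o", "VISCII")]
decoded = [decode_or_skip(data, enc) for data, enc in samples]

# Only the undecodable file is dropped; the rest of the repo is still parsed.
kept = [text for text in decoded if text is not None]
```

With this handling, a single file in an unsupported encoding is skipped in the same way as a file that fails to decode, rather than aborting the whole download step.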
Environment overview (please complete the following information)
- Environment location: Local (MacBook M3)
- Method of NeMo-Curator install: conda create new env with python 3.10 and pip install
Environment details
- OS version: macOS 14.3
- Dask version: dask 2024.12.0
- Python version: Python 3.10.15