NeMo-Curator
DocumentDataset read errors when other files are present in directory
Describe the bug
DocumentDataset.read_parquet and DocumentDataset.read_json fail with unrelated errors when reading directories that contain files other than JSONL or Parquet. For example, Apache Spark jobs that write data to a directory typically also write marker files (such as _SUCCESS) and CRC checksum files. When DocumentDataset reads from such a directory, the reported error is NotADirectoryError: [Errno 20] Not a directory: '._SUCCESS.crc', which is misleading because it does not point to the actual cause.
Steps/Code to reproduce bug
- Create an empty file with a .crc or other extension (anything other than JSONL/Parquet) and place it in the data directory alongside the JSONL or Parquet files.
- Run the following, which fails:
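from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under
input_files = get_all_files_paths_under("<directory>")  # replace with the data directory path
input_dataset = DocumentDataset.read_parquet(input_files, backend='cudf')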
It raises an error similar to:
NotADirectoryError: [Errno 20] Not a directory: '._SUCCESS.crc'
The message is confusing because the input to the read_parquet method is a list of files, not a directory.
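As a stopgap on the caller's side, the file list can be filtered by extension before it is handed to the reader. The snippet below is only a sketch using the standard library: `keep_data_files` is a hypothetical helper (not part of NeMo-Curator), and the paths are made-up examples of what a Spark output directory might contain.

```python
import os


def keep_data_files(paths, extensions=(".parquet",)):
    """Return only the paths whose file name ends with one of `extensions`.

    Hypothetical helper for illustration; it is not a NeMo-Curator API.
    The comparison is case-insensitive.
    """
    allowed = tuple(ext.lower() for ext in extensions)
    return [p for p in paths if os.path.basename(p).lower().endswith(allowed)]


# Made-up listing resembling a Spark output directory.
all_paths = [
    "data/part-00000.parquet",
    "data/.part-00000.parquet.crc",
    "data/_SUCCESS",
    "data/._SUCCESS.crc",
]

# Only the real data file survives; markers and CRC files are dropped.
parquet_paths = keep_data_files(all_paths)
print(parquet_paths)
```

The filtered list can then be passed to DocumentDataset.read_parquet in place of the raw output of get_all_files_paths_under. For JSONL inputs, the same helper would be called with `extensions=(".jsonl",)`.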
Expected behavior
DocumentDataset read methods could either ignore these marker files or report a more informative error message.
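To make the two options concrete, here is a rough sketch of what such a check might look like. It is an illustration under assumptions, not the actual NeMo-Curator implementation: the function name `select_readable_files` and its `strict` flag are hypothetical, and it relies only on the standard library.

```python
import os
import warnings


def select_readable_files(paths, file_type="parquet", strict=False):
    """Split `paths` into files matching `file_type` and everything else.

    Hypothetical helper, not a NeMo-Curator API.
    strict=False: ignore non-matching files and emit a warning.
    strict=True: raise a ValueError that names an offending file.
    """
    suffix = "." + file_type.lower()
    matched, skipped = [], []
    for path in paths:
        name = os.path.basename(path).lower()
        (matched if name.endswith(suffix) else skipped).append(path)

    if skipped and strict:
        raise ValueError(
            f"Expected only {suffix} files, but found {len(skipped)} other "
            f"file(s), for example {skipped[0]!r}."
        )
    if skipped:
        warnings.warn(f"Ignoring {len(skipped)} non-{file_type} file(s): {skipped}")
    return matched
```

Either mode would tell the user that a stray file is the cause, instead of surfacing a NotADirectoryError from deeper in the read path.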
Environment overview
- Environment location: Docker
- Method of NeMo-Curator install: [pip install or from source]. Please specify exact commands you used to install.
docker run \
--rm \
-it \
--gpus '"device=1"' \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-p 8888:8888 \
-p 8787:8787 \
nvcr.io/nvidia/nemo:dev