
DocumentDataset read errors when other files are present in directory

Open ronjer30 opened this issue 1 year ago • 0 comments

Describe the bug

DocumentDataset.read_parquet and DocumentDataset.read_json fail with a misleading error when reading directories that also contain files other than JSONL or Parquet. For example, Apache Spark jobs that write data to a directory typically also write marker and CRC files. When DocumentDataset reads from such a directory, the error reported is NotADirectoryError: [Errno 20] Not a directory: '._SUCCESS.crc', which does not tell the user that the real problem is an unexpected file type.

Steps/Code to reproduce bug

  • Create an empty file with a .crc or other non-JSONL/Parquet extension and place it in the data directory alongside the JSONL or Parquet files
  • Run the following, which fails:
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

input_files = get_all_files_paths_under(<directory>)
input_dataset = DocumentDataset.read_parquet(input_files, backend='cudf')

This returns an error similar to NotADirectoryError: [Errno 20] Not a directory: '._SUCCESS.crc'.

However, the input to the read_parquet method is actually a list of files, not a directory.
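
To make the setup concrete, here is a minimal sketch that builds a Spark-like output directory from empty files and shows what a recursive listing picks up. The helper names `list_files_under` and `make_spark_like_dir` are hypothetical and use only the standard library; `list_files_under` is a stand-in for a recursive file listing, not NeMo-Curator's actual implementation, and the file names are assumed examples of what Spark writes.

```python
import os
import tempfile


def list_files_under(root):
    """Hypothetical stand-in for a recursive file listing (not NeMo-Curator code)."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    )


def make_spark_like_dir(root):
    """Create empty files that mimic a Spark output directory (assumed names)."""
    names = [
        "part-00000.parquet",
        ".part-00000.parquet.crc",
        "_SUCCESS",
        "._SUCCESS.crc",
    ]
    for name in names:
        open(os.path.join(root, name), "w").close()
    return names


with tempfile.TemporaryDirectory() as tmp:
    make_spark_like_dir(tmp)
    listed = [os.path.basename(p) for p in list_files_under(tmp)]
    # Everything that is not a Parquet file still ends up in the list.
    non_parquet = [n for n in listed if not n.endswith(".parquet")]
    print(non_parquet)
```

Under these assumptions, the list handed to the reader contains marker and CRC files next to the real data file, which matches the situation described above.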

Expected behavior

DocumentDataset read methods could either ignore these markers or report a more informative error message
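
Until something like this exists in the library, a possible user-side workaround is to filter the file list before calling the reader. The sketch below is only an illustration of both suggested behaviors (skip non-data files, or fail with a clearer message). `split_by_extension` and `require_data_files` are hypothetical helpers, not part of NeMo-Curator, and the extension tuple is an assumption to adjust as needed.

```python
import os


def split_by_extension(paths, extensions=(".parquet",)):
    """Split paths into (kept, skipped) by file extension. Hypothetical helper."""
    kept, skipped = [], []
    for path in paths:
        name = os.path.basename(path).lower()
        if name.endswith(tuple(extensions)):
            kept.append(path)
        else:
            skipped.append(path)
    return kept, skipped


def require_data_files(paths, extensions=(".parquet",)):
    """Return only matching files, or raise an error that names what was skipped."""
    kept, skipped = split_by_extension(paths, extensions)
    if not kept:
        raise ValueError(
            f"No files with extensions {tuple(extensions)} found; "
            f"skipped non-data files: {skipped}"
        )
    return kept


paths = ["data/part-00000.parquet", "data/._SUCCESS.crc", "data/_SUCCESS"]
kept, skipped = split_by_extension(paths)
print(kept)
print(skipped)

# With NeMo-Curator installed, the filtered list could then be passed on, e.g.:
# input_dataset = DocumentDataset.read_parquet(kept, backend='cudf')
```

For JSONL input, the same helper could be called with `extensions=(".jsonl",)` before `DocumentDataset.read_json`.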

Environment overview

  • Environment location: Docker
  • Method of NeMo-Curator install: [pip install or from source]. Please specify exact commands you used to install.
docker run \
   --rm \
   -it \
   --gpus '"device=1"' \
   --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
   -p 8888:8888 \
   -p 8787:8787 \
   nvcr.io/nvidia/nemo:dev

ronjer30, Aug 22 '24 23:08