
`DocumentDataset` bug for reading relative file paths

Open sarahyurick opened this issue 1 year ago • 0 comments

As far as I am aware, this bug happens only when running interactively in a Jupyter Notebook connected to a remote Dask cluster (i.e., via a scheduler address); when running a regular Jupyter Notebook with a `LocalCluster`, everything works fine.

Steps to reproduce:

  1. Edit container-entrypoint.sh to start a JupyterLab server:
# Extract the "address" value using jq and export it as an environment variable
export SCHEDULER_ADDRESS=$(jq -r '.address' "$SCHEDULER_FILE")
echo "SCHEDULER_ADDRESS=$SCHEDULER_ADDRESS"

bash -c "pip install jupyterlab"

if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
  echo "Starting notebook"
  bash -c "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.password='' --notebook-dir=${BASE_DIR}"
  touch $DONE_MARKER
fi
  2. Run start-slurm.sh. No need to specify SCRIPT_PATH and SCRIPT_COMMAND.
  3. In the Jupyter notebook, this fails:
import os
from dask.distributed import Client
from nemo_curator.utils.distributed_utils import get_num_workers, read_data
from nemo_curator.datasets import DocumentDataset

scheduler_address = os.getenv("SCHEDULER_ADDRESS")
client = Client(address=scheduler_address)
print(f"Num Workers = {get_num_workers(client)}", flush=True)

dataset = DocumentDataset(
    read_data(
        ["./df_test1.jsonl"],
        file_type="jsonl",
        backend="pandas",
    )
)
dataset.df.head()

This also fails:

dataset = DocumentDataset.read_json(
    "./df_test1.jsonl",
    add_filename=True,
)
dataset.df.head()

The full error message:

Traceback (most recent call last):
  File "/usr/local/bin/text_cleaning", line 8, in <module>
    sys.exit(console_script())
  File "/opt/NeMo-Curator/nemo_curator/scripts/text_cleaning.py", line 93, in console_script
    main(attach_args().parse_args())
  File "/opt/NeMo-Curator/nemo_curator/scripts/text_cleaning.py", line 50, in main
    write_to_disk(
  File "/opt/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 505, in write_to_disk
    output = output.compute()
  File "/usr/local/lib/python3.10/dist-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 842, in __call__
    return self.func(
  File "/opt/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1023, in read
    obj = self._get_object_parser(self._combine_lines(data_lines))
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1403, in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
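
One plausible explanation (my assumption, not confirmed in the issue) is that the relative path is resolved against each worker's current working directory, which on remote Slurm workers can differ from the notebook process's directory, so `"./df_test1.jsonl"` points at a nonexistent file on the workers. A minimal diagnostic sketch, using a `LocalCluster` where the directories happen to agree:

```python
import os
from dask.distributed import Client, LocalCluster

# Diagnostic sketch (assumption: the failure is a cwd mismatch).
# With a LocalCluster, all workers share the notebook's working
# directory, so relative paths resolve consistently. With remote
# Slurm workers, client.run(os.getcwd) may return different
# directories, and "./df_test1.jsonl" would resolve differently there.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

worker_cwds = client.run(os.getcwd)  # maps worker address -> cwd
print("notebook cwd:", os.getcwd())
print("worker cwds: ", worker_cwds)

client.close()
cluster.close()
```

If the printed worker directories differ from the notebook's, a relative path cannot resolve correctly on those workers.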

To fix it, the user has to pass the absolute path to the data instead of the relative path. This behavior is unexpected because

import dask.dataframe as dd

df = dd.read_json("./df_test1.jsonl", lines=True)
df.head()

works fine in both setups. Thus, the bug appears to be that `DocumentDataset(read_data(...))` and `DocumentDataset.read_json()` do not handle relative paths correctly.
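
Until this is fixed, one workaround sketch is to resolve the path in the notebook process before handing it to NeMo Curator, so the workers receive an unambiguous absolute path (`os.path.abspath` resolves against the calling process's working directory):

```python
import os

# Workaround sketch: resolve the relative path in the notebook process
# before passing it to read_data / read_json, so that remote workers
# receive an absolute path rather than one relative to their own cwd.
input_path = os.path.abspath("./df_test1.jsonl")

# Then, for example:
# dataset = DocumentDataset.read_json(input_path, add_filename=True)
```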

sarahyurick — Aug 12 '24 23:08