NeMo-Curator
`DocumentDataset` bug for reading relative file paths
As far as I am aware, this bug only occurs when running interactively in a Jupyter notebook connected to a distributed Dask cluster; when the same notebook uses a LocalCluster, everything works fine.
Steps to reproduce:
- Edit `container-entrypoint.sh` to start a JupyterLab server:

```sh
# Extract the "address" value using jq and export it as an environment variable
export SCHEDULER_ADDRESS=$(jq -r '.address' "$SCHEDULER_FILE")
echo "SCHEDULER_ADDRESS=$SCHEDULER_ADDRESS"

bash -c "pip install jupyterlab"
if [[ -z "$SLURM_NODEID" ]] || [[ $SLURM_NODEID == 0 ]]; then
    echo "Starting notebook"
    bash -c "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.password='' --notebook-dir=${BASE_DIR}"
    touch $DONE_MARKER
fi
```
- Run `start-slurm.sh`. There is no need to specify `SCRIPT_PATH` and `SCRIPT_COMMAND`.
- In the Jupyter notebook, this fails:

```python
import os
from dask.distributed import Client
from nemo_curator.utils.distributed_utils import get_num_workers, read_data
from nemo_curator.datasets import DocumentDataset

scheduler_address = os.getenv("SCHEDULER_ADDRESS")
client = Client(address=scheduler_address)
print(f"Num Workers = {get_num_workers(client)}", flush=True)

dataset = DocumentDataset(
    read_data(
        ["./df_test1.jsonl"],
        file_type="jsonl",
        backend="pandas",
    )
)
dataset.df.head()
```
This also fails:
```python
dataset = DocumentDataset.read_json(
    "./df_test1.jsonl",
    add_filename=True,
)
dataset.df.head()
```
The full error message:
```
Traceback (most recent call last):
  File "/usr/local/bin/text_cleaning", line 8, in <module>
    sys.exit(console_script())
  File "/opt/NeMo-Curator/nemo_curator/scripts/text_cleaning.py", line 93, in console_script
    main(attach_args().parse_args())
  File "/opt/NeMo-Curator/nemo_curator/scripts/text_cleaning.py", line 50, in main
    write_to_disk(
  File "/opt/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 505, in write_to_disk
    output = output.compute()
  File "/usr/local/lib/python3.10/dist-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/io.py", line 842, in __call__
    return self.func(
  File "/opt/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1023, in read
    obj = self._get_object_parser(self._combine_lines(data_lines))
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py", line 1403, in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
```
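A likely explanation for the error (an assumption on my part, not confirmed by the traceback) is that each Dask worker resolves `./df_test1.jsonl` against its own working directory, which on a multi-node cluster generally differs from the notebook's, so the worker either finds nothing or the wrong content at that path. A minimal stdlib-only sketch of how a relative path changes meaning with the working directory:

```python
import os
import tempfile
from pathlib import Path

# Simulate the notebook's working directory, which contains the data file.
notebook_dir = tempfile.mkdtemp()
(Path(notebook_dir) / "df_test1.jsonl").write_text('{"text": "hello"}\n')

# Simulate a worker process started from a different working directory.
worker_dir = tempfile.mkdtemp()

os.chdir(notebook_dir)
print(Path("./df_test1.jsonl").exists())  # True: resolved against the notebook's cwd

os.chdir(worker_dir)
print(Path("./df_test1.jsonl").exists())  # False: same relative path, different cwd
```

With a LocalCluster, notebook and workers share one working directory, which would explain why the relative path only breaks in the distributed setup.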
To work around it, the user has to pass the absolute path to the data instead of the relative path. This behavior is unexpected because

```python
import dask.dataframe as dd

df = dd.read_json("./df_test1.jsonl", lines=True)
df.head()
```

works fine no matter the cluster setup. Thus, the bug appears to be in `DocumentDataset(read_data())` and `DocumentDataset.read_json()` not handling relative paths correctly.
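Until this is fixed, resolving the path on the client side before handing it to NeMo-Curator avoids the problem. A sketch of the workaround (assuming the same file layout as above):

```python
from pathlib import Path

# Resolve "./df_test1.jsonl" against the notebook's working directory on
# the client, so every worker receives the same unambiguous location.
input_path = str(Path("./df_test1.jsonl").resolve())
assert Path(input_path).is_absolute()

# This absolute string can then be passed to DocumentDataset.read_json()
# or read_data() in place of the relative path.
```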