NeMo-Curator tutorials/peft-curation-with-sdg Failing with Runtime Exceptions.

We have attempted to run tutorials/peft-curation-with-sdg and facing runtime errors, details are mentioned below with the environment setup information we tried.

python ./main.py \
	--api-key <Token From: https://build.nvidia.com/nvidia/nemotron-4-340b-instruct?integrate_nim=true&hosted_api=true > \
	--device gpu \
	--synth-gen-rounds 1 \
	--synth-gen-ratio 0.001 \
	--synth-gen-model "nvidia/nemotron-4-340b-instruct"

Environment Setup

conda create --name nemo python==3.10.12
conda activate nemo

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['all']

git clone https://github.com/NVIDIA/NeMo.git

cd ./NeMo

apt install gcc-12 g++-12

# Then you need to set this version as default one:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1

conda activate nemo

cd ./NeMo

python ./setup.py install

conda install rapidsai::cudf
conda install rapidsai::dask-cudf
conda install rapidsai::cuml

pip install annotated-types
pip install pydantic-core
pip install httpcore
pip install lxml-html-clean

Errors

Errors for Dataset: https://huggingface.co/datasets/ymoslem/MedicalSciences-StackExchange/resolve/main/medical.stackexchange-questions-answers.json?download=true

Key:       ('lambda-cfabe91fdbaed95ddbb48379ec3121d3', 9)
State:     executing
Function:  execute_task
args:      ((<function reify at 0x7d3869cdfbe0>, (<function map_chunk at 0x7d3869cfc040>, <function SemanticClusterLevelDedup.compute_semantic_match_dfs.<locals>.<lambda> at 0x7d36e7f8fbe0>, [[9]], None, {})))
kwargs:    {}
Exception: 'ValueError("invalid literal for int() with base 10: \'law-stackexchange-qa-9905\'")'
Traceback: '  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify\n    seq = list(seq)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__\n    return self.f(*vals)\n  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>\n    lambda cluster_id: get_semantic_matches_per_cluster(\n  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster\n    text_ids = cluster_i[:, 0].astype(id_col_type)\n'

2024-08-14 12:04:51,964 - distributed.worker - WARNING - Compute Failed
Key:       ('lambda-cfabe91fdbaed95ddbb48379ec3121d3', 8)
State:     executing
Function:  execute_task
args:      ((<function reify at 0x7d3869cdfbe0>, (<function map_chunk at 0x7d3869cfc040>, <function SemanticClusterLevelDedup.compute_semantic_match_dfs.<locals>.<lambda> at 0x7d36f199f490>, [[8]], None, {})))
kwargs:    {}
Exception: 'ValueError("invalid literal for int() with base 10: \'law-stackexchange-qa-19252\'")'
Traceback: '  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify\n    seq = list(seq)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__\n    return self.f(*vals)\n  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>\n    lambda cluster_id: get_semantic_matches_per_cluster(\n  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster\n    text_ids = cluster_i[:, 0].astype(id_col_type)\n'

Traceback (most recent call last):
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
    main()
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
    train_fp_curated = run_pipeline(args, train_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
    dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 227, in run_curation_pipeline
    dataset = gpu_curation_steps(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 141, in semantic_dedupe
    dedup_ids = semdup(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 570, in __call__
    self.semantic_cluster_dedup.compute_semantic_match_dfs(self.eps_thresholds)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 467, in compute_semantic_match_dfs
    tasks.compute()
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify
    seq = list(seq)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__
    return self.f(*vals)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>
    lambda cluster_id: get_semantic_matches_per_cluster(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster
    text_ids = cluster_i[:, 0].astype(id_col_type)
ValueError: invalid literal for int() with base 10: 'law-stackexchange-qa-9905'
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)

Errors for Dataset: https://huggingface.co/datasets/ymoslem/Law-StackExchange/resolve/main/law-stackexchange-questions-answers.json

future: <Task finished name='Task-8451' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py:75> exception=AuthenticationError('Error code: 401')>
Traceback (most recent call last):
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/synthetic_gen.py", line 221, in _prompt_model
    gen_title, gen_question, gen_answer = await asyncio.gather(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 421, in generate_closed_qa_instructions
    openline_response = await self._prompt(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 77, in _prompt
    return await self.client.query_model(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/services/openai_client.py", line 123, in query_model
    response = await self.client.chat.completions.create(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/resources/chat/completions.py", line 1339, in create
    return await self._post(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1815, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1509, in request
    return await self._request(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1610, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401
Task exception was never retrieved
future: <Task finished name='Task-8452' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py:75> exception=AuthenticationError('Error code: 401')>
Traceback (most recent call last):
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/synthetic_gen.py", line 221, in _prompt_model
    gen_title, gen_question, gen_answer = await asyncio.gather(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 421, in generate_closed_qa_instructions
    openline_response = await self._prompt(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 77, in _prompt
    return await self.client.query_model(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/services/openai_client.py", line 123, in query_model
    response = await self.client.chat.completions.create(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/resources/chat/completions.py", line 1339, in create
    return await self._post(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1815, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1509, in request
    return await self._request(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1610, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401
Task exception was never retrieved
future: <Task finished name='Task-8455' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py:75> exception=AuthenticationError('Error code: 401')>
Traceback (most recent call last):
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/synthetic_gen.py", line 221, in _prompt_model
    gen_title, gen_question, gen_answer = await asyncio.gather(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 421, in generate_closed_qa_instructions
    openline_response = await self._prompt(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 77, in _prompt
    return await self.client.query_model(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/services/openai_client.py", line 123, in query_model
    response = await self.client.chat.completions.create(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/resources/chat/completions.py", line 1339, in create
    return await self._post(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1815, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1509, in request
    return await self._request(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1610, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401
Task exception was never retrieved
future: <Task finished name='Task-8458' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py:75> exception=AuthenticationError('Error code: 401')>
Traceback (most recent call last):
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
    return i, await f
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/synthetic_gen.py", line 221, in _prompt_model
    gen_title, gen_question, gen_answer = await asyncio.gather(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 421, in generate_closed_qa_instructions
    openline_response = await self._prompt(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 77, in _prompt
    return await self.client.query_model(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/services/openai_client.py", line 123, in query_model
    response = await self.client.chat.completions.create(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/resources/chat/completions.py", line 1339, in create
    return await self._post(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1815, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1509, in request
    return await self._request(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1610, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401

Errors for Dataset: Created a custom dataset based on law-stackexchange-questions-answers.json with one single record.

Reading 1 files
Traceback (most recent call last):
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'title'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
    main()
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
    train_fp_curated = run_pipeline(args, train_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
    dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 207, in run_curation_pipeline
    dataset = cpu_curation_steps(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/modify.py", line 31, in __call__
    dataset.df[self.text_field] = dataset.df[self.text_field].apply(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 4951, in __getitem__
    meta = self._meta[_extract_meta(key)]
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'title'
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)

Errors for Dataset: Created a custom dataset based on law-stackexchange-questions-answers.json with 100k+ entries.

Reading 1 files
/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
2024-08-14 14:13:03,126 - distributed.worker - WARNING - Compute Failed
Key:       ('to-parquet-fc0472286679d49f53313f93027e9065', 0)
State:     executing
Function:  subgraph_callable-5e534f6954054123de078e717e9367e9
args:      ((0,), '<crossfit.backend.torch.op.base.Predictor object a-18befc5a9c8e11d3a125f4e2f611e20c', {'number': 0, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x7cca6c-3510ecebf68286e44567efef058f7f2b', 'to_cudf_dispatch-25a6cae5e12e52fe804538d37bb046b3', 'assign-42e4fd5d877ed0406e2080192bc11744', 'text', 'answer', '\n', 'title', 'question', <bound method FilterLowScores.keep_document of <filters.FilterLowScores object at 0x737a3c503310>>, 'apply-f00d3b1f0bc5365043d5ad449cc081ad', <bound method FilterLowScores.score_document of <filters.FilterLowScores object at 0x737a3c503310>>, 'getitem-340ac611e67c7235a472afe0fa6529fa', 'answer_score', <bound method FilterLowScores.keep_document of <filters.FilterLowScores object at 0x737a3c503370>>, 'apply-5d07bbb647c3e1c3ea837331d3b646b3', <bound method FilterLowScores.score_document of <filters.FilterLowScores object at 0x737a3c503370>>, 'getitem-ad0338cf88d799fa6c5767bd2922a2f4', 'question_score', <bound method WordCountFilter.keep_do
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/optimization.py", line 1001, in __call__\n    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 157, in get\n    result = _execute_task(task, cache)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in _execute_task\n    return [_execute_task(a, cache) for a in arg]\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in <listcomp>\n    return [_execute_task(a, cache) for a in arg]\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in _execute_task\n    return [_execute_task(a, cache) for a in arg]\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in <listcomp>\n    return [_execute_task(a, cache) for a in arg]\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/utils.py", line 78, in apply\n    return func(*args, **kwargs)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 7164, in apply_and_enforce\n    df = func(*args, **kwargs)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 7073, in func\n    return orig_func(*args, **kwargs, partition_info=partition_info)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/base.py", line 96, in __call__\n    output = self.call(data, *args, **kwargs)\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 152, in call\n    input_ids, attention_mask = self.call_column(data[col])\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 117, in call_column\n    tokenized_data = self.tokenize_strings(text).copy()\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 68, in tokenize_strings\n    tokenized_data = tokenizer.batch_encode_plus(\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3338, in batch_encode_plus\n    return self._batch_encode_plus(\n  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 561, in _batch_encode_plus\n    for key in tokens_and_encodings[0][0].keys():\n'

Traceback (most recent call last):
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
    main()
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
    train_fp_curated = run_pipeline(args, train_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
    dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 227, in run_curation_pipeline
    dataset = gpu_curation_steps(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
    dataset = module(dataset)
  File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 141, in semantic_dedupe
    dedup_ids = semdup(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 568, in __call__
    embeddings_dataset = self.embedding_creator(dataset)
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 202, in __call__
    write_to_disk(
  File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/distributed_utils.py", line 519, in write_to_disk
    df.to_parquet(output_file_dir, write_index=False)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
    return func(*args, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask_cudf/core.py", line 283, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 1047, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/base.py", line 96, in __call__
    output = self.call(data, *args, **kwargs)
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 152, in call
    input_ids, attention_mask = self.call_column(data[col])
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 117, in call_column
    tokenized_data = self.tokenize_strings(text).copy()
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 68, in tokenize_strings
    tokenized_data = tokenizer.batch_encode_plus(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3338, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 561, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)

Aug 14 '24 15:08 praxi-roshan

Hi @praxi-roshan, thanks for opening this issue and providing detailed repro steps. I just attempted running the tutorial on the latest main commit with a freshly setup environment and it seems to work fine for me.

What I tried:

python3 --version # mine was 3.10.13
python3 -m venv ./venv
source ./venv/bin/activate
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
python tutorials/peft-curation-with-sdg/main.py --api-key <my_key> --device gpu

A few questions for you, so I can better assist you:

Did you happen to try running the tutorial without any modifications to see if works at all for you?
On that note, did you try running without providing an API key to see whether the regular data curation bits work as expected? I see some authentication errors, which may suggest your build.nvidia.com key may not have worked properly.
Similarly, did you happen to try with --device cpu to further isolate the scope of the issue?
I noticed that you included your command for installing the NeMo framework (pip install nemo_toolkit['all']), but I didn't quite figure out how the NeMo Curator was installed. How was it installed?

Aug 30 '24 17:08 Maghoumi

Hi @Maghoumi,

Thank you for the instructions, i tried installing based on your steps I'm getting a different error now: OSError: libcudart.so: cannot open shared object file: No such file or directory

I will troubleshoot and provide more feedback.

4. I noticed that you included your command for installing the NeMo framework (pip install nemo_toolkit['all']), but I didn't quite figure out how the NeMo Curator was installed. How was it installed?

Apologies in the post i mentioned wrong repo...! (https://github.com/NVIDIA/NeMo.git), This is how i installed

conda create --name nemo python==3.10.12
conda activate nemo

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['all']

git clone https://github.com/NVIDIA/NeMo-Curator.git

cd ./NeMo-Curator

apt install gcc-12 g++-12

# Then you need to set this version as default one:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1

conda activate nemo

cd ./NeMo-Curator

python ./setup.py install

conda install rapidsai::cudf
conda install rapidsai::dask-cudf
conda install rapidsai::cuml

pip install annotated-types
pip install pydantic-core
pip install httpcore
pip install lxml-html-clean

Sep 06 '24 12:09 praxi-roshan

Thank you for the instructions, i tried installing based on your steps I'm getting a different error now: OSError: libcudart.so: cannot open shared object file: No such file or directory

At which stage of running the tutorial do you encounter this issue? This error typically indicates an issue with CUDA installation and would mean the CUDA library files were not found. Since this is OS dependent, I would advise ensuring CUDA is properly installed in your environment before running the code. I often tend to use the following snippet to ensure CUDA is installed properly when I setup new environments (assuming that you have PyTorch installed):

import torch
print(torch.cuda.is_available())  # Should print True if CUDA is installed properly

Looking at your install instructions, I see you have:

conda activate nemo
cd ./NeMo-Curator
python ./setup.py install  # < -- have you tried `pip install .` instead?

Asking because pip install . should ensure dependencies are also handled properly.

Sep 06 '24 15:09 Maghoumi

Hi,

I tried on a new Ubuntu 23.10 Server environment with the below mentioned environment setup, and apparently it worked without any errors and created the curated files.

I have attached the complete log file of the curation process for your reference to confirm that it worked fine.

logs.log

Looks like synthetic data generation not possible without an API key, could you please let us known is there a workaround that we can generate synthetic data without an API Key?
Is there a way we can generate synthetic data without OpenAI client and using a locally downloaded LLM model?

conda create --name nemo python==3.10.13

conda deactivate
conda activate nemo

pip install Cython packaging

# Installed this to resolve this error: "OSError: libcudart.so: cannot open shared object file: No such file or directory"
conda install cudatoolkit=11.8

git clone https://github.com/NVIDIA/NeMo-Curator.git

cd ./NeMo-Curator

apt install gcc-12 g++-12

# Then you need to set this version as default one:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1

pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

python tutorials/peft-curation-with-sdg/main.py --api-key <my_key> --device gpu**

Sep 10 '24 08:09 praxi-roshan

Glad to hear that you were able to follow along after using another environment. Regarding your questions:

Looks like synthetic data generation not possible without an API key, could you please let us known is there a workaround that we can generate synthetic data without an API Key?

As you may have noticed already, currently the framework assumes that the LLM model that can generate synthetic data is accessible via some sort of an HTTP endpoint, which is either compatible with the OpenAI API, or the NeMo Framework API.

The default example aims to show the usage of externally hosted synthetic generator models (e.g. models hosted on on build.nvidia.com, or some other commercial gateways). This requires the usage of an API key to be able to be able to use that model. Which brings us to your next question:

Is there a way we can generate synthetic data without OpenAI client and using a locally downloaded LLM model?

You can certainly use locally downloaded models. The easiest path to do that would be to deploy the model yourself and make it available through an HTTP endpoint. This can be done via Nvidia NIMs (sign up for early access at https://developer.nvidia.com/nemo-microservices), or other solutions such as vLLM or TGI.

Once you deploy the model, you can treat it like any other externally hosted LLM and start querying it via the OpenAI API and your local endpoint (e.g. http://localhost:8000). Make sure to update the endpoint locations inside main.py so that it knows where to look for.

Hope this helps!

Sep 10 '24 20:09 Maghoumi

Hi @praxi-roshan, let me know the status of this issue, and whether we are good to close it.

Also, I also opened a new PR to improve the documentation for using LLM endpoints for synthetic data generation, hopefully that should make it a bit more clear for anybody else who's looking for using arbitrary LLM model deployments to run the tutorial: https://github.com/NVIDIA/NeMo-Curator/pull/301

Oct 14 '24 21:10 Maghoumi

Hi @Maghoumi,

I will close the ticket, thank you for the assistance.

Oct 18 '24 06:10 praxi-roshan

NeMo-Curator NeMo-Curator copied to clipboard

tutorials/peft-curation-with-sdg Failing with Runtime Exceptions.

NeMo-Curator
NeMo-Curator copied to clipboard