NeMo-Curator
tutorials/peft-curation-with-sdg Failing with Runtime Exceptions.
We attempted to run tutorials/peft-curation-with-sdg and are facing runtime errors. Details are provided below, along with the environment setup we used.
python ./main.py \
--api-key <Token From: https://build.nvidia.com/nvidia/nemotron-4-340b-instruct?integrate_nim=true&hosted_api=true > \
--device gpu \
--synth-gen-rounds 1 \
--synth-gen-ratio 0.001 \
--synth-gen-model "nvidia/nemotron-4-340b-instruct"
Environment Setup
conda create --name nemo python==3.10.12
conda activate nemo
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['all']
git clone https://github.com/NVIDIA/NeMo.git
cd ./NeMo
apt install gcc-12 g++-12
# Then set this version as the default:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1
conda activate nemo
cd ./NeMo
python ./setup.py install
conda install rapidsai::cudf
conda install rapidsai::dask-cudf
conda install rapidsai::cuml
pip install annotated-types
pip install pydantic-core
pip install httpcore
pip install lxml-html-clean
Errors
Errors for Dataset: https://huggingface.co/datasets/ymoslem/MedicalSciences-StackExchange/resolve/main/medical.stackexchange-questions-answers.json?download=true
Key: ('lambda-cfabe91fdbaed95ddbb48379ec3121d3', 9)
State: executing
Function: execute_task
args: ((<function reify at 0x7d3869cdfbe0>, (<function map_chunk at 0x7d3869cfc040>, <function SemanticClusterLevelDedup.compute_semantic_match_dfs.<locals>.<lambda> at 0x7d36e7f8fbe0>, [[9]], None, {})))
kwargs: {}
Exception: 'ValueError("invalid literal for int() with base 10: \'law-stackexchange-qa-9905\'")'
Traceback: ' File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify\n seq = list(seq)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__\n return self.f(*vals)\n File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>\n lambda cluster_id: get_semantic_matches_per_cluster(\n File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster\n text_ids = cluster_i[:, 0].astype(id_col_type)\n'
2024-08-14 12:04:51,964 - distributed.worker - WARNING - Compute Failed
Key: ('lambda-cfabe91fdbaed95ddbb48379ec3121d3', 8)
State: executing
Function: execute_task
args: ((<function reify at 0x7d3869cdfbe0>, (<function map_chunk at 0x7d3869cfc040>, <function SemanticClusterLevelDedup.compute_semantic_match_dfs.<locals>.<lambda> at 0x7d36f199f490>, [[8]], None, {})))
kwargs: {}
Exception: 'ValueError("invalid literal for int() with base 10: \'law-stackexchange-qa-19252\'")'
Traceback: ' File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify\n seq = list(seq)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__\n return self.f(*vals)\n File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>\n lambda cluster_id: get_semantic_matches_per_cluster(\n File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster\n text_ids = cluster_i[:, 0].astype(id_col_type)\n'
Traceback (most recent call last):
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
main()
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
train_fp_curated = run_pipeline(args, train_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 227, in run_curation_pipeline
dataset = gpu_curation_steps(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
dataset = module(dataset)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 141, in semantic_dedupe
dedup_ids = semdup(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 570, in __call__
self.semantic_cluster_dedup.compute_semantic_match_dfs(self.eps_thresholds)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 467, in compute_semantic_match_dfs
tasks.compute()
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 376, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 662, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 1868, in reify
seq = list(seq)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/bag/core.py", line 2056, in __next__
return self.f(*vals)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 456, in <lambda>
lambda cluster_id: get_semantic_matches_per_cluster(
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/semdedup_utils.py", line 252, in get_semantic_matches_per_cluster
text_ids = cluster_i[:, 0].astype(id_col_type)
ValueError: invalid literal for int() with base 10: 'law-stackexchange-qa-9905'
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)
Errors for Dataset: https://huggingface.co/datasets/ymoslem/Law-StackExchange/resolve/main/law-stackexchange-questions-answers.json
future: <Task finished name='Task-8451' coro=<tqdm_asyncio.gather.<locals>.wrap_awaitable() done, defined at /home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py:75> exception=AuthenticationError('Error code: 401')>
Traceback (most recent call last):
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/tqdm/asyncio.py", line 76, in wrap_awaitable
return i, await f
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/synthetic_gen.py", line 221, in _prompt_model
gen_title, gen_question, gen_answer = await asyncio.gather(
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 421, in generate_closed_qa_instructions
openline_response = await self._prompt(
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/synthetic/async_nemotron.py", line 77, in _prompt
return await self.client.query_model(
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/services/openai_client.py", line 123, in query_model
response = await self.client.chat.completions.create(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/resources/chat/completions.py", line 1339, in create
return await self._post(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1815, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1509, in request
return await self._request(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/openai-1.40.2-py3.10.egg/openai/_base_client.py", line 1610, in _request
raise self._make_status_error_from_response(err.response) from None
openai.AuthenticationError: Error code: 401
Task exception was never retrieved
(the same openai.AuthenticationError: Error code: 401 traceback is repeated for Task-8452, Task-8455, and Task-8458)
Errors for Dataset: a custom dataset created from law-stackexchange-questions-answers.json with a single record.
Reading 1 files
Traceback (most recent call last):
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'title'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
main()
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
train_fp_curated = run_pipeline(args, train_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 207, in run_curation_pipeline
dataset = cpu_curation_steps(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
dataset = module(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/modify.py", line 31, in __call__
dataset.df[self.text_field] = dataset.df[self.text_field].apply(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 4951, in __getitem__
meta = self._meta[_extract_meta(key)]
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 'title'
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)
Errors for Dataset: a custom dataset created from law-stackexchange-questions-answers.json with 100k+ entries.
Reading 1 files
/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
2024-08-14 14:13:03,126 - distributed.worker - WARNING - Compute Failed
Key: ('to-parquet-fc0472286679d49f53313f93027e9065', 0)
State: executing
Function: subgraph_callable-5e534f6954054123de078e717e9367e9
args: ((0,), '<crossfit.backend.torch.op.base.Predictor object a-18befc5a9c8e11d3a125f4e2f611e20c', {'number': 0, 'division': None}, '<crossfit.op.tokenize.Tokenizer object at 0x7cca6c-3510ecebf68286e44567efef058f7f2b', 'to_cudf_dispatch-25a6cae5e12e52fe804538d37bb046b3', 'assign-42e4fd5d877ed0406e2080192bc11744', 'text', 'answer', '\n', 'title', 'question', <bound method FilterLowScores.keep_document of <filters.FilterLowScores object at 0x737a3c503310>>, 'apply-f00d3b1f0bc5365043d5ad449cc081ad', <bound method FilterLowScores.score_document of <filters.FilterLowScores object at 0x737a3c503310>>, 'getitem-340ac611e67c7235a472afe0fa6529fa', 'answer_score', <bound method FilterLowScores.keep_document of <filters.FilterLowScores object at 0x737a3c503370>>, 'apply-5d07bbb647c3e1c3ea837331d3b646b3', <bound method FilterLowScores.score_document of <filters.FilterLowScores object at 0x737a3c503370>>, 'getitem-ad0338cf88d799fa6c5767bd2922a2f4', 'question_score', <bound method WordCountFilter.keep_do
kwargs: {}
Exception: "IndexError('list index out of range')"
Traceback: ' File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/optimization.py", line 1001, in __call__\n return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 157, in get\n result = _execute_task(task, cache)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in _execute_task\n return [_execute_task(a, cache) for a in arg]\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in <listcomp>\n return [_execute_task(a, cache) for a in arg]\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in <genexpr>\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in _execute_task\n return [_execute_task(a, cache) for a in arg]\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 121, in <listcomp>\n return [_execute_task(a, cache) for a in arg]\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n return func(*(_execute_task(a, cache) for a in args))\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/utils.py", line 78, in apply\n return func(*args, **kwargs)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 7164, in apply_and_enforce\n df = func(*args, **kwargs)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/core.py", line 7073, in func\n return orig_func(*args, **kwargs, partition_info=partition_info)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/base.py", line 96, in __call__\n output = self.call(data, *args, **kwargs)\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 152, in call\n input_ids, attention_mask = self.call_column(data[col])\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 117, in call_column\n tokenized_data = self.tokenize_strings(text).copy()\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 68, in tokenize_strings\n tokenized_data = tokenizer.batch_encode_plus(\n File 
"/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3338, in batch_encode_plus\n return self._batch_encode_plus(\n File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 561, in _batch_encode_plus\n for key in tokens_and_encodings[0][0].keys():\n'
Traceback (most recent call last):
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 415, in <module>
main()
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 406, in main
train_fp_curated = run_pipeline(args, train_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 310, in run_pipeline
dataset_df, n_rows_before, n_rows_after = run_curation_pipeline(args, jsonl_fp)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 227, in run_curation_pipeline
dataset = gpu_curation_steps(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/meta.py", line 22, in __call__
dataset = module(dataset)
File "/data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py", line 141, in semantic_dedupe
dedup_ids = semdup(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 568, in __call__
embeddings_dataset = self.embedding_creator(dataset)
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/modules/semantic_dedup.py", line 202, in __call__
write_to_disk(
File "/data/ops/roshan/remote/nemo-curator/nemo_curator/utils/distributed_utils.py", line 519, in write_to_disk
df.to_parquet(output_file_dir, write_index=False)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask_cudf/core.py", line 283, in to_parquet
return to_parquet(self, path, *args, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/dataframe/io/parquet/core.py", line 1047, in to_parquet
out = out.compute(**compute_kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 376, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/dask/base.py", line 662, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/base.py", line 96, in __call__
output = self.call(data, *args, **kwargs)
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 152, in call
input_ids, attention_mask = self.call_column(data[col])
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 117, in call_column
tokenized_data = self.tokenize_strings(text).copy()
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/crossfit-0.0.4-py3.10.egg/crossfit/op/tokenize.py", line 68, in tokenize_strings
tokenized_data = tokenizer.batch_encode_plus(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3338, in batch_encode_plus
return self._batch_encode_plus(
File "/home/praxi-usr/anaconda3/envs/nemo/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 561, in _batch_encode_plus
for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
ERROR conda.cli.main_run:execute(125): `conda run python /data/ops/roshan/remote/nemo-curator/tutorials/peft-curation-with-sdg/main.py --api-key Ynvapi-iGgFsXrdl4FkGA7MFPzNgL3r8o2c_dcZ7SGT8Yx7ZlYNWNix2xnX8IezbxTVbcPW --device gpu --synth-gen-rounds 1 --synth-gen-ratio 0.001 --synth-gen-model nvidia/nemotron-4-340b-instruct` failed. (See above for error)
Hi @praxi-roshan, thanks for opening this issue and providing detailed repro steps. I just attempted running the tutorial on the latest main commit with a freshly set-up environment, and it works fine for me.
What I tried:
python3 --version # mine was 3.10.13
python3 -m venv ./venv
source ./venv/bin/activate
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
python tutorials/peft-curation-with-sdg/main.py --api-key <my_key> --device gpu
A few questions for you, so I can better assist you:
- Did you happen to try running the tutorial without any modifications to see if it works at all for you?
- On that note, did you try running without providing an API key to see whether the regular data curation bits work as expected? I see some authentication errors, which may suggest your build.nvidia.com key may not have worked properly.
- Similarly, did you happen to try with --device cpu to further isolate the scope of the issue?
- I noticed that you included your command for installing the NeMo framework (pip install nemo_toolkit['all']), but I didn't quite figure out how NeMo Curator was installed. How was it installed?
Hi @Maghoumi,
Thank you for the instructions. I tried installing based on your steps, and I'm getting a different error now:
OSError: libcudart.so: cannot open shared object file: No such file or directory
I will troubleshoot and provide more feedback.
4. I noticed that you included your command for installing the NeMo framework (pip install nemo_toolkit['all']), but I didn't quite figure out how NeMo Curator was installed. How was it installed?
Apologies, in the post I mentioned the wrong repo (https://github.com/NVIDIA/NeMo.git). This is how I installed it:
conda create --name nemo python==3.10.12
conda activate nemo
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['all']
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd ./NeMo-Curator
apt install gcc-12 g++-12
# Then set this version as the default:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1
conda activate nemo
cd ./NeMo-Curator
python ./setup.py install
conda install rapidsai::cudf
conda install rapidsai::dask-cudf
conda install rapidsai::cuml
pip install annotated-types
pip install pydantic-core
pip install httpcore
pip install lxml-html-clean
Thank you for the instructions. I tried installing based on your steps, and I'm getting a different error now: OSError: libcudart.so: cannot open shared object file: No such file or directory
At which stage of running the tutorial do you encounter this issue? This error typically indicates an issue with the CUDA installation and means the CUDA library files were not found. Since this is OS dependent, I would advise ensuring CUDA is properly installed in your environment before running the code. I often use the following snippet to verify CUDA is installed properly when I set up new environments (assuming that you have PyTorch installed):
import torch
print(torch.cuda.is_available()) # Should print True if CUDA is installed properly
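If PyTorch reports CUDA as available but the error persists, a similar hedged check (assuming cudf is installed, since the RAPIDS libraries also need libcudart at import time) is:
import cudf  # importing cudf fails with the same OSError if libcudart cannot be found
# A tiny GPU operation to confirm the RAPIDS stack can actually use the device.
print(cudf.Series([1, 2, 3]).sum())  # should print 6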
Looking at your install instructions, I see you have:
conda activate nemo
cd ./NeMo-Curator
python ./setup.py install  # <-- have you tried `pip install .` instead?
Asking because pip install . should ensure dependencies are also handled properly.
Hi,
I tried on a new Ubuntu 23.10 Server environment with the setup mentioned below, and it worked without any errors and created the curated files.
I have attached the complete log file of the curation process for your reference to confirm that it worked fine.
- It looks like synthetic data generation is not possible without an API key. Could you please let us know whether there is a workaround to generate synthetic data without an API key?
- Is there a way to generate synthetic data without the OpenAI client, using a locally downloaded LLM model?
conda create --name nemo python==3.10.13
conda deactivate
conda activate nemo
pip install Cython packaging
# Installed this to resolve this error: "OSError: libcudart.so: cannot open shared object file: No such file or directory"
conda install cudatoolkit=11.8
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd ./NeMo-Curator
apt install gcc-12 g++-12
# Then set this version as the default:
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 1
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
python tutorials/peft-curation-with-sdg/main.py --api-key <my_key> --device gpu
Glad to hear that you were able to follow along after using another environment. Regarding your questions:
It looks like synthetic data generation is not possible without an API key. Could you please let us know whether there is a workaround to generate synthetic data without an API key?
As you may have noticed already, the framework currently assumes that the LLM used to generate synthetic data is accessible via an HTTP endpoint that is compatible with either the OpenAI API or the NeMo Framework API.
The default example aims to show the usage of externally hosted synthetic generator models (e.g., models hosted on build.nvidia.com or other commercial gateways). This requires an API key to use that model, which brings us to your next question:
Is there a way to generate synthetic data without the OpenAI client, using a locally downloaded LLM model?
You can certainly use locally downloaded models. The easiest path to do that would be to deploy the model yourself and make it available through an HTTP endpoint. This can be done via Nvidia NIMs (sign up for early access at https://developer.nvidia.com/nemo-microservices), or other solutions such as vLLM or TGI.
Once you deploy the model, you can treat it like any other externally hosted LLM and start querying it via the OpenAI API against your local endpoint (e.g. http://localhost:8000). Make sure to update the endpoint locations inside main.py so that it knows where to look. A minimal sketch is shown below.
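For illustration only, here is a minimal sketch of querying a locally deployed model through an OpenAI-compatible endpoint. Assumptions: a vLLM (or similar) server is already serving at http://localhost:8000/v1, and "my-local-model" is a placeholder model name; wherever the tutorial constructs its client, it would be pointed at the same base_url.
from openai import OpenAI

# Assumption: a local vLLM/TGI server exposes an OpenAI-compatible endpoint here;
# a real API key is usually not required for a local deployment.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # local endpoint instead of build.nvidia.com
    api_key="not-used",                   # placeholder; most local servers ignore it
)

# "my-local-model" is a placeholder for whatever model name the local server registers.
response = client.chat.completions.create(
    model="my-local-model",
    messages=[{"role": "user", "content": "Write a question about contract law."}],
    temperature=0.2,
    max_tokens=256,
)
print(response.choices[0].message.content)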
Hope this helps!
Hi @praxi-roshan, let me know the status of this issue, and whether we are good to close it.
Also, I opened a new PR to improve the documentation for using LLM endpoints for synthetic data generation; hopefully that will make things a bit clearer for anybody else looking to use arbitrary LLM deployments to run the tutorial: https://github.com/NVIDIA/NeMo-Curator/pull/301
Hi @Maghoumi,
I will close the ticket. Thank you for the assistance.