XennaExecutor - RuntimeError: "Ray Client is already connected." when RAY_ADDRESS is set to a remote cluster
Describe the bug
I'm setting up a Ray cluster and then running a pipeline after setting the RAY_ADDRESS variable.
Steps/Code to reproduce bug
RAY_ADDRESS=ray://$head_node_ip:10001
This is the pipeline:
pipeline = Pipeline(
    name="FinePDFs Pipeline",
    description="Pipeline to process FinePDFs dataset",
)
pipeline.add_stage(
    ParquetReader(file_paths=data_paths, fields=["text", "language"], blocksize="128MB", read_kwargs={"dtype_backend": "numpy_nullable"})
)
pipeline.add_stage(
    QualityClassifier(cache_dir=HF_CACHE, pred_column="quality", model_inference_batch_size=256)
)
pipeline.add_stage(
    ParquetWriter(str(output_dir), write_kwargs={"partition_cols": ["language", "quality"]})
)
pipeline.describe()

executor = XennaExecutor(
    config={
        "execution_mode": "streaming",
    }
)
results = pipeline.run(executor=executor)
And this is the log, which terminates with an error suggesting that ray.init is being called twice:
2025-10-08 18:05:43.753 | INFO | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'parquet_reader' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.754 | INFO | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'quality_classifier_deberta_classifier' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.759 | INFO | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'parquet_writer' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.759 | INFO | nemo_curator.pipeline.pipeline:build:70 - Planning pipeline: FinePDFs Pipeline
2025-10-08 18:05:43.760 | INFO | nemo_curator.pipeline.pipeline:_decompose_stages:106 - Decomposing composite stage: parquet_reader
2025-10-08 18:05:43.760 | INFO | nemo_curator.pipeline.pipeline:_decompose_stages:120 - Expanded 'parquet_reader' into 2 execution stages
2025-10-08 18:05:43.760 | INFO | nemo_curator.pipeline.pipeline:_decompose_stages:106 - Decomposing composite stage: quality_classifier_deberta_classifier
2025-10-08 18:05:43.760 | INFO | nemo_curator.pipeline.pipeline:_decompose_stages:120 - Expanded 'quality_classifier_deberta_classifier' into 2 execution stages
2025-10-08 18:05:43.760 | INFO | nemo_curator.backends.xenna.executor:execute:129 - Execution mode: STREAMING
2025-10-08 18:05:43,831 INFO worker.py:1630 -- Using address ray://10.1.0.64:10001 set in the environment variable RAY_ADDRESS
2025-10-08 18:05:43,908 INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error, log_to_driver
2025-10-08 18:05:51.642 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.642 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.644 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.644 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
...
Here everything still looks ok, but then:
2025-10-08 18:05:52.694 | INFO | cosmos_xenna.pipelines.private.scheduling.autoscaling_algorithms:run_fragmentation_autoscaler:792 - Running phase 4...
2025-10-08 18:05:52.699 | INFO | cosmos_xenna.pipelines.private.streaming:run_pipeline:428 - Done calculating autoscaling...
(Stage 00 - FilePartitioningStage pid=1000177) 2025-10-08 18:06:00.226 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 00 - FilePartitioningStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 00 - FilePartitioningStage pid=1000177) 2025-10-08 18:06:00.226 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 00 - FilePartitioningStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 01 - ParquetReaderStage pid=1000178) 2025-10-08 18:06:00.334 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 01 - ParquetReaderStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 01 - ParquetReaderStage pid=1000178) 2025-10-08 18:06:00.334 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 01 - ParquetReaderStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 00 - FilePartitioningStage pid=2582737, ip=10.1.0.68) 2025-10-08 18:06:00.365 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 00 - FilePartitioningStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 00 - FilePartitioningStage pid=2582737, ip=10.1.0.68) 2025-10-08 18:06:00.366 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 00 - FilePartitioningStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 01 - ParquetReaderStage pid=2582739, ip=10.1.0.68) 2025-10-08 18:06:00.364 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 01 - ParquetReaderStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 01 - ParquetReaderStage pid=2582739, ip=10.1.0.68) 2025-10-08 18:06:00.364 | INFO | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 01 - ParquetReaderStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
2025-10-08 18:06:04.988 | INFO | cosmos_xenna.pipelines.private.monitoring:_make_stats:349 - Took 0.2464303970336914 seconds to get node resource info.
2025-10-08 18:06:04,993 INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: log_to_driver
2025-10-08 18:06:04.993 | ERROR | nemo_curator.backends.xenna.executor:execute:144 - Pipeline execution failed: Ray Client is already connected. Maybe you called ray.init("ray://<address>") twice by accident?
Traceback (most recent call last):
File "/leonardo_work/iGen_train/fdambro1/ai-dataset-ingestion/src/domyn_data_pipelines/processing/finepdfs/pipeline_new_api.py", line 87, in <module>
results = pipeline.run(executor=executor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/pipeline/pipeline.py", line 197, in run
return executor.execute(self.stages, initial_tasks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/Curator/nemo_curator/backends/xenna/executor.py", line 141, in execute
results = pipelines_v1.run_pipeline(pipeline_spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/pipelines.py", line 168, in run_pipeline
return streaming.run_pipeline(pipeline_spec, cluster_resources)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/streaming.py", line 446, in run_pipeline
if monitor.update(len(input_queue), queue_lengths, pool_extra_metadatas) and (last_stats is not None):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 323, in update
stats = PipelinestatsWithTime(time.time(), self._make_stats(input_len, ext_output_lens, task_metadata_per_pool))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 352, in _make_stats
cluster_info = make_ray_cluster_info()
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 245, in make_ray_cluster_info
actors=get_ray_actors(),
^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 138, in get_ray_actors
actors_data: Any = ray.util.state.list_actors(filters=[("state", "=", "ALIVE")], limit=limit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/state/api.py", line 817, in list_actors
return StateApiClient(address=address).list(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/state/api.py", line 140, in __init__
api_server_url = get_address_for_submission_client(address)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/dashboard/utils.py", line 728, in get_address_for_submission_client
address = ray_client_address_to_api_server_url(address)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/dashboard/utils.py", line 650, in ray_client_address_to_api_server_url
with ray.init(address=address) as client_context:
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 1655, in init
ctx = builder.connect()
^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/client_builder.py", line 173, in connect
client_info_dict = ray.util.client_connect.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/ray/util/client_connect.py", line 42, in connect
raise RuntimeError(
RuntimeError: Ray Client is already connected. Maybe you called ray.init("ray://<address>") twice by accident?
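Judging from the traceback, the second connection comes from ray.util.state.list_actors, which resolves the ray:// address into a dashboard URL by calling ray.init again while the driver already holds a Ray Client connection. The underlying behavior can be reproduced with plain Ray, independent of Curator (a minimal sketch; the address is a placeholder for the head node):
import ray

# Minimal reproduction of the underlying Ray Client behavior (no Curator involved).
ray.init("ray://10.1.0.64:10001")
try:
    ray.init("ray://10.1.0.64:10001")  # second client connection on the same driver
except RuntimeError as err:
    print(err)  # "Ray Client is already connected. ..."
finally:
    ray.shutdown()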
Expected behavior
The pipeline should be submitted to the cluster, and Ray should not be initialized twice.
Additional context
This is running on slurm with singularity as the container engine (no pyxis available). If needed, I can share the script being executed; it is very similar to the current PR #1168. Please note that on a single node there is no issue whatsoever.
Thanks for opening @federico-dambrosio. Quick update: I have been able to reproduce but don't have a concrete root cause yet. We'll share an update soon when we have something.
Hi, can you please try the following changes:
https://github.com/NVIDIA-NeMo/Curator/pull/1168#discussion_r2419118931 https://github.com/NVIDIA-NeMo/Curator/pull/1168#discussion_r2419121802
I was unable to get a slurm job running but was able to repro this error locally and the above fixed it.
Hey @abhinavg4, that worked! For future reference, since our slurm cluster uses singularity, I had to update the script so that the head process and the pipeline process share the same /tmp (simply adding --bind /shared/fs/folder:/tmp), because I was getting this warning:
2025-10-10 12:46:08,068 INFO node.py:1016 -- Can't find a `node_ip_address.json` file from /tmp/ray_21409494/session_2025-10-10_12-44-36_563161_13. Have you started Ray instance using `ray start` or `ray.init`?
Still, I'm facing another issue: our slurm cluster's compute nodes do not have access to the internet, and what I'm doing is:
- setting the cache_dir for the models I'm using (same pipeline, this works on a single node)
- setting the HF_HOME env variable (from which HF_HUB_CACHE is derived as $HF_HOME/hub)
- setting HF_HUB_OFFLINE=1, which should force the usage of offline files (roughly the environment sketched below)
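In Python terms, that is roughly equivalent to the following environment (paths are placeholders; HF_CACHE is the value passed as cache_dir in the pipeline above):
import os

# Placeholder paths on the shared filesystem.
os.environ["HF_HOME"] = "/shared/fs/hf_home"            # HF_HUB_CACHE defaults to $HF_HOME/hub
os.environ["HF_HUB_OFFLINE"] = "1"                      # tell huggingface_hub to use local files only
HF_CACHE = os.path.join(os.environ["HF_HOME"], "hub")   # passed as cache_dir to the classifier stage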
But the nodes are still trying to connect to the internet:
2025-10-10 12:55:23.088 | ERROR | cosmos_xenna.ray_utils.actor_pool:_move_pending_node_actor_to_pending:764 - Unexpected error getting node setup result for node 8d04bc70fd3ca10bfa269bf4814ed02e6f7353a7137e02ea8956be2c, actor 11: ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/opt/venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 488, in _make_request
raise new_e
File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/opt/venv/lib/python3.12/site-packages/urllib3/connection.py", line 753, in connect
self.sock = sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable
The above exception was the direct cause of the following exception:
ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
File "/opt/venv/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/nvidia/quality-classifier-deberta/tree/main/additional_chat_templates?recursive=False&expand=False (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
During handling of the above exception, another exception occurred:
ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 102, in setup_on_node
self._setup(local_files_only=False)
File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 115, in _setup
self.tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1116, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2012, in from_pretrained
for template in list_repo_templates(
^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 167, in list_repo_templates
return [
^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 3177, in list_repo_tree
for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_pagination.py", line 36, in paginate
r = session.get(path, params=params, headers=headers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 96, in send
return super().send(request, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/requests/adapters.py", line 677, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/nvidia/quality-classifier-deberta/tree/main/additional_chat_templates?recursive=False&expand=False (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 8c9d8a50-925c-46c6-821e-5992c24c8d34)')
The above exception was the direct cause of the following exception:
ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 390, in setup_on_node
retry.do_with_retries(func_to_call, max_attempts=self._params.num_node_setup_retries)
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/utils/retry.py", line 56, in do_with_retries
return func()
^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 386, in func_to_call
return self._stage_interface.setup_on_node(resources.NodeInfo(node_location), copy.deepcopy(metadata))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/specs.py", line 432, in setup_on_node
self._stage.setup_on_node(node_info, worker_metadata)
File "/opt/Curator/nemo_curator/backends/xenna/adapter.py", line 88, in setup_on_node
super().setup_on_node(generic_node_info, generic_worker_metadata)
File "/opt/Curator/nemo_curator/backends/base.py", line 105, in setup_on_node
self.stage.setup_on_node(node_info, worker_metadata)
File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 105, in setup_on_node
raise RuntimeError(msg) from e
RuntimeError: Failed to download nvidia/quality-classifier-deberta
What I noticed in Curator's code is that the setup_on_node method behaves differently from setup:
def setup_on_node(self, _node_info: NodeInfo | None = None, _worker_metadata: WorkerMetadata = None) -> None:
    try:
        snapshot_download(
            repo_id=self.model_identifier,
            cache_dir=self.cache_dir,
            token=self.hf_token,
            local_files_only=False,
        )
        self._setup(local_files_only=False)
    except Exception as e:
        msg = f"Failed to download {self.model_identifier}"
        raise RuntimeError(msg) from e
vs
def setup(self, _: WorkerMetadata | None = None) -> None:
    self._setup(local_files_only=True)
All this, considering that snapshot_download normally handles the presence of the HF_HUB_OFFLINE variable gracefully:
if not local_files_only:
    # try/except logic to handle different errors => taken from `hf_hub_download`
    try:
        # if we have internet connection we want to list files to download
        repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
    except (requests.exceptions.SSLError, requests.exceptions.ProxyError):
        # Actually raise for those subclasses of ConnectionError
        raise
    except (
        requests.exceptions.ConnectionError,
        requests.exceptions.Timeout,
        OfflineModeIsEnabled,  # <----- For this
    ) as error:
        # Internet connection is down
        # => will try to use local files only
        api_call_error = error
        pass
Maybe the variable is not being forwarded for some reason?
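If it is not purely a propagation problem, a code-side workaround would be to derive local_files_only from the flag that huggingface_hub already parses. This is just a sketch against the public huggingface_hub API, not a proposed Curator patch (the function name and parameters are hypothetical, mirroring the excerpt above):
from huggingface_hub import constants, snapshot_download

def offline_aware_setup_on_node(model_identifier: str, cache_dir: str, hf_token: str | None = None) -> None:
    # constants.HF_HUB_OFFLINE is parsed from the HF_HUB_OFFLINE env var at import time.
    offline = constants.HF_HUB_OFFLINE
    # With local_files_only=True, snapshot_download resolves the snapshot from
    # cache_dir without opening a connection and raises if files are missing.
    snapshot_download(
        repo_id=model_identifier,
        cache_dir=cache_dir,
        token=hf_token,
        local_files_only=offline,
    )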
For reference, I'm running the pipeline on the head node like this:
srun -N1 -n1 -c1 \
--overlap --overcommit --cpu-bind=none --mpi=none \
-w "${HEAD_NODE_NAME}" \
singularity exec \
--bind="$HEAD_CONTAINER_MOUNTS" \
--containall \
--nv \
"$IMAGE" \
bash -c "HF_HOME=$HF_HOME HF_HUB_OFFLINE=1 PYTHONPATH=$PYTHONPATH PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF RAY_ADDRESS=$RAY_GCS_ADDRESS $RUN_COMMAND"
Hi @federico-dambrosio, were you able to determine whether it was indeed a propagation issue? A setup like:
export HF_HOME=$HF_HOME
export HF_HUB_OFFLINE=1
...
srun --export=ALL -N1 -n1 -c1 \
...
(and/or inline exports inside the container, like bash -c "export HF_HOME=$HF_HOME && export HF_HUB_OFFLINE=1 && ...") may work in this case.
An explicit
import os
os.environ["HF_HUB_OFFLINE"] = "1"
at the very top of your Python script (i.e., before all other imports) may work too.
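One thing worth checking: huggingface_hub parses the flag into a constant at import time, so a snippet like this (just a quick sanity check) shows whether the process actually picked it up:
import os
os.environ["HF_HUB_OFFLINE"] = "1"   # must happen before huggingface_hub is imported

from huggingface_hub import constants
print(constants.HF_HUB_OFFLINE)      # True if offline mode was picked up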
It sounds like you are already setting the cache_dir correctly, but I wanted to verify that it looks something like /your/path/to/.cache/huggingface/hub?
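If in doubt, something along these lines can also confirm that the model is fully present in that cache (path is a placeholder; with local_files_only=True, snapshot_download resolves everything from the local cache and raises if anything is missing):
from huggingface_hub import snapshot_download

# No network access is attempted with local_files_only=True.
path = snapshot_download(
    repo_id="nvidia/quality-classifier-deberta",
    cache_dir="/your/path/to/.cache/huggingface/hub",
    local_files_only=True,
)
print(path)  # local snapshot directory if the cache is complete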
Hey @sarahyurick, thank you for the feedback! After a few tries and different ways to make sure the variables were propagated correctly, the pipeline started.
As far as I know, --export=ALL is the default even without specifying it, and what was weird is that checking the HF_HUB_OFFLINE variable inside the pipeline script gave me 1.
I think the trick in my case was that setting the variables only for the executed command (so, not exporting them inside the bash command) was not enough. Maybe they were set for the main process but not propagated to the child processes in the same container? (A quick way to check this is sketched at the end of this comment.)
So, I tried two ways:
- like you suggested, explicitly exporting the variables -> it worked
- exporting SINGULARITYENV_HF_HOME and SINGULARITYENV_HF_HUB_OFFLINE in the main sbatch script, which are then propagated to the singularity container process -> it worked
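Following up on the propagation question, in case it's useful to others: a quick way to see what the Ray worker processes actually inherit (plain Ray, no Curator) is something like:
import os
import ray

@ray.remote
def worker_env() -> dict:
    # Environment as seen by a Ray worker process on whichever node the task lands on.
    return {k: os.environ.get(k) for k in ("HF_HOME", "HF_HUB_OFFLINE")}

ray.init()  # picks up RAY_ADDRESS, so this connects to the same cluster
print("driver:", {k: os.environ.get(k) for k in ("HF_HOME", "HF_HUB_OFFLINE")})
print("worker:", ray.get(worker_env.remote()))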