
XennaExecutor - RuntimeError: Ray Client is already connected. when RAY_ADDRESS is set to remote cluster

Open federico-dambrosio opened this issue 3 months ago • 5 comments

Describe the bug

I'm setting up a Ray cluster and then running a pipeline after setting the RAY_ADDRESS variable.

Steps/Code to reproduce bug

RAY_ADDRESS=ray://$head_node_ip:10001

This is the pipeline:

pipeline = Pipeline(
    name="FinePDFs Pipeline",
    description="Pipeline to process FinePDFs dataset",
)
pipeline.add_stage(
    ParquetReader(file_paths=data_paths, fields=["text", "language"], blocksize="128MB", read_kwargs={"dtype_backend":"numpy_nullable"})
)
pipeline.add_stage(
    QualityClassifier(cache_dir=HF_CACHE, pred_column="quality", model_inference_batch_size=256)
)
pipeline.add_stage(
    ParquetWriter(str(output_dir), write_kwargs={"partition_cols": ["language", "quality"]})
)

pipeline.describe()
executor = XennaExecutor(
    config={
        "execution_mode": "streaming",
    }
)
results = pipeline.run(executor=executor)

And this is the log, which terminates with an error suggesting that ray.init is being called twice:

2025-10-08 18:05:43.753 | INFO     | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'parquet_reader' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.754 | INFO     | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'quality_classifier_deberta_classifier' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.759 | INFO     | nemo_curator.pipeline.pipeline:add_stage:61 - Added stage 'parquet_writer' to pipeline 'FinePDFs Pipeline'
2025-10-08 18:05:43.759 | INFO     | nemo_curator.pipeline.pipeline:build:70 - Planning pipeline: FinePDFs Pipeline
2025-10-08 18:05:43.760 | INFO     | nemo_curator.pipeline.pipeline:_decompose_stages:106 - Decomposing composite stage: parquet_reader
2025-10-08 18:05:43.760 | INFO     | nemo_curator.pipeline.pipeline:_decompose_stages:120 - Expanded 'parquet_reader' into 2 execution stages
2025-10-08 18:05:43.760 | INFO     | nemo_curator.pipeline.pipeline:_decompose_stages:106 - Decomposing composite stage: quality_classifier_deberta_classifier
2025-10-08 18:05:43.760 | INFO     | nemo_curator.pipeline.pipeline:_decompose_stages:120 - Expanded 'quality_classifier_deberta_classifier' into 2 execution stages
2025-10-08 18:05:43.760 | INFO     | nemo_curator.backends.xenna.executor:execute:129 - Execution mode: STREAMING
2025-10-08 18:05:43,831	INFO worker.py:1630 -- Using address ray://10.1.0.64:10001 set in the environment variable RAY_ADDRESS
2025-10-08 18:05:43,908	INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: ignore_reinit_error, log_to_driver
2025-10-08 18:05:51.642 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.642 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.642 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.643 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=2, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.644 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=1)
2025-10-08 18:05:51.644 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)
2025-10-08 18:05:51.644 | INFO     | nemo_curator.backends.xenna.adapter:required_resources:43 - Resources: Resources(cpus=1.0, gpu_memory_gb=0.0, nvdecs=0, nvencs=0, entire_gpu=False, gpus=0.0)

...

Here everything still looks ok, but then:

2025-10-08 18:05:52.694 | INFO     | cosmos_xenna.pipelines.private.scheduling.autoscaling_algorithms:run_fragmentation_autoscaler:792 - Running phase 4...
2025-10-08 18:05:52.699 | INFO     | cosmos_xenna.pipelines.private.streaming:run_pipeline:428 - Done calculating autoscaling...
(Stage 00 - FilePartitioningStage pid=1000177) 2025-10-08 18:06:00.226 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 00 - FilePartitioningStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 00 - FilePartitioningStage pid=1000177) 2025-10-08 18:06:00.226 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 00 - FilePartitioningStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 01 - ParquetReaderStage pid=1000178) 2025-10-08 18:06:00.334 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 01 - ParquetReaderStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 01 - ParquetReaderStage pid=1000178) 2025-10-08 18:06:00.334 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 01 - ParquetReaderStage on node=877ffd8b9d049c534f32b20613d5383f757f8e2824df42ec1d71ef7a
(Stage 00 - FilePartitioningStage pid=2582737, ip=10.1.0.68) 2025-10-08 18:06:00.365 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 00 - FilePartitioningStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 00 - FilePartitioningStage pid=2582737, ip=10.1.0.68) 2025-10-08 18:06:00.366 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 00 - FilePartitioningStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 01 - ParquetReaderStage pid=2582739, ip=10.1.0.68) 2025-10-08 18:06:00.364 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:389 - Setting up actor for stage=Stage 01 - ParquetReaderStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
(Stage 01 - ParquetReaderStage pid=2582739, ip=10.1.0.68) 2025-10-08 18:06:00.364 | INFO     | cosmos_xenna.ray_utils.stage_worker:setup_on_node:392 - Finished setting up actor for stage=Stage 01 - ParquetReaderStage on node=95a771687958b11f60c21c002353998b3efa4994387964ce2975ce55
2025-10-08 18:06:04.988 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:349 - Took 0.2464303970336914 seconds to get node resource info.
2025-10-08 18:06:04,993	INFO client_builder.py:242 -- Passing the following kwargs to ray.init() on the server: log_to_driver
2025-10-08 18:06:04.993 | ERROR    | nemo_curator.backends.xenna.executor:execute:144 - Pipeline execution failed: Ray Client is already connected. Maybe you called ray.init("ray://<address>") twice by accident?
Traceback (most recent call last):
  File "/leonardo_work/iGen_train/fdambro1/ai-dataset-ingestion/src/domyn_data_pipelines/processing/finepdfs/pipeline_new_api.py", line 87, in <module>
    results = pipeline.run(executor=executor)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/pipeline/pipeline.py", line 197, in run
    return executor.execute(self.stages, initial_tasks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/Curator/nemo_curator/backends/xenna/executor.py", line 141, in execute
    results = pipelines_v1.run_pipeline(pipeline_spec)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/pipelines.py", line 168, in run_pipeline
    return streaming.run_pipeline(pipeline_spec, cluster_resources)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/streaming.py", line 446, in run_pipeline
    if monitor.update(len(input_queue), queue_lengths, pool_extra_metadatas) and (last_stats is not None):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 323, in update
    stats = PipelinestatsWithTime(time.time(), self._make_stats(input_len, ext_output_lens, task_metadata_per_pool))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 352, in _make_stats
    cluster_info = make_ray_cluster_info()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 245, in make_ray_cluster_info
    actors=get_ray_actors(),
           ^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/monitoring.py", line 138, in get_ray_actors
    actors_data: Any = ray.util.state.list_actors(filters=[("state", "=", "ALIVE")], limit=limit)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/state/api.py", line 817, in list_actors
    return StateApiClient(address=address).list(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/state/api.py", line 140, in __init__
    api_server_url = get_address_for_submission_client(address)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/dashboard/utils.py", line 728, in get_address_for_submission_client
    address = ray_client_address_to_api_server_url(address)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/dashboard/utils.py", line 650, in ray_client_address_to_api_server_url
    with ray.init(address=address) as client_context:
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/_private/worker.py", line 1655, in init
    ctx = builder.connect()
          ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/client_builder.py", line 173, in connect
    client_info_dict = ray.util.client_connect.connect(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/ray/util/client_connect.py", line 42, in connect
    raise RuntimeError(
RuntimeError: Ray Client is already connected. Maybe you called ray.init("ray://<address>") twice by accident?

Expected behavior

The pipeline should be submitted to the cluster, and Ray should not be initialized twice.

Additional context

This is running on Slurm with Singularity as the container engine (no pyxis available). If needed, I can share the script being executed; it's very similar to the one in the current PR #1168. Please note that on a single node there is no issue whatsoever.

federico-dambrosio avatar Oct 08 '25 16:10 federico-dambrosio

Thanks for opening @federico-dambrosio. Quick update: I have been able to reproduce but don't have a concrete root cause yet. We'll share an update soon when we have something.

ayushdg avatar Oct 09 '25 20:10 ayushdg

Hi, can you please try the following changes:

https://github.com/NVIDIA-NeMo/Curator/pull/1168#discussion_r2419118931
https://github.com/NVIDIA-NeMo/Curator/pull/1168#discussion_r2419121802

I was unable to get a Slurm job running, but I was able to reproduce this error locally, and the above fixed it.

abhinavg4 avatar Oct 10 '25 09:10 abhinavg4

Hey @abhinavg4, that worked! For future reference, since our Slurm cluster uses Singularity, I had to update the script so that the head process and the pipeline process share the same /tmp (simply by adding --bind /shared/fs/folder:/tmp), because I was getting this warning:

2025-10-10 12:46:08,068	INFO node.py:1016 -- Can't find a `node_ip_address.json` file from /tmp/ray_21409494/session_2025-10-10_12-44-36_563161_13. Have you started Ray instance using `ray start` or `ray.init`?

Still, I'm facing another issue: our Slurm cluster's compute nodes do not have access to the internet. What I'm doing is the following (roughly the setup sketched after this list):

  • setting the cache_dir for the models I'm using (same pipeline; this works on a single node)
  • setting the HF_HOME env variable (from which HF_HUB_CACHE is derived as $HF_HOME/hub)
  • setting HF_HUB_OFFLINE=1, which should force the use of offline files
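
For context, this is roughly what that setup amounts to in code (a minimal sketch with placeholder paths, not my real ones):

import os

# Placeholder paths for illustration only.
os.environ["HF_HOME"] = "/shared/fs/hf_cache"  # HF_HUB_CACHE then resolves to $HF_HOME/hub
os.environ["HF_HUB_OFFLINE"] = "1"             # huggingface_hub reads this flag at import time

# The same location is what I pass as cache_dir (HF_CACHE) to QualityClassifier in the pipeline above.
HF_CACHE = os.environ["HF_HOME"] + "/hub"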

But the nodes are still trying to connect to the internet:

2025-10-10 12:55:23.088 | ERROR    | cosmos_xenna.ray_utils.actor_pool:_move_pending_node_actor_to_pending:764 - Unexpected error getting node setup result for node 8d04bc70fd3ca10bfa269bf4814ed02e6f7353a7137e02ea8956be2c, actor 11: ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/opt/venv/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 488, in _make_request
    raise new_e
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 464, in _make_request
    self._validate_conn(conn)
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
    conn.connect()
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connection.py", line 753, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connection.py", line 213, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
  File "/opt/venv/lib/python3.12/site-packages/requests/adapters.py", line 644, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/nvidia/quality-classifier-deberta/tree/main/additional_chat_templates?recursive=False&expand=False (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
  File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 102, in setup_on_node
    self._setup(local_files_only=False)
  File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 115, in _setup
    self.tokenizer = AutoTokenizer.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 1116, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2012, in from_pretrained
    for template in list_repo_templates(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py", line 167, in list_repo_templates
    return [
           ^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/hf_api.py", line 3177, in list_repo_tree
    for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_pagination.py", line 36, in paginate
    r = session.get(path, params=params, headers=headers)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/huggingface_hub/utils/_http.py", line 96, in send
    return super().send(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/requests/adapters.py", line 677, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/nvidia/quality-classifier-deberta/tree/main/additional_chat_templates?recursive=False&expand=False (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x14e6f2f14860>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 8c9d8a50-925c-46c6-821e-5992c24c8d34)')

The above exception was the direct cause of the following exception:

ray::StageWorker.setup_on_node() (pid=287, ip=10.2.0.154, actor_id=30acb765c23cd5dc1d99051601000000, repr=Stage 02 - TokenizerStage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 390, in setup_on_node
    retry.do_with_retries(func_to_call, max_attempts=self._params.num_node_setup_retries)
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/utils/retry.py", line 56, in do_with_retries
    return func()
           ^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/ray_utils/stage_worker.py", line 386, in func_to_call
    return self._stage_interface.setup_on_node(resources.NodeInfo(node_location), copy.deepcopy(metadata))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/cosmos_xenna/pipelines/private/specs.py", line 432, in setup_on_node
    self._stage.setup_on_node(node_info, worker_metadata)
  File "/opt/Curator/nemo_curator/backends/xenna/adapter.py", line 88, in setup_on_node
    super().setup_on_node(generic_node_info, generic_worker_metadata)
  File "/opt/Curator/nemo_curator/backends/base.py", line 105, in setup_on_node
    self.stage.setup_on_node(node_info, worker_metadata)
  File "/opt/Curator/nemo_curator/stages/text/models/tokenizer.py", line 105, in setup_on_node
    raise RuntimeError(msg) from e
RuntimeError: Failed to download nvidia/quality-classifier-deberta

What I noticed in Curator's code is that the setup_on_node method is different from setup:

    def setup_on_node(self, _node_info: NodeInfo | None = None, _worker_metadata: WorkerMetadata = None) -> None:
        try:
            snapshot_download(
                repo_id=self.model_identifier,
                cache_dir=self.cache_dir,
                token=self.hf_token,
                local_files_only=False,
            )
            self._setup(local_files_only=False)
        except Exception as e:
            msg = f"Failed to download {self.model_identifier}"
            raise RuntimeError(msg) from e

vs

    def setup(self, _: WorkerMetadata | None = None) -> None:
        self._setup(local_files_only=True)

All this, considering that snapshot_download normally handles the presence of the HF_HUB_OFFLINE variable gracefully:

    if not local_files_only:
        # try/except logic to handle different errors => taken from `hf_hub_download`
        try:
            # if we have internet connection we want to list files to download
            repo_info = api.repo_info(repo_id=repo_id, repo_type=repo_type, revision=revision)
        except (requests.exceptions.SSLError, requests.exceptions.ProxyError):
            # Actually raise for those subclasses of ConnectionError
            raise
        except (
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            OfflineModeIsEnabled, <----- For this
        ) as error:
            # Internet connection is down
            # => will try to use local files only
            api_call_error = error
            pass

Maybe the variable is not being forwarded for some reason?
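
A quick way to check whether the variable actually reaches the worker processes on the compute nodes (a minimal sketch, reusing the same RAY_ADDRESS Ray Client connection as the pipeline) would be something like:

import os
import ray

ray.init()  # picks up RAY_ADDRESS from the environment, like the pipeline run does

@ray.remote
def show_hf_env() -> dict:
    # Runs on a worker process somewhere in the cluster.
    return {
        "node": os.uname().nodename,
        "HF_HOME": os.environ.get("HF_HOME"),
        "HF_HUB_OFFLINE": os.environ.get("HF_HUB_OFFLINE"),
    }

print(ray.get([show_hf_env.remote() for _ in range(4)]))

If the workers report None for HF_HUB_OFFLINE while the driver sees 1, the variable is being lost somewhere between the sbatch script and the worker processes.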

For reference, I'm running the pipeline on the head node like this:

srun -N1 -n1 -c1 \
  --overlap --overcommit --cpu-bind=none --mpi=none \
  -w "${HEAD_NODE_NAME}" \
  singularity exec \
    --bind="$HEAD_CONTAINER_MOUNTS" \
    --containall \
    --nv \
    "$IMAGE" \
    bash -c "HF_HOME=$HF_HOME HF_HUB_OFFLINE=1 PYTHONPATH=$PYTHONPATH PYTORCH_CUDA_ALLOC_CONF=$PYTORCH_CUDA_ALLOC_CONF RAY_ADDRESS=$RAY_GCS_ADDRESS $RUN_COMMAND"

federico-dambrosio avatar Oct 10 '25 12:10 federico-dambrosio

Hi @federico-dambrosio, were you able to determine whether it was indeed a propagation issue? A setup like:

export HF_HOME=$HF_HOME
export HF_HUB_OFFLINE=1
...
srun --export=ALL -N1 -n1 -c1 \
...

(and/or inline exports inside the container, like bash -c "export HF_HOME=$HF_HOME && export HF_HUB_OFFLINE=1 && ...") may work in this case.

An explicit

import os

os.environ["HF_HUB_OFFLINE"] = "1"

at the very top of your Python script (i.e., before all other imports) may work too.

It sounds like you are already setting the cache_dir correctly, but I wanted to verify that it looks something like /your/path/to/.cache/huggingface/hub?

sarahyurick avatar Oct 11 '25 02:10 sarahyurick

Hey @sarahyurick, thank you for the feedback! After a few tries and different ways of making sure the variables were propagated correctly, the pipeline started. As far as I know, --export=ALL is the default even without specifying it, and what was weird is that checking the HF_HUB_OFFLINE variable inside the pipeline script gave me 1.

I think the trick in my case was that setting the variables only for the executed command (i.e., not exporting them inside the bash command) was not enough. Maybe they were set for the main process but not propagated to the child processes in the same container?

So, I tried two ways:

  • like you suggested, explicitly exporting the variables -> it worked
  • exporting SINGULARITYENV_HF_HOME and SINGULARITYENV_HF_HUB_OFFLINE in the main sbatch script, which are then propagated to the Singularity container process -> it worked

federico-dambrosio avatar Oct 13 '25 07:10 federico-dambrosio