[Bug] Error running lang_id and code_quality kfp pipelines

Open revit13 opened this issue 9 months ago • 3 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

KFP V1 workflows

What happened + What you expected to happen

Running make workflow-test under language/lang_id fails with the following error:

(orchestrate pid=273, ip=10.244.2.16) 10:35:49 INFO - Cluster resources: {'cpus': 4, 'gpus': 0, 'memory': 12.0, 'object_store': 3.1541931135579944}
(orchestrate pid=273, ip=10.244.2.16) 10:35:49 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
(RayTransformFileProcessor pid=273, ip=10.244.1.19) 10:35:50 ERROR - Exception creating transform  401 Client Error. (Request ID: Root=1-67aa4706-032e880645c8d1a22105e459;60b4b4c8-3d63-4ea1-97bb-42a7c98c2028)
(RayTransformFileProcessor pid=273, ip=10.244.1.19)
(RayTransformFileProcessor pid=273, ip=10.244.1.19) Repository Not Found for url: https://huggingface.co/facebook/fasttext-language-identification/resolve/main/model.bin.
(RayTransformFileProcessor pid=273, ip=10.244.1.19) Please make sure you specified the correct `repo_id` and `repo_type`.
(RayTransformFileProcessor pid=273, ip=10.244.1.19) If you are trying to access a private or gated repo, make sure you are authenticated.
(RayTransformFileProcessor pid=273, ip=10.244.1.19) Invalid credentials in Authorization header
(RayTransformFileProcessor pid=273, ip=10.244.1.19) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=273, ip=10.244.1.19, actor_id=777db29ec2b2a2ac5d1202fd02000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7f45b4d5af20>)
(RayTransformFileProcessor pid=273, ip=10.244.1.19)   File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
(RayTransformFileProcessor pid=273, ip=10.244.1.19)     raise HTTPError(http_error_msg, response=self)
(RayTransformFileProcessor pid=273, ip=10.244.1.19) requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/facebook/fasttext-language-identification/resolve/main/model.bin
(RayTransformFileProcessor pid=273, ip=10.244.1.19)
(RayTransformFileProcessor pid=273, ip=10.244.1.19) The above exception was the direct cause of the following exception:
(RayTransformFileProcessor pid=273, ip=10.244.1.19)
(RayTransformFileProcessor pid=273, ip=10.244.1.19) ray::RayTransformFileProcessor.__init__() (pid=273, ip=10.244.1.19, actor_id=777db29ec2b2a2ac5d1202fd02000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7f45b4d5af20>)
(RayTransformFileProcessor pid=273, ip=10.244.1.19)   File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/transform_file_processor.py", line 48, in __init__

Running make workflow-test under code/code_quality fails with the following error:

(orchestrate pid=273, ip=10.244.1.16) 10:25:52 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each
(RayTransformFileProcessor pid=272, ip=10.244.2.12) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(RayTransformFileProcessor pid=273, ip=10.244.2.12) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(RayTransformFileProcessor pid=272, ip=10.244.2.12) 10:25:55 ERROR - Exception creating transform  codeparrot/codeparrot is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
(RayTransformFileProcessor pid=272, ip=10.244.2.12) If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
(RayTransformFileProcessor pid=272, ip=10.244.2.12) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=272, ip=10.244.2.12, actor_id=748e3fe7abff912a43a93f8c02000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7f363d10f1f0>)
(RayTransformFileProcessor pid=272, ip=10.244.2.12)   File "/home/ray/anaconda3/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
(RayTransformFileProcessor pid=272, ip=10.244.2.12)     raise HTTPError(http_error_msg, response=self)
(RayTransformFileProcessor pid=272, ip=10.244.2.12) requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/codeparrot/codeparrot/resolve/main/tokenizer_config.json
(RayTransformFileProcessor pid=272, ip=10.244.2.12)
(RayTransformFileProcessor pid=272, ip=10.244.2.12) The above exception was the direct cause of the following exception:
(RayTransformFileProcessor pid=272, ip=10.244.2.12)
(RayTransformFileProcessor pid=272, ip=10.244.2.12) ray::RayTransformFileProcessor.__init__() (pid=272, ip=10.244.2.12, actor_id=748e3fe7abff912a43a93f8c02000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7f363d10f1f0>)

Reproduction script

Run make workflow-test under language/lang_id and under code/code_quality.

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

revit13, Feb 11 '25 04:02

Is this still observed? It looks like a temporary authentication issue with Hugging Face.

wget https://huggingface.co/facebook/fasttext-language-identification/resolve/main/model.bin

was working on my end without credentials.

shivdeep-singh-ibm, Feb 12 '25 11:02
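The wget check above can be repeated from inside the cluster, both with and without a token, to see whether the 401 comes from the credentials rather than from the repository. Below is a minimal sketch using only the Python standard library. The helper names (build_request, probe, diagnose) are made up for illustration, and the interpretation in diagnose is an assumption based on the "Invalid credentials in Authorization header" line in the log, not a confirmed root cause.

```python
import urllib.error
import urllib.request

MODEL_URL = (
    "https://huggingface.co/facebook/fasttext-language-identification"
    "/resolve/main/model.bin"
)


def build_request(url, token=None):
    """Build a HEAD request; add an Authorization header only if a token is given."""
    req = urllib.request.Request(url, method="HEAD")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def probe(url, token=None, opener=urllib.request.urlopen):
    """Return the HTTP status code for url instead of raising on 4xx/5xx."""
    try:
        with opener(build_request(url, token), timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def diagnose(anonymous_status, token_status):
    """Hypothetical interpretation of the two probes (an assumption, not a fact)."""
    if anonymous_status == 200 and token_status == 401:
        return "token rejected"  # public repo, but the supplied token is invalid
    if anonymous_status == 401:
        return "repo requires auth or is unavailable"
    if anonymous_status == 200 and token_status == 200:
        return "ok"
    return "inconclusive"
```

For example, `diagnose(probe(MODEL_URL), probe(MODEL_URL, token))` with the token the pipeline actually uses would return "token rejected" if anonymous access works but the token does not.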

The same error is showing up with GneissWeb Classification and HAP.

E   Invalid credentials in Authorization header
=========================== short test summary info ============================
ERROR test_gneissweb_classification.py::TestLangIdentificationTransform - huggingface_hub.errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/facebook/fasttext-language-identification/resolve/main/model.bin (Request ID: Root=1-67c4fd32-38256649061e6f1e1043d789;422f893d-b19f-4401-84ac-013a08f5b454)

touma-I, Mar 03 '25 17:03

I propose modifying the workflow to use pull_request_target so that the HF token can be shared securely with pull requests from external branches, and limiting who can run the workflow to a select few.

touma-I, Mar 03 '25 17:03
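As a complement to sharing the token, the code that downloads the model could treat a missing or empty token as "no token" and fall back to anonymous access, since the repositories in the logs appear to be public. The sketch below assumes the token is supplied through an HF_TOKEN environment variable and that the download goes through huggingface_hub.hf_hub_download; neither is confirmed by this thread. resolve_hf_token and download_model are hypothetical helper names.

```python
import os


def resolve_hf_token(env=None):
    """Return the token string, or False when HF_TOKEN is unset or blank.

    Passing token=False to huggingface_hub tells it not to send an
    Authorization header, so the request is made anonymously.
    """
    env = os.environ if env is None else env
    raw = (env.get("HF_TOKEN") or "").strip()
    return raw if raw else False


def download_model(repo_id, filename):
    """Download a file from the Hub, anonymously if no usable token is set."""
    # Imported lazily so that resolve_hf_token can be used without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        token=resolve_hf_token(),
    )
```

This does not help for private or gated repositories, which still need a valid token.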