[BUG-python/deployment] Cannot upload dataset with responses and `user_id`s

Open kursathalat opened this issue 1 year ago • 1 comments

Describe the bug

Uploading a dataset that is already annotated and contains user_ids to the server fails. The upload only succeeds if I delete all user_ids, which is not ideal.

Stacktrace and Code to create the bug

import argilla as rg
multilabel_ds_local = rg.FeedbackDataset.from_huggingface("argilla/go_emotions_raw", split="train[:100]")
multilabel_ds_local.push_to_argilla("go_dataset")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:3                                                                                    │
│                                                                                                  │
│   1 import argilla as rg                                                                         │
│   2 multilabel_ds_local = rg.FeedbackDataset.from_huggingface("argilla/go_emotions_raw", spl     │
│ ❱ 3 multilabel_ds_local.push_to_argilla("go_dataset")                                            │
│   4                                                                                              │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/local/mixins.py:287 in push_to_argilla │
│                                                                                                  │
│    284 │   │   │   return remote_dataset                                                         │
│    285 │   │   except Exception as ex:                                                           │
│    286 │   │   │   ArgillaMixin.__delete_dataset(client=httpx_client, id=created_dataset.id)     │
│ ❱  287 │   │   │   raise ex                                                                      │
│    288 │                                                                                         │
│    289 │   @staticmethod                                                                         │
│    290 │   def __get_fields(client: "httpx.Client", id: UUID) -> List["AllowedRemoteFieldTypes"  │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/local/mixins.py:279 in push_to_argilla │
│                                                                                                  │
│    276 │   │   │   )                                                                             │
│    277 │   │   │                                                                                 │
│    278 │   │   │   if len(self.records) > 0:                                                     │
│ ❱  279 │   │   │   │   remote_dataset.add_records(self.records, show_progress)                   │
│    280 │   │   │                                                                                 │
│    281 │   │   │   _LOGGER.info("✓ Dataset succesfully pushed to Argilla")                       │
│    282 │   │   │   _LOGGER.info(remote_dataset)                                                  │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/remote/dataset.py:593 in add_records   │
│                                                                                                  │
│    590 │   │   │   │   `FeedbackRecord`, list of dictionaries as a record or dictionary as a     │
│    591 │   │   │   │   record; or if the given records do not match the expected schema.         │
│    592 │   │   """                                                                               │
│ ❱  593 │   │   self._records.add(records=records, show_progress=show_progress)                   │
│    594 │                                                                                         │
│    595 │   @allowed_for_roles(roles=[UserRole.owner, UserRole.admin])                            │
│    596 │   def find_similar_records(                                                             │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/utils.py:61 in wrapper                                  │
│                                                                                                  │
│    58 │   │   │   │   │   f"User with role={user.role} is not allowed to call `{func.__name__}   │
│    59 │   │   │   │   │   f" Only users with role={roles} are allowed to call this function."    │
│    60 │   │   │   │   )                                                                          │
│ ❱  61 │   │   │   return func(*args, **kwargs)                                                   │
│    62 │   │                                                                                      │
│    63 │   │   return wrapper                                                                     │
│    64                                                                                            │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/remote/dataset.py:189 in add           │
│                                                                                                  │
│    186 │   │   │                                                                                 │
│    187 │   │   │   for i in range(0, len(records), PUSHING_BATCH_SIZE):                          │
│    188 │   │   │   │   batch = records[i : i + PUSHING_BATCH_SIZE]                               │
│ ❱  189 │   │   │   │   datasets_api_v1.add_records(                                              │
│    190 │   │   │   │   │   client=self._client,                                                  │
│    191 │   │   │   │   │   id=self.dataset.id,                                                   │
│    192 │   │   │   │   │   records=[record.to_server_payload(question_name_to_id=question_name_  │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/sdk/v1/datasets/api.py:327 in add_records               │
│                                                                                                  │
│   324 │                                                                                          │
│   325 │   if response.status_code == 204:                                                        │
│   326 │   │   return Response.from_httpx_response(response)                                      │
│ ❱ 327 │   return handle_response_error(response)                                                 │
│   328                                                                                            │
│   329                                                                                            │
│   330 def update_records(                                                                        │
│                                                                                                  │
│ /Users/kursat/argilla/src/argilla/client/sdk/commons/errors_handler.py:63 in                     │
│ handle_response_error                                                                            │
│                                                                                                  │
│   60 │   │   error_type = GenericApiError                                                        │
│   61 │   else:                                                                                   │
│   62 │   │   raise HttpResponseError(response=response)                                          │
│ ❱ 63 │   raise error_type(**error_args)                                                          │
│   64                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationApiError: Argilla server returned an error with http status: 422. Error details: {'response': 'Record at 
position 0 is not valid because response at position 0 is not valid: user_id=00000000-0000-0000-0000-000000000001 
does not exist', 'params': None}

kursathalat avatar Dec 19 '23 08:12 kursathalat

🤖

Hello @kursathalat! 👋 I'm Dosu, a friendly bot here to assist you with bugs, answer your questions, and guide you on your open-source journey. While we're waiting for a human maintainer, feel free to ask me anything about the Argilla project. I'm here to make your coding life easier! 🚀

Thank you for providing detailed information about the issue you're experiencing. The error occurs because the Argilla server validates the user_id attached to each response in the dataset. If a user_id does not exist on the server, the validation raises a ValueError.

This is evident from the validate_user function in the Argilla codebase:

async def validate_user(db: "AsyncSession", user_id: UUID, users_ids: Optional[Set[UUID]]) -> Set[UUID]:
    if not users_ids:
        users_ids = set()

    if user_id not in users_ids:
        if not await accounts.user_exists(db, user_id):
            raise ValueError(f"user_id={str(user_id)} does not exist")
        users_ids.add(user_id)

    return users_ids

The function calls accounts.user_exists to check whether the user_id exists on the server. If it does not, the function raises a ValueError with the message "user_id={str(user_id)} does not exist", which matches the text of the validation error in your traceback. You can find this function in the src/argilla/server/contexts/datasets.py file.
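To make the check concrete, here is a minimal, self-contained sketch of the same logic. It is not the Argilla server code: the in-memory EXISTING_USERS set and the local user_exists helper are hypothetical stand-ins for the database lookup done by accounts.user_exists, and the db argument is dropped.

```python
import asyncio
from typing import Optional, Set
from uuid import UUID

# Hypothetical stand-in for the users stored on the server.
EXISTING_USERS: Set[UUID] = {UUID("00000000-0000-0000-0000-0000000000aa")}


async def user_exists(user_id: UUID) -> bool:
    # Assumption: replaces the real database query with a set lookup.
    return user_id in EXISTING_USERS


async def validate_user(user_id: UUID, users_ids: Optional[Set[UUID]]) -> Set[UUID]:
    # users_ids caches ids that were already validated, so each id is looked up once.
    if not users_ids:
        users_ids = set()

    if user_id not in users_ids:
        if not await user_exists(user_id):
            raise ValueError(f"user_id={str(user_id)} does not exist")
        users_ids.add(user_id)

    return users_ids


def check(user_id: UUID) -> str:
    # Runs the validation and returns "ok" or the error message.
    try:
        asyncio.run(validate_user(user_id, None))
        return "ok"
    except ValueError as ex:
        return str(ex)


if __name__ == "__main__":
    print(check(UUID("00000000-0000-0000-0000-0000000000aa")))
    print(check(UUID("00000000-0000-0000-0000-000000000001")))
```

Under these assumptions, an id that is missing from the server-side set produces the same message that appears in the 422 response.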

As for the push_to_argilla function, it is used to push a FeedbackDataset to Argilla. This function takes in the name of the dataset, an optional workspace, and an optional show_progress parameter. It returns a RemoteFeedbackDataset which is the dataset that has been pushed to Argilla. You can find more about this function in the src/argilla/client/feedback/dataset/local/mixins.py file.

I also found a similar issue, ValidationError When Logging Data, which was resolved by updating the version of Rubrix. However, I'm not sure if this solution applies to your case.

Please ensure that the user_ids in your dataset exist in the system before pushing the dataset to Argilla. If you continue to experience this issue, could you please provide more information about your environment, such as the Argilla version, ElasticSearch version, and Docker Image you're using? This will help us better understand and address the issue.
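One way to run that check before pushing is sketched below. It is only an illustration under stated assumptions: records are represented as plain dictionaries with a "responses" list whose items carry a "user_id" key, which is not the Argilla client's own record type, and find_unknown_user_ids and remap_user_ids are hypothetical helpers. How you obtain the set of user ids that exist on your server is left to you.

```python
from typing import Any, Dict, Iterable, List, Set

Record = Dict[str, Any]


def find_unknown_user_ids(records: Iterable[Record], known_user_ids: Set[str]) -> Set[str]:
    """Collect user_ids used in responses that are not in known_user_ids."""
    unknown: Set[str] = set()
    for record in records:
        for response in record.get("responses", []):
            user_id = response.get("user_id")
            if user_id is not None and user_id not in known_user_ids:
                unknown.add(user_id)
    return unknown


def remap_user_ids(records: Iterable[Record], mapping: Dict[str, str]) -> List[Record]:
    """Return copies of the records with response user_ids replaced via mapping."""

    def remap(response: Dict[str, Any]) -> Dict[str, Any]:
        user_id = response.get("user_id")
        if user_id in mapping:
            return {**response, "user_id": mapping[user_id]}
        return dict(response)

    remapped: List[Record] = []
    for record in records:
        copy = dict(record)
        if "responses" in record:
            copy["responses"] = [remap(r) for r in record["responses"]]
        remapped.append(copy)
    return remapped


if __name__ == "__main__":
    records = [
        {
            "fields": {"text": "example"},
            "responses": [{"user_id": "00000000-0000-0000-0000-000000000001"}],
        }
    ]
    known = {"00000000-0000-0000-0000-0000000000aa"}
    print(find_unknown_user_ids(records, known))
```

Remapping keeps the existing annotations and attributes them to a user that does exist, instead of deleting every user_id. The input records are left unmodified.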

dosubot[bot] avatar Dec 19 '23 08:12 dosubot[bot]

This will be tackled together with: https://github.com/argilla-io/argilla/issues/4328

nataliaElv avatar Feb 29 '24 10:02 nataliaElv