argilla
[BUG-python/deployment] Cannot upload dataset with responses and `user_id`s
Describe the bug Uploading a dataset that is already annotated and whose responses carry user_ids to the server fails. If I delete all user_ids, the upload succeeds, which is not ideal.
Stacktrace and Code to create the bug
import argilla as rg
multilabel_ds_local = rg.FeedbackDataset.from_huggingface("argilla/go_emotions_raw", split="train[:100]")
multilabel_ds_local.push_to_argilla("go_dataset")
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:3 │
│ │
│ 1 import argilla as rg │
│ 2 multilabel_ds_local = rg.FeedbackDataset.from_huggingface("argilla/go_emotions_raw", spl │
│ ❱ 3 multilabel_ds_local.push_to_argilla("go_dataset") │
│ 4 │
│ │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/local/mixins.py:287 in push_to_argilla │
│ │
│ 284 │ │ │ return remote_dataset │
│ 285 │ │ except Exception as ex: │
│ 286 │ │ │ ArgillaMixin.__delete_dataset(client=httpx_client, id=created_dataset.id) │
│ ❱ 287 │ │ │ raise ex │
│ 288 │ │
│ 289 │ @staticmethod │
│ 290 │ def __get_fields(client: "httpx.Client", id: UUID) -> List["AllowedRemoteFieldTypes" │
│ │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/local/mixins.py:279 in push_to_argilla │
│ │
│ 276 │ │ │ ) │
│ 277 │ │ │ │
│ 278 │ │ │ if len(self.records) > 0: │
│ ❱ 279 │ │ │ │ remote_dataset.add_records(self.records, show_progress) │
│ 280 │ │ │ │
│ 281 │ │ │ _LOGGER.info("✓ Dataset succesfully pushed to Argilla") │
│ 282 │ │ │ _LOGGER.info(remote_dataset) │
│ │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/remote/dataset.py:593 in add_records │
│ │
│ 590 │ │ │ │ `FeedbackRecord`, list of dictionaries as a record or dictionary as a │
│ 591 │ │ │ │ record; or if the given records do not match the expected schema. │
│ 592 │ │ """ │
│ ❱ 593 │ │ self._records.add(records=records, show_progress=show_progress) │
│ 594 │ │
│ 595 │ @allowed_for_roles(roles=[UserRole.owner, UserRole.admin]) │
│ 596 │ def find_similar_records( │
│ │
│ /Users/kursat/argilla/src/argilla/client/utils.py:61 in wrapper │
│ │
│ 58 │ │ │ │ │ f"User with role={user.role} is not allowed to call `{func.__name__} │
│ 59 │ │ │ │ │ f" Only users with role={roles} are allowed to call this function." │
│ 60 │ │ │ │ ) │
│ ❱ 61 │ │ │ return func(*args, **kwargs) │
│ 62 │ │ │
│ 63 │ │ return wrapper │
│ 64 │
│ │
│ /Users/kursat/argilla/src/argilla/client/feedback/dataset/remote/dataset.py:189 in add │
│ │
│ 186 │ │ │ │
│ 187 │ │ │ for i in range(0, len(records), PUSHING_BATCH_SIZE): │
│ 188 │ │ │ │ batch = records[i : i + PUSHING_BATCH_SIZE] │
│ ❱ 189 │ │ │ │ datasets_api_v1.add_records( │
│ 190 │ │ │ │ │ client=self._client, │
│ 191 │ │ │ │ │ id=self.dataset.id, │
│ 192 │ │ │ │ │ records=[record.to_server_payload(question_name_to_id=question_name_ │
│ │
│ /Users/kursat/argilla/src/argilla/client/sdk/v1/datasets/api.py:327 in add_records │
│ │
│ 324 │ │
│ 325 │ if response.status_code == 204: │
│ 326 │ │ return Response.from_httpx_response(response) │
│ ❱ 327 │ return handle_response_error(response) │
│ 328 │
│ 329 │
│ 330 def update_records( │
│ │
│ /Users/kursat/argilla/src/argilla/client/sdk/commons/errors_handler.py:63 in │
│ handle_response_error │
│ │
│ 60 │ │ error_type = GenericApiError │
│ 61 │ else: │
│ 62 │ │ raise HttpResponseError(response=response) │
│ ❱ 63 │ raise error_type(**error_args) │
│ 64 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValidationApiError: Argilla server returned an error with http status: 422. Error details: {'response': 'Record at
position 0 is not valid because response at position 0 is not valid: user_id=00000000-0000-0000-0000-000000000001
does not exist', 'params': None}
🤖
Hello @kursathalat! 👋 I'm Dosu, a friendly bot here to assist you with bugs, answer your questions, and guide you on your open-source journey. While we're waiting for a human maintainer, feel free to ask me anything about the Argilla project. I'm here to make your coding life easier! 🚀
Thank you for providing detailed information about the issue you're experiencing. The error message you're seeing is due to the Argilla platform validating the user_ids associated with the dataset. If a user_id does not exist in the system, it raises a ValueError.
This is evident from the validate_user function in the Argilla codebase:
async def validate_user(db: "AsyncSession", user_id: UUID, users_ids: Optional[Set[UUID]]) -> Set[UUID]:
    if not users_ids:
        users_ids = set()
    if user_id not in users_ids:
        if not await accounts.user_exists(db, user_id):
            raise ValueError(f"user_id={str(user_id)} does not exist")
        users_ids.add(user_id)
    return users_ids
This function checks whether the user_id exists in the system by calling the accounts.user_exists function. If the user_id does not exist, it raises a ValueError with the message "user_id={str(user_id)} does not exist". This is likely the validation error you are encountering. You can find this function in the src/argilla/server/contexts/datasets.py file.
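To make the check concrete, here is a minimal, self-contained sketch of the same logic that runs without a server. It is not the Argilla implementation: KNOWN_USERS and this user_exists are hypothetical stand-ins for the database lookup, and the db argument is dropped.

```python
import asyncio
from typing import Optional, Set
from uuid import UUID

# Hypothetical stand-in for the server's user table (an assumption for this sketch).
KNOWN_USERS: Set[UUID] = {UUID("11111111-1111-1111-1111-111111111111")}


async def user_exists(user_id: UUID) -> bool:
    # Stand-in for accounts.user_exists(db, user_id); no database involved.
    return user_id in KNOWN_USERS


async def validate_user(user_id: UUID, users_ids: Optional[Set[UUID]] = None) -> Set[UUID]:
    # users_ids caches IDs that were already validated, so each one is looked up once.
    if not users_ids:
        users_ids = set()
    if user_id not in users_ids:
        if not await user_exists(user_id):
            raise ValueError(f"user_id={user_id} does not exist")
        users_ids.add(user_id)
    return users_ids


def check(user_id: UUID) -> str:
    # Run the validation and report either "ok" or the error message.
    try:
        asyncio.run(validate_user(user_id))
        return "ok"
    except ValueError as err:
        return str(err)
```

Calling check with the ID from the traceback, which is absent from the stand-in table, returns the same "does not exist" text that the server wraps in its 422 response.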
As for the push_to_argilla function, it is used to push a FeedbackDataset to Argilla. This function takes in the name of the dataset, an optional workspace, and an optional show_progress parameter. It returns a RemoteFeedbackDataset, which is the dataset that has been pushed to Argilla. You can find more about this function in the src/argilla/client/feedback/dataset/local/mixins.py file.
I also found a similar issue, ValidationError When Logging Data, which was resolved by updating the version of Rubrix. However, I'm not sure if this solution applies to your case.
Please ensure that the user_ids in your dataset exist in the system before pushing the dataset to Argilla. If you continue to experience this issue, could you please provide more information about your environment, such as the Argilla version, ElasticSearch version, and Docker Image you're using? This will help us better understand and address the issue.
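One way to run that check before pushing is sketched below. It deliberately avoids the Argilla client: records are modelled as plain dictionaries with a "responses" list, and find_unknown_user_ids, strip_unknown_user_ids, and known_user_ids are hypothetical names for this illustration. How you obtain the set of existing user IDs from your server is left open. Stripping the IDs mirrors the workaround described in the report and loses the annotator attribution.

```python
from copy import deepcopy
from typing import Any, Dict, List, Set
from uuid import UUID

Record = Dict[str, Any]


def _is_unknown(user_id: Any, known_user_ids: Set[UUID]) -> bool:
    # A missing user_id is not "unknown"; only concrete IDs absent from the server are.
    return user_id is not None and UUID(str(user_id)) not in known_user_ids


def find_unknown_user_ids(records: List[Record], known_user_ids: Set[UUID]) -> Set[UUID]:
    # Collect every response user_id that the server would reject.
    unknown: Set[UUID] = set()
    for record in records:
        for response in record.get("responses", []):
            if _is_unknown(response.get("user_id"), known_user_ids):
                unknown.add(UUID(str(response["user_id"])))
    return unknown


def strip_unknown_user_ids(records: List[Record], known_user_ids: Set[UUID]) -> List[Record]:
    # Return a copy in which unknown user_ids are cleared; the input is left untouched.
    cleaned = deepcopy(records)
    for record in cleaned:
        for response in record.get("responses", []):
            if _is_unknown(response.get("user_id"), known_user_ids):
                response["user_id"] = None
    return cleaned


# Illustrative data only; the field and value names are made up for this example.
known = {UUID("11111111-1111-1111-1111-111111111111")}
records = [
    {
        "fields": {"text": "example"},
        "responses": [
            {"user_id": "00000000-0000-0000-0000-000000000001", "values": {"label": "joy"}},
            {"user_id": "11111111-1111-1111-1111-111111111111", "values": {"label": "anger"}},
        ],
    }
]
missing = find_unknown_user_ids(records, known)
cleaned = strip_unknown_user_ids(records, known)
```

If missing is non-empty, the alternatives are to create those users on the server first or to push the cleaned copy.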
Sources
- ValidationError When Logging Data
- src/argilla/server/contexts/datasets.py
- src/argilla/client/feedback/dataset/local/mixins.py
This will be tackled together with: https://github.com/argilla-io/argilla/issues/4328