[FEATURE] Controls for data schema for images when exporting datasets and records

Open burtenshaw opened this issue 5 months ago • 1 comment

Is your feature request related to a problem? Please describe.

When using Argilla responses in a downstream task like model training, only some of the information stored in Argilla is necessary, mainly the responses to questions.

Also, if Argilla datasets contain larger media formats like images, getting just these responses is cumbersome and time-consuming. Users might want to skip these fields, or get the original local file paths instead.

Describe the solution you'd like

  • A simple solution is to support with_fields=False in DatasetRecords, so that a user can iterate over only the responses and align them with the source dataset by record id.
  • A more advanced feature would let the user define a mapping between Argilla and a Hugging Face dataset, in the same way that DatasetRecord.log works, so that subcomponents of Argilla fields and questions could be assigned to specific dataset columns using dot notation.
  • For ImageField specifically, a record attribute could store other string representations of the image (URL, URI, file path), so that users can retrieve those instead of the PIL object.
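
To make the first two ideas concrete, here is a rough sketch in plain Python of what the export could look like from the user's side. It does not call Argilla at all: the record dicts, the helper names (`get_path`, `apply_mapping`, `join_by_id`) and the shape of the `mapping` argument are all hypothetical, and the nested layout is only an assumption about what a fields-free export might contain.

```python
from typing import Any, Dict, List

# Hypothetical records as they might look when exported without fields.
# This is NOT Argilla's actual export schema, just an illustrative shape.
records: List[Dict[str, Any]] = [
    {"id": "rec-1", "responses": {"label": [{"value": "cat", "user_id": "u1"}]}},
    {"id": "rec-2", "responses": {"label": [{"value": "dog", "user_id": "u1"}]}},
]

# The source dataset keeps the heavy media as local file paths.
source_rows: List[Dict[str, Any]] = [
    {"id": "rec-1", "image_path": "images/cat.png"},
    {"id": "rec-2", "image_path": "images/dog.png"},
    {"id": "rec-3", "image_path": "images/bird.png"},  # no response yet
]


def get_path(obj: Any, path: str) -> Any:
    """Resolve a dot-separated path such as 'responses.label.0.value'."""
    current = obj
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # numeric segments index into lists
        else:
            current = current[key]
    return current


def apply_mapping(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Flatten one record into dataset columns: {dot.path: column_name}."""
    return {column: get_path(record, path) for path, column in mapping.items()}


def join_by_id(
    source: List[Dict[str, Any]], responses: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Align flattened responses with source rows on the shared record id."""
    by_id = {row["id"]: row for row in responses}
    return [{**src, **by_id[src["id"]]} for src in source if src["id"] in by_id]


# Hypothetical mapping from nested record paths to flat column names.
mapping = {"id": "id", "responses.label.0.value": "label"}

flat = [apply_mapping(record, mapping) for record in records]
merged = join_by_id(source_rows, flat)
print(merged)
```

With something like this, the image payload never has to leave the server: the user pulls only ids and responses, then joins them back onto the file paths they already have locally. Source rows without a response are simply dropped by the join.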

Describe alternatives you've considered

The only current solution is to export everything with to_datasets and then drop or manipulate rows.

Additional context

burtenshaw · Sep 04 '24 10:09