argilla
[FEATURE] Controls for data schema for images when exporting datasets and records
Is your feature request related to a problem? Please describe.
When using Argilla responses in a downstream task like model training, only some of the information from Argilla is necessary, mainly the responses to questions.
Also, if Argilla datasets contain larger media formats like images, getting just these responses is cumbersome and time-consuming. Users might want to skip these fields or get the original local file paths instead.
Describe the solution you'd like
- A simple solution is to support `with_fields=False` in `DatasetRecords`, so that a user can iterate over only the responses and align them with the source dataset based on record id.
- A more advanced feature would allow the user to define a mapping between Argilla and a Hugging Face dataset, in the same way that `DatasetRecord.log` works, so that subcomponents of Argilla fields and questions could be assigned to specific dataset columns using dot notation.
- For `ImageField` specifically, a record attribute that holds other string representations of the image (URL, URI, file path) could be stored, so that users can retrieve those instead of the PIL object.
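
To make the first two proposals concrete, here is a minimal sketch in plain Python. It does not use the Argilla SDK: the record dictionaries, the `align_responses` and `get_path` helpers, and the `responses.label.value` path are all hypothetical stand-ins for what a response-only export and a dot-notation mapping might look like.

```python
from typing import Any, Iterable


def get_path(data: dict[str, Any], path: str) -> Any:
    """Resolve a dot-notation path such as 'responses.label.value' in nested dicts."""
    current: Any = data
    for key in path.split("."):
        current = current[key]
    return current


def align_responses(
    source_rows: Iterable[dict[str, Any]],
    records: Iterable[dict[str, Any]],
    mapping: dict[str, str],
    id_column: str = "id",
) -> list[dict[str, Any]]:
    """Join response-only records onto source rows by record id.

    `mapping` maps a dot-notation path inside a record to a target column name.
    """
    by_id = {record["id"]: record for record in records}
    aligned = []
    for row in source_rows:
        record = by_id.get(row[id_column])
        if record is None:
            continue  # this source row has no annotated record yet
        extra = {column: get_path(record, path) for path, column in mapping.items()}
        aligned.append({**row, **extra})
    return aligned


# Hypothetical source dataset that keeps local file paths instead of image objects.
source = [
    {"id": "rec-1", "image_path": "images/cat.png"},
    {"id": "rec-2", "image_path": "images/dog.png"},
]

# Hypothetical response-only records, i.e. what iterating without fields might yield.
records = [
    {"id": "rec-2", "responses": {"label": {"value": "dog"}}},
    {"id": "rec-1", "responses": {"label": {"value": "cat"}}},
]

aligned = align_responses(source, records, {"responses.label.value": "label"})
print(aligned)
```

Because the join is keyed on the record id, the images never need to be downloaded; only the small response payload is transferred and merged into the source rows.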
Describe alternatives you've considered
The only current solution is to export everything with `to_datasets` and then drop or manipulate rows.
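
The shape of that workaround can be sketched as follows. This is an illustration in plain Python rather than real SDK usage: `exported` stands in for the rows of a full export, and the column names (`image`, `label.responses`) and the `drop_columns` helper are assumptions made for the example.

```python
from typing import Any


def drop_columns(rows: list[dict[str, Any]], columns: list[str]) -> list[dict[str, Any]]:
    """Return copies of the rows without the unwanted columns."""
    unwanted = set(columns)
    return [{key: value for key, value in row.items() if key not in unwanted} for row in rows]


# Hypothetical result of a full export: the image payload is already in memory.
exported = [
    {"id": "rec-1", "image": b"<large image payload>", "label.responses": ["cat"]},
    {"id": "rec-2", "image": b"<large image payload>", "label.responses": ["dog"]},
]

# The image column can only be discarded after it has been fetched.
slim = drop_columns(exported, ["image"])
print(slim)
```

The filtering itself is trivial; the cost is that every image has to be exported before it can be thrown away, which is what the requested controls would avoid.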
Additional context