[FEATURE] add support for inferring `FeedbackDataset` structure in `from_huggingface` for transformer models

Open davidberenstein1957 opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe. I would like to focus on HF models.

Describe the solution you'd like https://huggingface.co/models has models categorized by task

import argilla as rg

rg.FeedbackDataset.from_huggingface("ProsusAI/finbert")

Internally, something like this should happen, but ideally we should avoid downloading the entire model and just use its config.

import argilla as rg
from transformers import pipeline

# Loading the pipeline downloads the full default model for the task
name = "sentiment-analysis"
pipe = pipeline(name)

# Build a text classification dataset from the model's label configuration
ds = rg.FeedbackDataset.for_text_classification(
    labels=list(pipe.model.config.id2label.values()),
    multi_label=pipe.model.config.problem_type == "multi_label_classification",
)
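The config-only idea mentioned above could look roughly like the sketch below. It is only an illustration, not an existing Argilla feature: the helper name `infer_text_classification_settings` is hypothetical, and the stand-in config uses illustrative values so the snippet runs without network access. In practice the config object could come from `transformers.AutoConfig.from_pretrained`, which fetches the model's configuration file rather than its weights.

```python
from types import SimpleNamespace


def infer_text_classification_settings(config):
    """Derive label names and the multi-label flag from a model config.

    `config` is any object exposing `id2label` (dict) and `problem_type`
    (str or None). This helper is hypothetical, not part of Argilla.
    """
    id2label = getattr(config, "id2label", None) or {}
    # Sort by id so the label order is stable; keys may be ints or numeric strings.
    labels = [label for _, label in sorted(id2label.items(), key=lambda kv: int(kv[0]))]
    multi_label = getattr(config, "problem_type", None) == "multi_label_classification"
    return labels, multi_label


# Stand-in config with illustrative values. With network access one could use:
#   from transformers import AutoConfig
#   config = AutoConfig.from_pretrained("ProsusAI/finbert")
# which retrieves only the configuration, not the model weights.
demo_config = SimpleNamespace(
    id2label={0: "positive", 1: "negative", 2: "neutral"},
    problem_type=None,
)
labels, multi_label = infer_text_classification_settings(demo_config)
print(labels, multi_label)
```

The resulting `labels` and `multi_label` values could then be passed to `rg.FeedbackDataset.for_text_classification(labels=labels, multi_label=multi_label)`, mirroring the pipeline-based snippet without loading the model itself.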

davidberenstein1957 • Oct 25 '23 03:10

Hi @davidberenstein1957, I think that re-using from_huggingface not just to load Argilla datasets dumped in the Hugging Face Hub, but also to load a configuration for any given model, can be confusing to users and also confusing internally code-wise, so if we go this route I think we need to discuss a proper way of doing it. Also, I assume the idea you propose is to re-label already labelled datasets? If you could elaborate more, e.g. in Notion, and share it with the team, that would be great!

alvarobartt • Oct 25 '23 06:10

Hi @alvarobartt, it is not something that is directly happening or was mentioned anywhere. However, I was just dreaming and thinking a bit, and given that we have gotten a lot of mentions that people don't understand how to use and configure the dataset, things like the task_templates could help with that. It is not meant to re-label a dataset but rather to easily configure datasets and link them to models. The reasoning is similar to that behind using a default embedding_model and text descriptions as metadata for datasets.

davidberenstein1957 • Oct 25 '23 06:10

I agree with @alvarobartt that from_huggingface might be confusing. I think this might be better placed in the task templates somehow, but we might also want to look at the bigger picture: associating Hub model IDs with datasets so they can be used in different parts of the product (retraining, inference, etc.).

dvsrepo • Oct 25 '23 07:10

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] • Jan 29 '24 01:01

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] • Apr 02 '24 01:04