huggingface_hub
Replace constrained strings (model/task names...) with enums.
A number of strings across the project are actually constrained in value (task names, model names, ...). Using enums instead of strings would (on the user side):
- constrain choice
- allow users to have rapid access to the complete available list of models/tasks,
- prevent stupid mistakes due to typos.
On the dev side, if some strings are hard-coded somewhere and need to be changed later on, it would be less of a hassle to manage with enums. (And enums have been in the standard library since Python 3.4, right?)
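A minimal sketch of what this could look like, using a string-valued enum (the class name, member names, and values here are illustrative only, not the actual hub taxonomy):

```python
from enum import Enum

# Hypothetical illustration: names and values are made up, not the real task list.
class Task(str, Enum):
    """Tasks as a string-valued enum: tab-completable, typo-proof, listable."""
    AUDIO_CLASSIFICATION = "audio-classification"
    TEXT_CLASSIFICATION = "text-classification"
    FEATURE_EXTRACTION = "feature-extraction"

# Users get the complete list programmatically...
all_tasks = [t.value for t in Task]

# ...and a typo raises immediately instead of failing later:
# Task("text-clasification")  -> ValueError
```

Because the enum subclasses `str`, members compare equal to their plain-string values, so existing string-based code keeps working.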
Wdyt?
+1 to this. I might even go as far as saying that it might be worth extracting the tasks out to their own Python package (whether it's in this repo or a different one) that can be used across our major libraries (e.g. Evaluate, Transformers, Datasets, huggingface_hub, ...) so that there's some synchronization on what tasks are available, what kind of data they need, their various associated metadata, etc. For example, I know that there are efforts over on Datasets to standardize inputs/outputs for the various tasks, which I think would be valuable to have centralized so that those outcomes can more easily be distributed to the other projects.
From an engineering perspective IMO it would make navigating our codebases much easier. If there's interest in this, I can see if it can be folded into the work that I'm already doing right now since it's kind of relevant.
(cc: @lvwerra, @osanseviero, @julien-c)
EDIT: I should add that I know we have Types.ts in hub-docs, but I feel like something like this should exist in Python to avoid manually keeping data formats synced, like in Evaluate:
Methods in this class assume a data format compatible with the [`TextClassificationPipeline`] - a single textual
feature as input and a categorical label as output.
FYI we have a GitHub Actions workflow https://github.com/huggingface/hub-docs/blob/main/.github/workflows/js-interfaces-tests.yml#L30-L34 (https://github.com/huggingface/hub-docs/pull/96) that exports the types automatically, so they can be copy-pasted into datasets; https://github.com/huggingface/datasets/pull/4066, for example, has the generated JSON (https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/tasks.json)
We could have it in Python, easily accessible in the context of the task; I would not create another library, though. huggingface_hub is the centralized library used across all other libraries (it's already a dependency of evaluate, transformers, datasets, etc), so we could have the task structure directly within huggingface_hub.
EDIT: I should add that I know we have Types.ts in hub-docs, but I feel like something like this should exist in Python to avoid manually keeping data formats synced, like in Evaluate
But this is not necessarily a task-sync problem; it is a sync problem between the evaluator object and the I/O of the transformers pipeline. One topic is task taxonomy, and the other is standardized I/O for different tasks.
Good point, I guess I saw the taxonomy and standardized I/O for the tasks as being related. Do you think the standardized I/O is something that could be put into huggingface_hub, to be accessed by Evaluate and the other libraries?
Yes, one could do some I/O standardization, but maintaining and scaling can become very challenging. Even for the Inference API of transformers, different things could happen for the same given task. Additionally, the Hub should avoid enforcing I/O mechanisms to the code in other libraries, which might complicate things. We have Inference API validation for every task at https://github.com/huggingface/api-inference-community/blob/main/api_inference_community/validation.py#L165-L192, but it's unclear to me if having this maintained somewhere in huggingface_hub makes sense.
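To give a sense of what per-task I/O validation involves, here is a minimal stdlib-only sketch (the task names, payload shapes, and function names are hypothetical; the actual validation module linked above is considerably more elaborate):

```python
from typing import Any, Callable, Dict

# Hypothetical per-task input validators; real ones cover many more fields and parameters.
def validate_text_classification(payload: Dict[str, Any]) -> None:
    if not isinstance(payload.get("inputs"), str):
        raise ValueError("text-classification expects 'inputs' to be a string")

def validate_question_answering(payload: Dict[str, Any]) -> None:
    inputs = payload.get("inputs")
    if not isinstance(inputs, dict) or not {"question", "context"} <= inputs.keys():
        raise ValueError("question-answering expects 'inputs' with 'question' and 'context'")

VALIDATORS: Dict[str, Callable[[Dict[str, Any]], None]] = {
    "text-classification": validate_text_classification,
    "question-answering": validate_question_answering,
}

def validate(task: str, payload: Dict[str, Any]) -> None:
    try:
        VALIDATORS[task](payload)
    except KeyError:
        raise ValueError(f"no validator registered for task {task!r}") from None
```

Even this toy version shows the maintenance concern: every new task, and every change to a task's accepted inputs, means touching this central registry.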
cc @lhoestq as well WDYT?
not sure what I/O standardization is, but i think maybe that's similar to what Gradio has here?
https://github.com/gradio-app/gradio/blob/5089053052464164a95b1da02a2ca1b0fda0ce60/gradio/external.py#L157-L317
FWIW we don't do any validation in datasets anymore for the task names, this is done on the Hub side.
IMO as some libraries are adding support for more and more tasks, it's maybe simpler to not have to wait for a new release of hfh every time. However I'm not against documenting something simpler to read than Types.ts
Regarding the original issue of using enums, I would expect a library built on top of hfh to have its own enum with the list of tasks it supports.
i think maybe that's similar to what Gradio has here?
@julien-c Sort of! But I'm thinking rather to have that all available OOP-style, so that we can just import it into the library that we're working on and have immediate access to all the pre/post-processing, the tasks themselves, the types for input + output, all with intellisense since it would be done in Python.
it's maybe simpler to not have to wait for a new release of hfh every time
@lhoestq Agreed! That's why I was thinking of a separate repo/library for it, decoupled from hfh.
Even for the Inference API of transformers, different things could happen for the same given task
@osanseviero Wouldn't that be something that having a stronger & centralized type system could help with? I might be missing something, because it seems difficult right now to keep track of the data formats for the tasks and to keep them consistent. Note that I'm specifically looking at Evaluate x Transformers right now though, so maybe things are different elsewhere and it might not really be much of an issue.
Hey there :wave: I haven't contributed much to the discussion as I did not have much input to provide here.
Given what has been said here maybe we could:
- Have enums similar to tasks.json that are generated automatically from the types.ts file. Something like:
```python
from enum import Enum

class InferenceTaskType(str, Enum):
    audio = "audio"
    nlp = "nlp"
    # (...)

class InferenceTask(str, Enum):
    audio_classification = "audio-classification"
    feature_extraction = "feature-extraction"
    # (...)

class InferenceSubtask(str, Enum):
    keyword_spotting = "keyword-spotting"
    dialogue_generation = "dialogue-generation"
    # (...)

INFERENCE_TASKS_MAPPING = {
    InferenceTask.conversational: {  # maybe this as a dataclass with "type" and "subtasks" attributes?
        "type": InferenceTaskType.nlp,
        "subtasks": [
            InferenceSubtask.dialogue_generation,
        ],
    },
}
```
- Each time types.ts is updated, it could automatically open a PR on huggingface_hub that we manually review and merge (at least at the beginning I wouldn't be a fan of pushing to main without notice).
- By using a "string-enum" (see stackoverflow), any use of the enum could be replaced by a string value if necessary (e.g. InferenceTask.audio_classification == "audio-classification"). This means that even if the list of tasks evolves over time, it is possible to use "audio-classification" directly in a downstream library before InferenceTask.audio_classification is officially supported and released in huggingface_hub. This is sub-optimal, as one would still not be able to list the complete set of tasks from an old version of hfh, but it can mitigate some limitations. I think the usefulness of such a taxonomy mostly depends on how frequently the list is expected to change.
- This proposition only addresses the taxonomy problem. For the I/O standardization I think it requires more discussion (not sure we are all on the same page on that).
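A quick illustration of the "string-enum" behavior described above (using a minimal stand-in enum, not the actual taxonomy):

```python
from enum import Enum

# Stand-in for the proposed taxonomy; names and values are illustrative only.
class InferenceTask(str, Enum):
    audio_classification = "audio-classification"
    feature_extraction = "feature-extraction"

# Because the enum subclasses str, members compare equal to plain strings:
assert InferenceTask.audio_classification == "audio-classification"

# So downstream code can accept either form interchangeably:
def run(task: str) -> bool:
    # string comparison works for both plain strings and enum members
    return task == "feature-extraction"

assert run("feature-extraction") and run(InferenceTask.feature_extraction)
```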
WDYT ? Feedback is very welcomed :)
For transformers I would use https://huggingface.co/docs/transformers/main_classes/pipelines and https://huggingface.co/docs/api-inference/detailed_parameters
Although point 1 sounds nice, for me it's very unclear what the use of providing the task taxonomy through the huggingface_hub library would be. The only use case I see is programmatic model card creation, but I think using strings directly would be more straightforward and avoid complexity. No strong opinions, though; this has not been a widely requested feature, and having to update huggingface_hub as the taxonomy changes might add complexity.
As for the I/O, I agree with
IMO as some libraries are adding support for more and more tasks, it's maybe simpler to not have to wait for a new release of hfh every time.
I understand the intentions behind
wouldn't that be something that having a stronger & centralized type system could help with?
but there is a lot of complexity behind this. First of all, even for transformers, apart from the main inputs, a bunch of parameters can be passed. E.g. do we support temperature, top_k, etc.? Some of these parameters might not make sense for other libraries. This could also quickly explode into a hard-to-maintain burden as decisions change. For example, deciding the input/output for image-segmentation took months of discussions, and as we aim for more complex tasks (e.g. 3D object classification), this will add layers of complexity.
Hi, for the whole I/O thing, I want to mention we have 1 document at least for pipelines: https://www.notion.so/huggingface2/Pipelines-Guidelines-1de828d4e56e4adb886253440657e13a
(Just so that we can try and share some definitions amongst ourselves)
Part of the complexity in transformers, at least, is the fact that we never made breaking changes, despite some inconsistencies between outputs (which made maintaining an OpenAPI spec super hard, for instance, since nothing was exactly streamlined/consistent). I think any spec will have to deal with those. If/when a v5 comes, that'll be the first thing I propose to change; parameters will still be complex, but at least we could have some nice spec with no parameters.
As for the original proposal.
I like enums from a maintainer's perspective, but quite dislike them as a user. I hate being "forced" to import some enum class to pass to some function; it just feels not very Pythonic. (And I say this as a great admirer of the power of Rust enums!)
So as long as users never have to care about those enums, and they are just a means for maintainers to help with providing good errors, I'm 100% onboard.
```python
use(task="audio-classification")
```

is much simpler than

```python
from hfh import Task  # Is this even where the enum is located? It's not necessarily clear from snippets.

use(task=Task.audio_classification)
```
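One way to square the two, sketched under the assumption of a string-valued enum (`use`, `Task`, and the task names here are all hypothetical):

```python
from enum import Enum
from typing import Union

# Hypothetical taxonomy enum; names and values are illustrative only.
class Task(str, Enum):
    audio_classification = "audio-classification"
    text_classification = "text-classification"

def use(task: Union[str, Task]) -> str:
    """Accept plain strings; use the enum internally for validation and good errors."""
    try:
        task = Task(task)  # str-valued enum: lookup by value accepts "audio-classification"
    except ValueError:
        valid = ", ".join(t.value for t in Task)
        raise ValueError(f"Unknown task {task!r}. Expected one of: {valid}") from None
    return task.value

# Users never need to import the enum:
assert use("audio-classification") == "audio-classification"
# ...but a typo still produces a clear error listing the valid tasks:
# use("audio-clasification") -> ValueError
```

With this pattern the enum stays an internal tool for maintainers, and public snippets keep reading `use(task="audio-classification")`.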
No strong opinions though, but this has not been a widely requested feature and having huggingface_hub having to be updated as the taxonomy changes might add complexity.
I'm definitely not pushing to have this taxonomy if it's not requested :grin: If no one sees a great benefit from it, no need to push for it. @clefourrier @NimaBoscarino could you just confirm that this taxonomy is not really what you were talking about, but more the I/O stuff? Or if you think it's helpful, what would be the use case? :)
I don't have a strong opinion personally (on the actual need for having the equivalent of Types.ts in Python)
FYI we have a GitHub Actions workflow https://github.com/huggingface/hub-docs/blob/main/.github/workflows/js-interfaces-tests.yml#L30-L34 (huggingface/hub-docs#96) that exports the types automatically, so they can be copy-pasted into datasets; huggingface/datasets#4066, for example, has the generated JSON (https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/tasks.json)
FWIW datasets is almost not using this synced tasks.json anymore – is this correct @lhoestq?
Just using it maybe for https://huggingface.co/spaces/huggingface/datasets-tagging but that's it
We could have it in Python, easily accessible in the context of the task; I would not create another library, though. huggingface_hub is the centralized library used across all other libraries (it's already a dependency of evaluate, transformers, datasets, etc), so we could have the task structure directly within huggingface_hub.
we could also maintain the list in Python (in huggingface_hub) and auto-convert to Typescript (in hub-docs's Types.ts), but i'm not sure if that'd be very useful
could you just confirm that this taxonomy is not really what you were talking about but more the I/O stuff?
Yeah for me it was more about the I/O stuff, but @osanseviero makes good points re: the complexity, and the doc @Narsil shared helps me see that a bit too. I guess this would be too big of a lift with not enough payoff, and in my scenario maybe this kind of I/O standardization can just exist inside Evaluate if it's needed at all.
Completely missed this conversation! (better late than never I guess? ^^")
Yes, my point was mostly wrt I/O, to constrain user selections - I remember, for example, having seen task names all along the spectrum of a_to_b, a2b, and a_b, and I felt this multiplication of names would make it harder for users to find what they were looking for. (I cannot find any trace of this in the code now.)
Thanks to everybody who contributed their input - I get the added layers of complexity it would induce to add this, especially wrt the other libs. I'll close this within the week, unless someone feels it should be pushed further!