datasets Domain specific dataset discovery on the Hugging Face hub

Is your feature request related to a problem? Please describe.

The problem

The datasets hub currently has 8,239 datasets. These datasets span a wide range of different modalities and tasks (currently with a bias towards textual data).

There are various ways of identifying datasets that may be relevant for a particular use case:

- searching
- various filters

Currently, however, there isn't an easy way to identify datasets belonging to a specific domain. For example, I want to browse machine learning datasets related to 'social science' or 'climate change research'.

The ability to identify datasets relating to a specific domain has come up in discussions around the BigLAM datasets hackathon https://github.com/bigscience-workshop/lam/discussions/31#discussioncomment-3123610. As part of the hackathon, we're currently collecting datasets related to Libraries, Archives and Museums and making them available via the hub. We currently do this under a Hugging Face organization (https://huggingface.co/biglam). However, going forward, I can see some of these datasets being migrated to sit under an organization that is the custodian of the dataset (for example, a national library the data was originally from). At that point, it becomes more difficult to quickly identify datasets from this domain without relying on search.

This is also related to some existing GitHub issues about metadata on the hub:

https://github.com/huggingface/datasets/issues/3625
https://github.com/huggingface/datasets/issues/3877

Describe the solution you'd like

Some possible solutions that may help with this:

Enable domain tags (from a controlled vocabulary)

This would add a metadata field to the YAML for the domain a dataset relates to.
Advantages:
- the list is controlled, allowing it to be more easily integrated into the datasets tag app (https://huggingface.co/space/huggingface/datasets-tagging)
- the controlled vocabulary could align with an existing controlled vocabulary
- this additional metadata can be used to perform filtering by domain
Disadvantages:
- choosing the best controlled vocab may be difficult
- there are many datasets that are likely to fit into the 'machine learning' domain (i.e. there is a long tail of datasets that aren't in a more 'generic' machine learning domain)
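
A minimal sketch of what checking a domain field against a controlled vocabulary might look like. The `domain` key and the vocabulary entries are hypothetical, made up for illustration; neither is an existing Hub metadata field or an agreed list.

```python
# Hypothetical controlled vocabulary; the real list would need to be agreed on.
ALLOWED_DOMAINS = {"social-science", "climate-change", "glam"}


def validate_domains(metadata: dict) -> list:
    """Return the domain values in `metadata` that are not in the vocabulary.

    `metadata` stands in for the parsed YAML block of a dataset card; the
    "domain" key is an assumption made for this sketch.
    """
    domains = metadata.get("domain", [])
    if isinstance(domains, str):  # allow a single value as well as a list
        domains = [domains]
    return [d for d in domains if d not in ALLOWED_DOMAINS]
```

With this sketch, `validate_domains({"domain": ["glam", "astrology"]})` would flag only `"astrology"`, which is the kind of check a tagging app could run before accepting a card.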

Enable topic tags (user-generated)

Enable 'free form' topic tags for datasets and models. This would be closer to GitHub's repository topics, which can be chosen from a controlled list (https://github.com/topics/) but can also be more user/org specific. This could also be useful for organizations managing their own models and datasets as the number they hold in their org grows. For example, they may create 'topic tags' for a specific project, so it's clearer which datasets/models are related to that project.

Collections

This solution would likely be the biggest shift and may require significant changes in the hub frontend. Collections could work in several different ways but would include:

Users can curate particular datasets, models, spaces, etc., into a collection. For example, they may create a collection of 'historic newspapers suitable for training language models'. These collections would not be mutually exclusive, i.e. a dataset can belong to zero, one or many collections. Collections can also potentially be nested under other collections.

This is fairly common on other data repositories. For example, the collections shown in the screenshot (image: Screenshot 2022-07-18 at 11 50 44) all belong under a higher-level collection (https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3?locale=en).
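
As a rough sketch of the membership model described here (non-exclusive, optionally nested), assuming a hypothetical `Collection` type that is not part of any existing Hub API:

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical collection: a named set of repo ids plus child collections."""

    name: str
    items: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def all_items(self) -> set:
        """Items in this collection and, recursively, in nested collections."""
        found = set(self.items)
        for child in self.children:
            found |= child.all_items()
        return found


def collections_containing(repo_id: str, collections: list) -> list:
    """Names of the given collections that contain `repo_id`, directly or nested."""
    return [c.name for c in collections if repo_id in c.all_items()]
```

Because membership is stored on the collection rather than on the dataset, a dataset can appear in zero, one or many collections without any change to its own metadata.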

There are different models one could use for how these collections could be created:

- only within an org
- for any dataset/model
- the owner of a dataset/model has to agree to it being added to a collection
- a collection owner can have people suggest additions to their collection
- other models....

These collections could be thematic, related to particular training approaches, curate models with particular inference properties, etc. Whilst some of these features may duplicate current or future tag filters on the hub, they offer the advantage of being flexible and not having to predict upfront what users will want to do.

There is also potential for automating the creation of these collections based on existing metadata. For example, one could collect the models trained on a collection of datasets: if we had a collection of 'historic newspapers suitable for training language models' that contained 30 datasets, we could create another collection, 'historic newspaper language models', that takes any model on the hub whose metadata says it used one or more of those 30 datasets.
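
The derived-collection idea can be sketched in a few lines. This assumes a plain mapping from model id to the dataset ids named in its metadata; how that mapping would be gathered from the hub is left out, and the function name is hypothetical.

```python
def derive_model_collection(dataset_ids: set, model_datasets: dict) -> list:
    """Return ids of models whose metadata lists at least one dataset in `dataset_ids`.

    `model_datasets` maps a model id to the list of dataset ids named in its
    metadata (an assumed input format for this sketch).
    """
    return sorted(
        model_id
        for model_id, used in model_datasets.items()
        if dataset_ids.intersection(used)
    )
```

A model qualifies as soon as it shares one dataset with the source collection, matching the "one or more" rule in the example.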

There is also the option of exploring ML approaches to suggest models/datasets that may be relevant to a particular collection.

This approach is likely to be quite difficult to implement well and would require significant thought. There is also likely to be a benefit in doing quite a bit of upfront work in curating useful collections to demonstrate the benefits of collections.

Describe alternatives you've considered

It is possible to collate this information externally, i.e. one could link back to the relevant models/datasets from an external platform.

Additional context

I'm cc'ing others involved in the BigLAM hackathon who may also have thoughts @cakiki @clancyoftheoverflow @albertvillanova

Jul 18 '22 11:07 davanstrien

Hi! I added a link to this issue in our internal request for adding keywords/topics to the Hub, which is identical to the topic tags solution. The collections solution seems too complex (as you point out). Regarding the domain tags solution, we primarily focus on machine learning, so I'm not sure if it's a good idea to make our current taxonomy more complex.

Jul 18 '22 12:07 mariosasko

> Hi! I added a link to this issue in our internal request for adding keywords/topics to the Hub, which is identical to the topic tags solution. The collections solution seems too complex (as you point out). Regarding the domain tags solution, we primarily focus on machine learning, so I'm not sure if it's a good idea to make our current taxonomy more complex.

Thanks for letting me know. Will you allow the topic tags to be user-generated or only chosen from a list?

Jul 18 '22 13:07 davanstrien

Thanks for opening this issue @davanstrien.

As we discussed last week, the tag approach would in principle be the simplest to implement: either the domain tag (with a closed vocabulary: more reliable but also more rigid) or the topic tag (with an open vocabulary: more flexible for user needs).

Jul 18 '22 14:07 albertvillanova

Hi @davanstrien If i remember correctly this was also discussed inside a hf.co Discussion, would you be able to link it here too?

(where i suggested using tags: - foo - bar IIRC).

Thanks a ton!

Jul 19 '22 10:07 julien-c

> Hi @davanstrien If i remember correctly this was also discussed inside a hf.co Discussion, would you be able to link it here too?
>
> (where i suggested using tags: - foo - bar IIRC).
>
> Thanks a ton!

This doesn't ring a bell - I did a quick search of https://discuss.huggingface.co but didn't find anything.

The tags: approach sounds like a good option for this. It would be especially nice if these could suggest existing tags, but this probably won't be easily possible through the current interface.

Jul 19 '22 12:07 davanstrien

I opened a PR to add "tags" to the YAML validator: https://github.com/huggingface/datasets/pull/4716

I also added "tags" to the tagging app, with suggestions like "bio" or "newspapers"

Jul 19 '22 12:07 lhoestq

Thanks @lhoestq for the initiative.

Just one question: are "tags" already supported on the Hub?

I think they aren't. Thus, the Hub should support them so that they are properly displayed.

Jul 19 '22 12:07 albertvillanova

I think they're not displayed, but at least it should enable users to filter by tag using huggingface_hub or using the appropriate query params on the website (not sure if it's possible yet though)

Jul 19 '22 13:07 lhoestq

> I think they're not displayed, but at least it should enable users to filter by tag using huggingface_hub or using the appropriate query params on the website (not sure if it's possible yet though)

I think this would already be a helpful start. I'm happy to try this out with the datasets added to https://huggingface.co/organizations/biglam and use the huggingface_hub to filter those datasets using the tags.

Jul 19 '22 15:07 davanstrien
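
For reference, a hedged sketch of what filtering by tag with huggingface_hub could look like. It assumes free-form tags are indexed by the Hub and reachable through the `filter` argument of `HfApi.list_datasets`, which the thread itself is unsure about; the helper names are hypothetical, and the tag "newspapers" is taken from the suggestions mentioned above.

```python
from typing import Iterable, Optional


def has_tags(repo_tags: Iterable[str], wanted: Iterable[str]) -> bool:
    """True if every tag in `wanted` appears in `repo_tags` (local check only)."""
    return set(wanted).issubset(repo_tags)


def datasets_with_tag(tag: str, author: Optional[str] = None) -> list:
    """List ids of Hub datasets carrying `tag`, optionally within one org.

    Needs network access and the huggingface_hub package, which is imported
    here so that `has_tags` can be used without it.
    """
    from huggingface_hub import HfApi

    return [info.id for info in HfApi().list_datasets(filter=tag, author=author)]
```

Something like `datasets_with_tag("newspapers", author="biglam")` would then be the call to try against the biglam organization; whether it returns anything depends on the Hub supporting these tags.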

Is this abandoned? I'm looking for a transport logistics dataset; how can I find one?

Feb 10 '24 21:02 younes-io

@younes-io Full text search is probably your best bet: https://huggingface.co/search/full-text?type=dataset

Feb 12 '24 09:02 julien-c
