modelcards
Support for datacards/datasheets
Thanks for this library -- I've just started playing with this, and it looks like it is going to be super useful :)
Are there any plans for also supporting the creation of datacards/datasheets in this library?
I think this could be quite useful for a few use cases. In particular, being able to template out some standard information might be useful for organizations that want to standardize what goes into a datacard. For example, in https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata, we may want to be able to pass in a list of names or ORCID iDs to go under https://huggingface.co/datasets/BritishLibraryLabs/EThOS-PhD-metadata#dataset-curators.
This could end up looking something like:
datacard = DataCard.from_template(
    card_data=DataCardData(  # Card metadata object that will be converted to YAML block
        license='mit',
        tags=['image-classification'],
        ...
    ),
    template_path='my_data_template.md',  # The template we just wrote!
    dataset_id='cool-model',  # Jinja template kwarg
    external_url='data.bl....',  # Jinja template kwarg
    curators=['name1', 'name2'],  # Jinja template kwarg
)
I think this could also be useful for organizations/users using the hub to store data that is actively being developed/annotated. They could then use this feature to automagically create some key stats about the dataset (e.g. number of instances, label frequency breakdowns, annotator agreement scores) and keep that documentation in sync with a changing dataset. I had planned to add something like this to https://github.com/davanstrien/hugit-cli/ but would rather piggyback on something else!
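To make the "key stats" idea concrete, here is a minimal sketch of how such values could be computed before being passed to a card template. It uses only the standard library; the function names and the choice of Cohen's kappa as the agreement score are assumptions for illustration, not part of this library.

```python
from collections import Counter


def label_frequencies(labels):
    """Return {label: (count, fraction)} ordered from most to least common."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.most_common()}


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def dataset_stats(labels_a, labels_b):
    """Collect the values a card template might want as keyword arguments."""
    return {
        "num_instances": len(labels_a),
        "label_frequencies": label_frequencies(labels_a),
        "annotator_agreement": cohen_kappa(labels_a, labels_b),
    }
```

The resulting dictionary could then be unpacked as template kwargs each time the dataset changes.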
Absolutely! This was why I originally set up ModelCard to inherit from RepoCard. RepoCard currently inits a CardData object though, which is specific to models (which isn't really right). It would be great if we:
- added this feature as well as a default dataset card using the one here.
- figured out a better way of instantiating the card data using the correct object (CardData/DataCardData).
- Also perhaps worth renaming CardData -> ModelCardData to avoid confusion?
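One possible shape for the points above, sketched under assumptions: each card subclass declares which metadata class it uses, and the base class instantiates whatever its subclass declared. The attribute name card_data_class, the DataCard class, and the fields shown are hypothetical and do not reflect the library's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelCardData:
    """Hypothetical model metadata (the proposed rename of CardData)."""
    license: Optional[str] = None
    tags: List[str] = field(default_factory=list)


@dataclass
class DataCardData:
    """Hypothetical dataset metadata; fields here are placeholders."""
    license: Optional[str] = None
    tags: List[str] = field(default_factory=list)


class RepoCard:
    # Subclasses override this so the base class never hard-codes
    # a model-specific metadata type.
    card_data_class = None

    def __init__(self, metadata: dict, text: str):
        if self.card_data_class is None:
            raise TypeError("use ModelCard or DataCard, not RepoCard directly")
        self.data = self.card_data_class(**metadata)
        self.text = text


class ModelCard(RepoCard):
    card_data_class = ModelCardData


class DataCard(RepoCard):
    card_data_class = DataCardData
```

With this layout, DataCard({"license": "mit"}, "...") would carry a DataCardData instance while ModelCard keeps its own type, without either subclass overriding __init__.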
As for this:
> They could then use this feature to automagically create some key stats about the dataset i.e. number of instances, label frequency breakdowns, annotator agreement scores etc. and keep that documentation in sync with a changing dataset
Right now, once the card is written, it's just text, so there's no way of automagically updating the card's text itself without recreating the card. Recreating the card would be easy though, as you'd just pass the updated values to the from_template function again and re-push the new card to overwrite the old one. Does that work for you?
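The "recreate and overwrite" workflow can be illustrated with a small standalone sketch. It uses string.Template from the standard library as a stand-in for the Jinja templates the library works with, and the template text and the render_card helper are made up for the example.

```python
from string import Template

# Stand-in for a Jinja card template; the placeholders are illustrative.
CARD_TEMPLATE = Template(
    "# Dataset card for $dataset_id\n"
    "\n"
    "Number of instances: $num_instances\n"
    "Curators: $curators\n"
)


def render_card(dataset_id, num_instances, curators):
    """Render the full card text from scratch using the current values."""
    return CARD_TEMPLATE.substitute(
        dataset_id=dataset_id,
        num_instances=num_instances,
        curators=", ".join(curators),
    )


# First version of the card.
card_v1 = render_card("cool-model", 100, ["name1", "name2"])

# The dataset grew: nothing is patched in place, the card is simply
# rendered again and the new text would replace the old one on push.
card_v2 = render_card("cool-model", 120, ["name1", "name2"])
```

Because the card is regenerated from the same template each time, only the values change between versions and the rest of the document stays consistent.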
CC: @lhoestq @mariosasko - This might be nice for folks who are creating their own datasets programmatically.
Small update here - currently, you can abuse ModelCard.from_template and CardData to upload data sheets/data cards.
I'm doing that here. None of the fields in CardData are required, so you can just pass whatever you want in the yaml header data. When pushing, just make sure to supply repo_type="dataset" and it'll validate the yaml you create against the dataset YAML block schema (which actually doesn't have any required fields... it just requires that the YAML block isn't empty).
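A rough sketch of that workaround follows. It assumes the modelcards package is installed, that the card object exposes a push_to_hub method accepting repo_type, and that you are authenticated with the Hub; the helper name push_dataset_card and the metadata values are placeholders, so treat this as an outline rather than a verified recipe.

```python
def push_dataset_card(repo_id, template_path, **template_kwargs):
    """Build a card with the model-card classes and push it to a dataset repo.

    Hypothetical helper: repo_id, template_path and the template kwargs
    are supplied by the caller.
    """
    # Imported lazily so this sketch can be loaded without the library.
    from modelcards import CardData, ModelCard

    card = ModelCard.from_template(
        # No CardData field is required, so any YAML header data can go here.
        card_data=CardData(license="mit", tags=["image-classification"]),
        template_path=template_path,
        **template_kwargs,  # forwarded to the Jinja template
    )
    # Assumed call: repo_type="dataset" makes the YAML block get validated
    # against the dataset schema instead of the model schema.
    card.push_to_hub(repo_id, repo_type="dataset")
    return card
```

A call might look like push_dataset_card("user/my-dataset", "my_data_template.md", curators=["name1", "name2"]), where the repo id is a placeholder.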
Thanks! I was planning to play around with this a bit more tomorrow -- I'll let you know how I get on.