
Add ability to pull information from Transformers docs

Open NielsRogge opened this issue 2 years ago • 2 comments

Hi,

This library looks great already. It would be awesome if we could leverage it when adding new models to the Transformers library. For now, model cards are created manually (after pushing the checkpoints to the hub).

We typically duplicate things from the documentation into the model card (like the abstract of the paper, a code snippet showcasing basic usage, the URL of the paper, a short description of the model, etc.).

For instance, see the ViT docs, whose content is then reused in the model card of a ViT checkpoint.

So ideally, we could use this library in the conversion scripts, where we would also add the model card when pushing the model to the hub.

NielsRogge, Jun 09 '22 09:06

This is what I was thinking in terms of integration with other libraries. There should be some way of dispatching the extraction of extra information to the other library itself, rather than having that logic inside this library.

We can think of it as a method in the transformers library which passes a custom template to this library, and a function which would pre-fill some of the fields using the pre-existing information. This can be scaled to other libraries as well and they can host that logic internally.
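As a rough sketch of that split (not the actual API of either library), the host library could own both the template and the pre-fill function, and hand only the result to the card library. Every name below (`render_card`, `prefill_from_docs`, `create_model_card`, `TRANSFORMERS_TEMPLATE`) is hypothetical, and `render_card` is only a dependency-free stand-in for Jinja-style variable substitution.

```python
import re
from typing import Callable, Dict


def render_card(template: str, fields: Dict[str, str]) -> str:
    """Stand-in for the card library: fill {{ name }} placeholders in a template.

    This is NOT the modelcards API, only an imitation of Jinja variables.
    """
    def substitute(match):
        # Leave unknown placeholders untouched so missing fields stay visible.
        return fields.get(match.group(1), match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Everything below would live in the integrating library (hypothetical names).
TRANSFORMERS_TEMPLATE = (
    "# {{ model_name }}\n\n{{ description }}\n\nPaper: {{ url_of_paper }}\n"
)


def prefill_from_docs(model_name: str) -> Dict[str, str]:
    """Pre-fill the fields the host library already knows about."""
    # A real version would pull these values from the library's own docs.
    return {
        "model_name": model_name,
        "description": f"Placeholder description for {model_name}.",
    }


def create_model_card(
    model_name: str,
    prefill: Callable[[str], Dict[str, str]] = prefill_from_docs,
) -> str:
    """Combine the host library's template with its pre-filled fields."""
    return render_card(TRANSFORMERS_TEMPLATE, prefill(model_name))


card_text = create_model_card("example-model")
print(card_text)
```

With this shape, another library could reuse the same card library by supplying its own template and pre-fill function, and fields it cannot fill (here `url_of_paper`) remain visible as unfilled placeholders.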

adrinjalali, Jun 09 '22 09:06

Yea this should be possible, and is exactly the kind of thing this utility is meant to help do.

To your points:

  • Abstract: Looks easy enough to get if this structure is consistent across docs pages.
  • Description: Looks easy enough...just grab everything up until the abstract header.
  • URL of paper: 50/50 on this...could get tricky if there's more than one link in the description.
  • BibTeX (mentioned offline): I couldn't find this one in the docs, so I don't think it's possible?

Either way, a template could be written and used to do this.

from modelcards import CardData, ModelCard

docs_path = 'vit_docs.md'

# The scrape_for_* helpers are placeholders; define their logic with regex
abstract = scrape_for_abstract(docs_path)
description = scrape_for_description(docs_path)
url_of_paper = scrape_for_paper_link(docs_path)

template_kwargs = dict(
    abstract=abstract,
    description=description,
    url_of_paper=url_of_paper,
)

# We would have to fill in the relevant metadata here, some of which we won't be able to scrape
card_data = CardData(...)

card = ModelCard.from_template(
    card_data,
    template_path='transformers_template.md',
    **template_kwargs
)

Here, 'transformers_template.md' is a template we put together that has Jinja variables with the same names as the keys in template_kwargs.
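The `scrape_for_*` helpers in the snippet are left undefined, so here is a minimal regex sketch of what they might look like. It assumes a docs page with a `## Overview` header followed by a line reading "The abstract from the paper is the following:" and an italicized abstract. That layout, the sample text, and the example.org URL are made up for illustration. To stay self-contained, the helpers take the page text rather than a path.

```python
import re
from typing import Optional

# Made-up docs page; the layout is an assumption, not a guaranteed docs format.
SAMPLE_DOCS = """# Example Model

## Overview

The Example Model was proposed in [An Example Paper](https://example.org/paper) by A. Author.
It is a placeholder description.

The abstract from the paper is the following:

*This is a placeholder abstract
spanning two lines.*

## Usage
"""

ABSTRACT_MARKER = "The abstract from the paper is the following:"


def scrape_for_abstract(docs_text: str) -> Optional[str]:
    """Return the italic block that follows the abstract marker, if any."""
    pattern = re.escape(ABSTRACT_MARKER) + r"\s*\*(.+?)\*"
    match = re.search(pattern, docs_text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks inside the abstract into single spaces.
    return " ".join(match.group(1).split())


def scrape_for_description(docs_text: str) -> Optional[str]:
    """Return the text between the Overview header and the abstract marker."""
    pattern = r"^## Overview\s*\n(.*?)" + re.escape(ABSTRACT_MARKER)
    match = re.search(pattern, docs_text, flags=re.DOTALL | re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip()


def scrape_for_paper_link(docs_text: str) -> Optional[str]:
    """Return the URL of the first markdown link in the description, if any."""
    description = scrape_for_description(docs_text)
    if description is None:
        return None
    match = re.search(r"\[[^\]]+\]\((https?://[^)\s]+)\)", description)
    return match.group(1) if match else None


print(scrape_for_abstract(SAMPLE_DOCS))
print(scrape_for_paper_link(SAMPLE_DOCS))
```

Taking the first link in the description is the simplest rule, and it is exactly where the "50/50" caveat applies: a page whose description links to something else before the paper would need a stricter pattern.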

We can definitely give this a go once we're really happy with the content in the default model card template, as we'll be able to draw suggested sections from there.

nateraw, Jun 09 '22 19:06