
datasets: add revision support for all Universal Dependencies classes

Open • stefan-it opened this issue 2 years ago • 1 comment

Hi,

this PR adds a revision parameter to all Universal Dependencies classes.

This makes it possible to pin a dataset to a specific revision, e.g. a particular commit, for reproducibility.
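As a rough sketch of the idea, assuming the revision is simply substituted into the raw GitHub URL of the treebank repository: the helper `ud_file_url` and its parameters are hypothetical and not part of Flair, and the commit hash is a placeholder.

```python
def ud_file_url(repo: str, filename: str, revision: str = "master") -> str:
    """Build a raw GitHub URL for a UD treebank file at a given revision.

    Hypothetical helper for illustration only; not part of Flair's API.
    `revision` may be a branch name, a tag, or a commit hash.
    """
    base = "https://raw.githubusercontent.com/UniversalDependencies"
    return f"{base}/{repo}/{revision}/{filename}"


# Default: follow the moving "master" branch, so the data may change over time.
latest = ud_file_url("UD_English-EWT", "en_ewt-ud-train.conllu")

# Pinned: "0123abc" is a placeholder, not a real commit.
pinned = ud_file_url("UD_English-EWT", "en_ewt-ud-train.conllu", revision="0123abc")

print(latest)
print(pinned)
```

With the parameter from this PR, the corresponding call in Flair would presumably look like `flair.datasets.UD_ENGLISH(revision="<commit>")`.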

stefan-it • Mar 11 '24 13:03

I also tested all UD_* classes with the following script:

import flair

ud_treebanks = [
    flair.datasets.UD_ENGLISH(in_memory = False),
    flair.datasets.UD_GALICIAN(in_memory = False),
    flair.datasets.UD_ANCIENT_GREEK(in_memory = False),
    flair.datasets.UD_KAZAKH(in_memory = False),
    flair.datasets.UD_OLD_CHURCH_SLAVONIC(in_memory = False),
    flair.datasets.UD_ARMENIAN(in_memory = False),
    flair.datasets.UD_ESTONIAN(in_memory = False),
    flair.datasets.UD_GERMAN(in_memory = False),
    flair.datasets.UD_GERMAN_HDT(in_memory = False),
    flair.datasets.UD_DUTCH(in_memory = False),
    flair.datasets.UD_FAROESE(in_memory = False),
    flair.datasets.UD_FRENCH(in_memory = False),
    flair.datasets.UD_ITALIAN(in_memory = False),
    flair.datasets.UD_LATIN(in_memory = False),
    flair.datasets.UD_SPANISH(in_memory = False),
    flair.datasets.UD_PORTUGUESE(in_memory = False),
    flair.datasets.UD_ROMANIAN(in_memory = False),
    flair.datasets.UD_CATALAN(in_memory = False),
    flair.datasets.UD_POLISH(in_memory = False),
    flair.datasets.UD_CZECH(in_memory = False),
    flair.datasets.UD_SLOVAK(in_memory = False),
    flair.datasets.UD_SWEDISH(in_memory = False),
    flair.datasets.UD_DANISH(in_memory = False),
    flair.datasets.UD_NORWEGIAN(in_memory = False),
    flair.datasets.UD_FINNISH(in_memory = False),
    flair.datasets.UD_SLOVENIAN(in_memory = False),
    flair.datasets.UD_CROATIAN(in_memory = False),
    flair.datasets.UD_SERBIAN(in_memory = False),
    flair.datasets.UD_BULGARIAN(in_memory = False),
    flair.datasets.UD_ARABIC(in_memory = False),
    flair.datasets.UD_HEBREW(in_memory = False),
    flair.datasets.UD_TURKISH(in_memory = False),
    flair.datasets.UD_UKRAINIAN(in_memory = False),
    flair.datasets.UD_PERSIAN(in_memory = False),
    flair.datasets.UD_RUSSIAN(in_memory = False),
    flair.datasets.UD_HINDI(in_memory = False),
    flair.datasets.UD_INDONESIAN(in_memory = False),
    flair.datasets.UD_JAPANESE(in_memory = False),
    flair.datasets.UD_CHINESE(in_memory = False),
    flair.datasets.UD_KOREAN(in_memory = False),
    flair.datasets.UD_BASQUE(in_memory = False),
    flair.datasets.UD_CHINESE_KYOTO(in_memory = False),
    flair.datasets.UD_GREEK(in_memory = False),
    flair.datasets.UD_NAIJA(in_memory = False),
    flair.datasets.UD_LIVVI(in_memory = False),
    flair.datasets.UD_BURYAT(in_memory = False),
    flair.datasets.UD_NORTH_SAMI(in_memory = False),
    flair.datasets.UD_MARATHI(in_memory = False),
    flair.datasets.UD_MALTESE(in_memory = False),
    flair.datasets.UD_AFRIKAANS(in_memory = False),
    flair.datasets.UD_GOTHIC(in_memory = False),
    flair.datasets.UD_OLD_FRENCH(in_memory = False),
    flair.datasets.UD_WOLOF(in_memory = False),
    flair.datasets.UD_BELARUSIAN(in_memory = False),
    flair.datasets.UD_COPTIC(in_memory = False),
    flair.datasets.UD_IRISH(in_memory = False),
    flair.datasets.UD_LATVIAN(in_memory = False),
    flair.datasets.UD_LITHUANIAN(in_memory = False),
]

For UD_CZECH and UD_RUSSIAN, the training files have changed, so I updated the corresponding classes.

Additionally, UD_BURYAT, UD_CHINESE_KYOTO and UD_NAIJA were not registered correctly. I fixed that as well, so they can now be used in Flair.
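A quick way to catch this kind of problem is to check that each class name is actually exposed by the package namespace. The sketch below assumes that "registered" means "accessible as an attribute of `flair.datasets`". The helper `missing_names` is hypothetical, and a stand-in namespace is used so the snippet runs without Flair installed.

```python
from types import SimpleNamespace


def missing_names(module, names):
    """Return the entries of `names` that `module` does not expose as attributes."""
    return [name for name in names if not hasattr(module, name)]


# Stand-in for a package namespace (assumption for illustration);
# with Flair installed, pass `flair.datasets` instead.
fake_datasets = SimpleNamespace(UD_ENGLISH=object, UD_NAIJA=object)

wanted = ["UD_BURYAT", "UD_CHINESE_KYOTO", "UD_NAIJA"]
missing = missing_names(fake_datasets, wanted)
print(missing)  # ['UD_BURYAT', 'UD_CHINESE_KYOTO']
```

An empty result means every listed class can be reached from the namespace.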

stefan-it • Mar 26 '24 13:03