ELMoForManyLangs

Document what tokenisation was used for the offered models

Open jowagner opened this issue 6 years ago • 2 comments

Closed issue #45 indicates that UDPipe was used for tokenisation, and __main__.py suggests that you use the expanded form of CoNLL-U multiword tokens, e.g. the two tokens "de le" instead of "du" in French. The README should mention both.

jowagner avatar Feb 15 '19 12:02 jowagner

However, the config.json of a downloaded model suggests that the model was not trained on a conllu file: "train_path": "/users4/conll18st/raw_text/Czech/cs-20m.raw". Are there historical reasons for this, i.e. was the conllu input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

jowagner avatar Feb 17 '19 15:02 jowagner

was the conllu input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

cs-20m.raw was obtained with an external conllu-to-raw script. The original data can be found at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 and was preprocessed with UDPipe.
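For readers who want to prepare input in the same spirit, here is a minimal sketch of what such a conllu-to-raw conversion could look like. It is not the script that was actually used (that script is not shown in this thread); the function name `conllu_to_raw` and the sample sentence are made up for illustration. It assumes standard CoNLL-U, where multiword tokens have range IDs such as `3-4` and empty nodes have decimal IDs such as `5.1`, and it keeps only the expanded syntactic words, so "du" becomes "de le".

```python
def conllu_to_raw(conllu_text):
    """Return one space-joined line of syntactic words per sentence.

    Hypothetical helper, not part of elmoformanylangs.
    """
    sentences, words = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        if line.startswith("#"):
            # Skip sentence-level comments such as "# text = ...".
            continue
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token range lines ("3-4") and empty nodes ("5.1");
        # the expanded words ("3", "4") follow on their own lines.
        if "-" in token_id or "." in token_id:
            continue
        words.append(cols[1])  # FORM is the second column
    if words:
        sentences.append(" ".join(words))
    return sentences


# Made-up French example containing the multiword token "du" = "de" + "le".
SAMPLE = "\n".join([
    "# text = Il parle du chat",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tparle\tparler\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tchat\tchat\tNOUN\t_\t_\t2\tobl\t_\t_",
    "",
])

if __name__ == "__main__":
    for sentence in conllu_to_raw(SAMPLE):
        print(sentence)  # prints "Il parle de le chat"
```

Whether the offered models saw exactly this form during training depends on the original script, so treat the output format (one sentence per line, words separated by single spaces) as an assumption.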

Oneplus avatar Feb 19 '19 00:02 Oneplus