ELMoForManyLangs
Document what tokenisation was used for the offered models
Closed issue #45 indicates that UDPipe was used for tokenisation, and __main__.py
suggests that the expanded form of CoNLL-U multiword tokens is used, e.g. the 2 tokens "de le" instead of the surface token "du" in French. The README should mention both.
However, the config.json
of a downloaded model suggests that the model was not trained on a CoNLL-U file: "train_path": "/users4/conll18st/raw_text/Czech/cs-20m.raw"
. Is this for historic reasons, i.e. was the CoNLL-U input format only added to elmoformanylangs later, and was an external conllu-to-raw converter used at the time?
cs-20m.raw was obtained from an external conllu-to-raw script. The original data can be found at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 and it was preprocessed by UDPipe.
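The external conllu-to-raw script itself is not part of this repository, so the following is only a minimal sketch of what such a conversion typically looks like: it keeps the expanded syntactic words (e.g. "de le") and skips the multiword-token surface lines (e.g. an ID range like "1-2" with form "du"), emitting one space-separated sentence per line. The function name and exact behaviour are assumptions, not the maintainers' actual script.

```python
def conllu_to_raw(conllu_text: str) -> str:
    """Convert CoNLL-U text to raw tokenised text, one sentence per line.

    Uses the expanded syntactic words and skips multiword-token range
    lines (ID like "1-2") and empty nodes (ID like "1.1").
    NOTE: illustrative sketch only, not the script used for cs-20m.raw.
    """
    sentences, tokens = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                        # blank line = sentence boundary
            if tokens:
                sentences.append(" ".join(tokens))
                tokens = []
            continue
        if line.startswith("#"):            # comment / metadata line
            continue
        fields = line.split("\t")
        tok_id, form = fields[0], fields[1]
        if "-" in tok_id or "." in tok_id:  # MWT range or empty node: skip
            continue
        tokens.append(form)
    if tokens:                              # flush a trailing sentence
        sentences.append(" ".join(tokens))
    return "\n".join(sentences)
```

For the French example above, a sentence containing the surface token "du" (annotated as a multiword token spanning "de" and "le") would come out as "de le ..." in the raw output, matching what __main__.py expects.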