Enhance support for loading/caching corpora/datasets

Open reckart opened this issue 8 years ago • 0 comments

[ ] drop id and base all actions on groupId/datasetId/version/language/mediaType
[ ] migrate the UD dataset to DatasetFactory. The problem is that the dataset is very large and manually creating all the description files is tedious. With the old approach, we could obtain the dataset information after downloading, but to integrate it into the documentation we now need it statically.
[ ] automatically augment Known corpora in documentation with integrated datasets during documentation generation
[ ] show list of readers for each corpus / link media type to readers supporting that media type
[ ] add API to query for datasets based on their properties, e.g. get all German CoNLL 2006 datasets
[ ] add information about annotation types that can be obtained from the datasets (e.g. Token, Sentence, POS, etc.)
[ ] add tagset information
[ ] consider adding a roles section inside the artifacts
[ ] ...?
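
To make the property-based query item more concrete, here is a rough sketch of what such a lookup might look like. Everything in it is an assumption for discussion: `DatasetQuerySketch`, `DatasetCoordinates`, `find`, and `sample` are hypothetical names and not part of the existing DatasetFactory API, and the sample entries (including the media type string) are made-up placeholders. The fields mirror the groupId/datasetId/version/language/mediaType coordinates from the first item.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names come from the existing code base.
class DatasetQuerySketch {

    // Identifies a dataset by the coordinates proposed in the first checklist item.
    static final class DatasetCoordinates {
        final String groupId;
        final String datasetId;
        final String version;
        final String language;
        final String mediaType;

        DatasetCoordinates(String groupId, String datasetId, String version,
                String language, String mediaType) {
            this.groupId = groupId;
            this.datasetId = datasetId;
            this.version = version;
            this.language = language;
            this.mediaType = mediaType;
        }
    }

    // A null criterion acts as a wildcard and matches any value.
    private static boolean matches(String criterion, String value) {
        return criterion == null || criterion.equals(value);
    }

    // Returns all datasets whose language and media type match the given criteria.
    static List<DatasetCoordinates> find(List<DatasetCoordinates> all,
            String language, String mediaType) {
        List<DatasetCoordinates> result = new ArrayList<>();
        for (DatasetCoordinates d : all) {
            if (matches(language, d.language) && matches(mediaType, d.mediaType)) {
                result.add(d);
            }
        }
        return result;
    }

    // Placeholder entries for illustration; these are not real dataset descriptions.
    static List<DatasetCoordinates> sample() {
        List<DatasetCoordinates> all = new ArrayList<>();
        all.add(new DatasetCoordinates("org.example", "corpus-a", "1.0", "de",
                "text/x.example-conll2006"));
        all.add(new DatasetCoordinates("org.example", "corpus-b", "1.0", "en",
                "text/x.example-conll2006"));
        all.add(new DatasetCoordinates("org.example", "corpus-c", "2.0", "de",
                "text/plain"));
        return all;
    }
}
```

With this shape, "all German CoNLL 2006 datasets" would be something like `DatasetQuerySketch.find(all, "de", "text/x.example-conll2006")`. Whether the query should take individual parameters, a builder, or a predicate is left open here.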