Implement an automatic workflow for preparing corpora data and installing them into a running KonText instance
The user should be able to:
- compile a vertical file into Manatee binary format
- generate tag variants (if applicable)
- generate text types metadata
- obtain an XML snippet to update corpora.xml
Implementation notes:
- Celery based
- command-line as a primary mode
- plug-in aware
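
To make the intended workflow more concrete, here is a minimal sketch of how the four steps above could be expressed as tasks with dependencies. It uses only the Python standard library (`graphlib`, Python 3.9+) rather than Celery. The step names, the assumed ordering between them, and the `plan`/`run` helpers are all hypothetical illustrations, not existing KonText code.

```python
from graphlib import TopologicalSorter

# Hypothetical step names mirroring the list above. Each step maps to the
# set of steps it is ASSUMED to depend on (the real ordering may differ).
STEPS = {
    "compile_vertical": set(),
    "generate_tag_variants": {"compile_vertical"},
    "generate_text_types": {"compile_vertical"},
    "emit_corpora_xml_snippet": {"generate_tag_variants", "generate_text_types"},
}


def plan(steps, skip=()):
    """Return step names in a dependency-respecting order.

    Steps listed in `skip` (e.g. tag variants, when not applicable) are
    dropped, and dependencies on them are removed as well.
    """
    skip = set(skip)
    graph = {name: deps - skip for name, deps in steps.items() if name not in skip}
    return list(TopologicalSorter(graph).static_order())


def run(steps, actions, skip=()):
    """Execute one callable per step, passing along the results so far."""
    results = {}
    for name in plan(steps, skip):
        results[name] = actions[name](results)
    return results


if __name__ == "__main__":
    # Placeholder actions; real ones would call the corpus compiler etc.
    actions = {name: (lambda results, n=name: f"{n}: done") for name in STEPS}
    for step, outcome in run(STEPS, actions, skip=["generate_tag_variants"]).items():
        print(step, "->", outcome)
```

A command-line front end could map its arguments onto `skip`, and a plug-in could contribute extra entries to `STEPS` and `actions` without touching the runner.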
We need this too. It would be great to do this together. For our use it must be end-user software that will:
- allow the user to compile a corpus (a separate task),
- let the user run any workflows available for that data (process it),
- send it to the next step:
- Download data (maybe not free without the options 2 and 3?)
- Publish data via LINDAT Repo
- Make it searchable in Kontext ...
- Similar to SketchEngine’s Corpus Architect, but
- with Treex (and other possible) backend(s) and
- post-editing / review annotation option (again a separate app)
We should meet and discuss this in person.
OK, I am going to discuss this with @michkren and then let you know (via this thread).
OK
@michkren agrees to meet and discuss the issue. Can you please suggest a date (and possibly a place)? We prefer dates starting from 25.4.2016
Just to make our intentions clear: initially we were thinking about processing that starts from a finalized vertical file (i.e. no raw text parsing, annotation etc.). But once we have an environment for defining tasks and their dependencies, it should be quite easy to incorporate any Lindat-specific or CNC-specific workflows.
I am starting to think about using some existing framework for processing job pipelines (e.g. Luigi looks interesting) and focusing on defining individual jobs and their combinations. But we will certainly discuss this later.
OK, we'll solve the meeting via other means :-)
As for the processing, we have the Treex::Web environment to set up and execute any pipelines. You are welcome to use it: either the REST service online, or directly in some form. Treex itself is used a lot internally at UFAL on our cluster. Have a look and then we can discuss whether it fits your needs.
@tomachalek @stranak Has there been any activity on this front since 2016?
@duhaime Unfortunately no. But I'll try to bring up the issue at our next meeting between the Czech National Corpus and UFAL next week. Once I know more about the current priority of the project and its outlook, I'll post it here.
@tomachalek That sounds great!
@duhaime As I've just found out, the meeting is not next week but on Feb 12. Sorry for the misinformation, but it is still quite soon :-)
Although we will very likely work on an application for creating user corpora, it will most likely be a separate solution, so that KonText stays reasonably small and maintainable.