kontext Implement an automatic workflow for preparing corpora data and installing them into a running KonText instance

The user should be able to:

- compile a vertical file into the Manatee binary format
- generate tag variants (if applicable)
- generate text types metadata
- obtain an XML snippet to update corpora.xml

Implementation notes:

- Celery-based
- command line as the primary mode
- plug-in aware
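
A minimal sketch of how the four steps above might be chained from the command line. Everything here is hypothetical: the function names, the CLI arguments, and the `<corpus ident="..."/>` element are assumptions for illustration, not KonText's actual API or the real corpora.xml schema. The steps are stubs that only report what they would do, and plain functions stand in for what would be Celery tasks.

```python
import argparse
from xml.sax.saxutils import quoteattr


def compile_vertical(corpus_id, vertical_path):
    # Stub: a real step would invoke the Manatee compilation tools here.
    return f"compile {vertical_path} -> {corpus_id}"


def generate_tag_variants(corpus_id):
    # Stub: only applicable to corpora that carry tags.
    return f"tag variants for {corpus_id}"


def generate_text_types(corpus_id):
    # Stub: would build the text types metadata.
    return f"text types metadata for {corpus_id}"


def corpus_xml_snippet(corpus_id):
    # Assumed element and attribute names; check them against your corpora.xml.
    return f"<corpus ident={quoteattr(corpus_id)} />"


def run_workflow(corpus_id, vertical_path, with_tags=True):
    """Run the steps in order and return (log lines, XML snippet)."""
    log = [compile_vertical(corpus_id, vertical_path)]
    if with_tags:
        log.append(generate_tag_variants(corpus_id))
    log.append(generate_text_types(corpus_id))
    return log, corpus_xml_snippet(corpus_id)


def main(argv):
    parser = argparse.ArgumentParser(description="Prepare a corpus (sketch)")
    parser.add_argument("corpus_id")
    parser.add_argument("vertical")
    parser.add_argument("--no-tags", action="store_true",
                        help="skip tag variant generation")
    args = parser.parse_args(argv)
    log, snippet = run_workflow(args.corpus_id, args.vertical,
                                with_tags=not args.no_tags)
    for line in log:
        print(line)
    print(snippet)
    return 0
```

Calling `main(["mycorpus", "mycorpus.vert", "--no-tags"])` would run the compile and text types stubs and print the snippet. In a Celery-based version each stub would presumably become a task, with the same ordering expressed as a chain.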

Apr 12 '16 08:04 tomachalek

We need this too. It would be great to do this together. For our use it must be end-user software that will:

allow the user to compile a corpus (a separate task),
let the user run any workflows available for that data (process it),
send it to the next step:
- Download data (maybe not free without the options 2 and 3?)
- Publish data via LINDAT Repo
- Make it searchable in Kontext ...
Similar to SketchEngine’s Corpus Architect, but
- with Treex (and possibly other) backend(s) and
- a post-editing / review annotation option (again a separate app)

We should meet and discuss this in person.

Apr 12 '16 11:04 stranak

OK, I am going to discuss this with @michkren and then let you know (via this thread).

Apr 12 '16 12:04 tomachalek

OK

Apr 12 '16 12:04 stranak

@michkren agrees to meet and discuss the issue. Can you please suggest a date (and possibly a place)? We prefer dates from 25.4.2016 onwards.

Just to make our intentions clear: initially we were thinking about processing that starts from a finalized vertical file (i.e. no raw text parsing, annotation etc.). But once we have an environment for defining tasks and their dependencies, it should be quite easy to incorporate any Lindat-specific or CNC-specific workflows.

I am starting to think about using an existing framework for processing job pipelines (e.g. Luigi looks interesting) and focusing on defining the individual jobs and their combinations. But we will certainly discuss this later.
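
To make "tasks and their dependencies" concrete, here is a rough sketch using only Python's standard `graphlib` module as a stand-in for a pipeline framework. It is not Luigi's API. The job names and the dependencies between them are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each name maps to the set of jobs it depends on.
JOBS = {
    "compile_vertical": set(),
    "tag_variants": {"compile_vertical"},
    "text_types": {"compile_vertical"},
    "xml_snippet": {"tag_variants", "text_types"},
}


def execution_order(jobs):
    """Return the job names in an order that respects all dependencies."""
    return list(TopologicalSorter(jobs).static_order())


def run_all(jobs, runners):
    """Call the runner of each job in dependency order; return the names run."""
    done = []
    for name in execution_order(jobs):
        runners[name]()
        done.append(name)
    return done
```

A Lindat-specific or CNC-specific workflow would then only add entries to the job table (and their runners), leaving the ordering logic untouched.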

Apr 12 '16 13:04 tomachalek

OK, we'll arrange the meeting via other means :-)

As for the processing, we have the Treex::Web environment to set up and execute any pipelines. You are welcome to use it: either the REST service online, or directly in some form. Treex itself is used a lot internally at UFAL on our cluster. Have a look and then we can discuss whether it fits your needs.

Apr 12 '16 14:04 stranak

@tomachalek @stranak Has there been any activity on this front since 2016?

Jan 22 '18 16:01 duhaime

@duhaime Unfortunately no. But I'll try to bring up the issue at our next meeting (between the Czech National Corpus and UFAL) next week. Once I know more about the current priority of the project and its outlook, I'll post it here.

Jan 23 '18 21:01 tomachalek

@tomachalek That sounds great!

Jan 23 '18 23:01 duhaime

@duhaime As I've just found out, the meeting is not next week but on Feb 12. Sorry for the misinformation - but it is still quite soon :-)

Jan 24 '18 07:01 tomachalek

Although we will very likely work on an application for creating user corpora, it will be a separate solution, so that KonText stays reasonably small and maintainable.

Sep 17 '25 10:09 tomachalek