Using INCEpTION for text classification
Is your feature request related to a problem? Please describe.
INCEpTION is excellent when it comes to annotating relations and entities, but I don't find it suitable for text classification tasks, i.e. when I want to build datasets where the label applies to a whole sequence of text.
Describe the solution you'd like
When I create a new project, the current choices are Basic annotation (Span/relation), Entity Linking (Wikidata) and Standard project. I would like to have an additional choice: Text classification. The annotations would be exportable as a two-column CSV (the text in the first column, the label in the second), or as a JSON (or JSONL) file.
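To make the requested export concrete, here is a minimal sketch of the two target formats. This is not an existing INCEpTION exporter; the `examples` data, the `text`/`label` field names and the helper functions are assumptions for illustration only.

```python
import csv
import io
import json

# Hypothetical annotated examples: (text, label) pairs.
examples = [
    ("The battery lasts all day.", "positive"),
    ("Stopped working after a week.", "negative"),
]


def to_csv(rows):
    """Serialize rows as a two-column CSV: text first, label second."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "label"])  # header row (assumed column names)
    writer.writerows(rows)
    return buffer.getvalue()


def to_jsonl(rows):
    """Serialize rows as JSONL: one JSON object per line."""
    lines = [
        json.dumps({"text": text, "label": label}, ensure_ascii=False)
        for text, label in rows
    ]
    return "\n".join(lines) + "\n"


csv_text = to_csv(examples)
jsonl_text = to_jsonl(examples)
```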
Describe alternatives you've considered
- I could use Doccano for that, but I would lose the inter-annotator features INCEpTION has, and I would be forced to use several tools instead of one (INCEpTION for spans/relations and Doccano for sequence classification).
- I could use the Basic annotation (Span/relation) project, label only the first token with the class the whole sequence belongs to, and find a way to manage the conversion afterwards, but this is kinda tricky.
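For the second alternative, the conversion could look roughly like the sketch below. It assumes the annotations are already available as character offsets: `sentences` as `(begin, end)` pairs and `spans` as `(begin, end, label)` triples. These structures and the function name are hypothetical, not INCEpTION's export model.

```python
def sequence_labels(text, sentences, spans):
    """Give each sentence the label of the span that starts at its first character.

    Sentences without such a span get None as their label.
    """
    label_at = {begin: label for begin, _end, label in spans}
    return [(text[begin:end], label_at.get(begin)) for begin, end in sentences]


# Hypothetical document with two sentences, each labeled on its first token.
text = "Great phone. Battery died fast."
sentences = [(0, 12), (13, 31)]
spans = [(0, 5, "positive"), (13, 20, "negative")]

pairs = sequence_labels(text, sentences, spans)
```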
Additional context
The UI for spans/relations is quite advanced, so I think it wouldn't be hard to make a UI for text classification! Thanks again for the wonderful tool!
See https://inception-project.github.io/releases/0.19.7/docs/user-guide.html#_document_metadata
However, inter-annotator agreement calculation is presently not available for document-level annotations.
When using document-level annotations, the export format must be XMI CAS - the other formats do not support it.
Thanks for the answer. Is it on the roadmap to implement inter-annotator agreement features for text classification?
I found this @reckart : https://colab.research.google.com/github/inception-project/inception-project.github.io/blob/master/_example-projects/python/INCEpTION_Annotations_as_one_sentence_and_label_per_line.ipynb
Is it possible to use the inter-annotator features with that?
@xegulon Not sure what you mean?
If you mean whether you could use that code as a basis to export your data from INCEpTION and then do your agreement calculation externally: you could probably do that.
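As a rough sketch of what such an external calculation might look like once the labels are exported: the function below computes Cohen's kappa for two annotators who each assigned one label per document. Cohen's kappa is just one common choice here, and the input lists are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators, one label per document, same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of documents with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical document-level labels from two annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```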
Regarding agreement for document-level features in the application: we now have an issue for it on the roadmap (this issue here), but the roadmap is rather dynamic - so no particular time for this feature to arrive atm.
Great thanks, I'll cope with the first solution for now I think. Eager to see the coming dedicated UI!
To clarify: it would be great if the text classification UI supported not only assigning a single class, but also several classes (multilabel text classification).
Also, for single text examples that span multiple lines, it would be important to be able to import datasets as JSON(L), rather than only as one sentence per line.
You mean you'd like a format that imports "one document per line"?
In a way, yes. But this would only be possible with JSON files. The goal is to be able to handle texts that contain newline characters.
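To illustrate why JSON(L) works for this: newline characters inside a text are escaped as `\n` within the JSON string, so each document still occupies exactly one physical line. A minimal sketch, with made-up documents and assumed `text`/`label` field names:

```python
import json

# Two hypothetical documents, one per physical line; the first text
# contains newline characters, escaped as \n inside the JSON string.
jsonl_input = (
    '{"text": "Dear team,\\nthe build fails on CI.\\nRegards", "label": "bug"}\n'
    '{"text": "Please add dark mode.", "label": "feature"}\n'
)


def read_documents(jsonl):
    """Parse one JSON object per physical line, skipping blank lines."""
    return [json.loads(line) for line in jsonl.split("\n") if line.strip()]


documents = read_documents(jsonl_input)
```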
I have a similar use case: multi-label document classification, e.g. keyphrase extraction/generation. I would also like to use an external recommender to do some active learning, starting with an unsupervised model, which would help build our dataset (or import an existing one). But I didn't find a way to configure such an external recommender. I think it would also be very helpful to have this kind of feature. In any case, thanks for this wonderful tool!
I am wondering about this functionality as well. My team needs document-level annotation, and we need agreement scores for it. Is there a plan to implement agreement scores for document-level annotations?
I don't see an issue for document-level agreement in our tracker yet. Feel free to add one. Note though that having an issue is just to keep it on the radar - it does not make it a priority.