
ANNIS 4 corpus list no longer displays size in tokens/documents

Open amir-zeldes opened this issue 3 years ago • 6 comments

It would be nice to have these like in ANNIS3

amir-zeldes avatar Apr 15 '21 19:04 amir-zeldes

We should discuss how we can express size in a flexible data model. Starting with the multiple segmentation feature, the token count was already a little bit bogus. I think there are two separate use cases for this, with different solutions:

  • Getting a baseline for statistics: this is highly dependent on the research question and the actual corpus data model. The old token count was misleading in this case, and the actual number could be obtained with a query. We might want to extend the example query concept to include "baseline" queries from the corpus authors for typical baseline sizes.
  • Getting a feel for how "large" a corpus is: for this, I'm not sure whether we would want to use something like "number of annotations" instead (which is also easier to query from the database).

thomaskrause avatar Aug 21 '21 15:08 thomaskrause

I think if we don't know anything about a corpus semantically, then the reasonable thing to display is the count result of the tok query. After all, if a Coptic corpus has a norm and a norm_group segmentation, there is no way you can guess which count should be displayed.

That said, I think an alternative count target could be set using corpus properties by the corpus designer, in which case some other sensible number could be taken (for Coptic it may well be norm which most people would expect, since that's the same as "word units" or "things that have a POS tag", and probably similarly for Arabic or Hebrew).

amir-zeldes avatar Aug 22 '21 16:08 amir-zeldes

Ok, so this would be like a configurable query in the corpus-config.toml (http://korpling.github.io/ANNIS/4.1/user-guide/import-and-config/corpus-config.html), which defaults to tok if not specified. I still have to think about how to implement this in a way that makes it both efficient and easy to update when the corpus is changed. I would also like to have more space for the corpus name, because we tend to have longer names and they were trimmed in the previous version. Displaying the size in the corpus browser info window might be a possibility.

thomaskrause avatar Aug 23 '21 09:08 thomaskrause

Space is nice, but please don't remove the size display! Not only do I personally like and use it, but users are known to be disgruntled whenever features disappear...

amir-zeldes avatar Aug 23 '21 14:08 amir-zeldes

Bump - this thing has been a main reason we never upgraded our server to v4!

amir-zeldes avatar Feb 09 '24 14:02 amir-zeldes

I had a quick look at the latest release and see this is still not included, right? I noticed corpus-config.toml already has a base_text_segmentation - couldn't that be used to determine what annotation to count for the basic 'tok' quantity?

amir-zeldes avatar Mar 13 '24 14:03 amir-zeldes