Confusion about the order of results returned by the 'termfreq' feature

Open yua5 opened this issue 10 months ago • 1 comments

Hello, when I use 'termfreq' with this request: http://localhost:8080/blacklab-server/my-index/termfreq?number=1000&outputformat=json, I get a response like this:

{
    "termFreq": {
        "1": 527,
        "2": 446,
        "3": 287,
        "4": 206,
        "5": 142,
        "6": 114,
        "8": 106,
        "10": 152,
        "15": 113,
        "30": 112,
        "1960": 172,
        "1961": 134,
        "the": 69971,
        "of": 36412,
        "and": 28853,
        "to": 26158,
        "a": 23195,
        "in": 21337,
        "that": 10594,
        "is": 10109,
        "was": 9815,
        "he": 9548,
        "for": 9489,
        //...
    }
}

Is it appropriate for these numbers to be placed at the front, ahead of far more frequent words such as "the"? The ordering confuses me. Thank you very much!

yua5, Apr 06 '24 12:04

I understand your confusion; because JSON objects are unordered, the results aren't sorted in any logical way (alphabetical or by frequency).
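If the termfreq output is already in hand, the ordering can also be imposed on the client side. The following is only a minimal sketch: it assumes the response has the `{"termFreq": {term: count}}` shape shown in the question, uses an abbreviated copy of that data, and the helper name `sort_term_freq` is made up for illustration (it is not part of BlackLab).

```python
import json

# Abbreviated copy of the termfreq response from the question (assumed shape).
response_text = (
    '{"termFreq": {"1": 527, "1960": 172, "the": 69971, '
    '"of": 36412, "and": 28853}}'
)


def sort_term_freq(term_freq):
    """Return (term, count) pairs, most frequent first.

    Ties are broken alphabetically by term. Hypothetical helper,
    not part of BlackLab.
    """
    return sorted(term_freq.items(), key=lambda item: (-item[1], item[0]))


ranked = sort_term_freq(json.loads(response_text)["termFreq"])
for term, count in ranked:
    print(f"{term}\t{count}")
```

Sorting after parsing avoids relying on whatever key order the JSON producer or viewer happens to use.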

An alternative to using termfreq is to find all words ([]) and group by word matched (hit:word:i), which does allow you to sort by group size or identity. This is handled in an optimized way internally. Here's how to do it:

/blacklab-server/corpusname/hits?patt=%5B%5D&group=hit%3Aword%3Ai&sort=size&outputformat=json
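The percent-encoded parameters in that URL are easier to read when built from their plain values. As a rough sketch, the request URL can be assembled with Python's standard library; the host and corpus name below are taken from the original question and are assumptions about the setup, and `grouped_hits_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def grouped_hits_url(base_url, corpus):
    """Build the grouped-hits URL described above (hypothetical helper)."""
    params = {
        "patt": "[]",           # match every token
        "group": "hit:word:i",  # group hits by the matched word
        "sort": "size",         # sort groups by size
        "outputformat": "json",
    }
    # urlencode percent-encodes "[", "]" and ":" in the values.
    return f"{base_url}/{corpus}/hits?{urlencode(params)}"


# Host and corpus name as used in the question; adjust for your server.
url = grouped_hits_url("http://localhost:8080/blacklab-server", "my-index")
print(url)
```

Only URL construction is shown here; sending the request and parsing the grouped result are left out because the response format is not shown in this thread.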

More information can be found here.

jan-niestadt, Apr 08 '24 07:04