BlackLab
Confusion about the order of results from the 'termfreq' feature
Hello, when I use 'termfreq' with this request: http://localhost:8080/blacklab-server/my-index/termfreq?number=1000&outputformat=json
I get a response like this:
{
  "termFreq": {
    "1": 527,
    "2": 446,
    "3": 287,
    "4": 206,
    "5": 142,
    "6": 114,
    "8": 106,
    "10": 152,
    "15": 113,
    "30": 112,
    "1960": 172,
    "1961": 134,
    "the": 69971,
    "of": 36412,
    "and": 28853,
    "to": 26158,
    "a": 23195,
    "in": 21337,
    "that": 10594,
    "is": 10109,
    "was": 9815,
    "he": 9548,
    "for": 9489,
    //...
  }
}
Is it intended that these numbers appear at the front? It confuses me. Thank you very much!
I understand your confusion; because JSON objects are unordered, the results aren't sorted in any logical way (alphabetical or by frequency).
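If you want to keep using termfreq, one workaround is to sort the entries yourself after parsing the response. The following is only a minimal sketch: the sample data is abbreviated from the response quoted above, and the function name `sort_term_freq` is made up for illustration, not part of BlackLab.

```python
import json

# Abbreviated sample in the same shape as the termfreq response quoted above.
response_text = (
    '{"termFreq": {"1": 527, "1960": 172, '
    '"the": 69971, "of": 36412, "and": 28853}}'
)


def sort_term_freq(text):
    """Return (term, count) pairs, highest count first, ties broken by term."""
    term_freq = json.loads(text)["termFreq"]
    return sorted(term_freq.items(), key=lambda item: (-item[1], item[0]))


ranked = sort_term_freq(response_text)
for term, count in ranked:
    print(term, count)
```

Because the result is a list of pairs rather than a JSON object, the order is explicit and no longer depends on how the client or viewer orders object keys.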
An alternative to using termfreq is to find all words ([]) and group by word matched (hit:word:i), which does allow you to sort by group size or identity. This is handled in an optimized way internally. Here's how to do it:
/blacklab-server/corpusname/hits?patt=%5B%5D&group=hit%3Aword%3Ai&sort=size&outputformat=json
More information can be found here.
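The grouped request above can also be built programmatically, so that the pattern and grouping values are percent-encoded for you. This is a rough sketch only: the helper name `grouped_hits_url` is hypothetical, and the base URL and corpus name are taken from the question as assumptions about your setup.

```python
from urllib.parse import urlencode


def grouped_hits_url(base_url, corpus, sort="size"):
    """Build a hits URL that matches every token and groups by word."""
    params = {
        "patt": "[]",           # match any single token
        "group": "hit:word:i",  # group by matched word, as in the answer above
        "sort": sort,           # "size" or "identity", per the answer above
        "outputformat": "json",
    }
    # urlencode percent-encodes "[]" and ":" in the parameter values.
    return f"{base_url}/{corpus}/hits?{urlencode(params)}"


# Assumed server location and corpus name, copied from the question.
url = grouped_hits_url("http://localhost:8080/blacklab-server", "my-index")
print(url)
```

Passing sort="identity" instead should give the same groups ordered by the word itself rather than by frequency.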