BlackLab
Filtering on Numeric Metadata
I'm trying to filter documents by a numeric metadata field (called order in this instance). I'm using the configuration file to specify that the field is numeric:
# Embedded metadata in document
metadata:
  # What element contains the metadata (relative to documentPath)
  containerPath: metadata
  # What metadata fields do we have?
  fields:
  - name: order
    valuePath: "@order"
    type: numeric
And I see this reflected correctly in the generated indexmetadata.yml file:
metadataFields:
  docId:
    displayName: "Doc id"
    uiType: ""
    description: ""
    type: "tokenized"
    analyzer: "DEFAULT"
    unknownValue: "unknown"
    unknownCondition: "NEVER"
    valueListComplete: true
    values:
      "258c8566-cc48-452e-b726-f8f02c3d0bc3": 109
      b1b8a57e-cb3d-4aef-9fad-f29416fe3f74: 114
      "006b534d-bc5e-42ba-81df-4a8a5478a214": 114
      f039853c-68a0-4748-b96e-6b45404c3ecf: 180
      "171ada1f-f90f-439a-baaf-f87da8514766": 240
      "6e81856f-5e0d-4ddd-bdbb-aeaade126e45": 19
      "47759b81-baa6-4923-83ad-fb10a39d1018": 182
      "4c81517d-a01f-4d48-a712-42863e31a315": 942
      "5315dcf3-a8dc-4809-a200-eef6bab67228": 46
      ae793287-e365-4504-9efc-48cd2e8dce9a: 238
      b8152417-4933-410b-8e95-698aedc4285c: 46
      "0c34c8da-8b0c-411d-aa87-00b613bd2065": 154
      d7219914-89fd-443c-81b3-46ab19d84f32: 86
      "6218843c-c51d-42a5-93e0-d0477ee21906": 221
      "2d38164c-abe4-4a98-a32f-68abc9e221b0": 62
      be488f30-a8c1-46db-b0ac-818431f038e6: 114
      "4b8b357d-9d38-48b3-abf8-5825ea0e9d61": 13
      bf47fa96-139b-4746-b3f4-858593ef621b: 1
      "2b085f71-e343-4b47-8e97-b3375f7a85ef": 1
      eed934f8-e9d6-48a7-8a35-0b9324d815ec: 46
      dda297a7-99f7-489c-a30f-fbbd50cc0fee: 156
      "6ebafd9d-07f4-496d-9c2f-31bdd7c422e1": 6
    displayValues: {}
    displayOrder: []
  order:
    displayName: "Order"
    uiType: ""
    description: ""
    type: "numeric"
    analyzer: "DEFAULT"
    unknownValue: "unknown"
    unknownCondition: "NEVER"
    valueListComplete: false
    values:
      "44": 17
      "45": 17
      "46": 14
      "47": 14
      "48": 14
      ......
    displayValues: {}
    displayOrder: []
However, when I add a filter query, e.g. a point query:
order:0
or a range query:
order:[10 TO 20]
I do not get back any hits, even though inspecting documents in the index shows that there exist documents that meet the filter criteria (I can confirm this by switching the filter to another metadata field for the same document).
I always get back 0 hits:
<?xml version="1.0" encoding="utf-8" ?>
<blacklabResponse>
<summary>
<searchParam>
<filter>order:[0 TO 200]</filter>
<first>0</first>
<indexname>docuser:e9ff4e7f-ade4-4655-9c81-9720b643f70e_IVBM</indexname>
<number>1000</number>
<patt>""</patt>
<usecontent>orig</usecontent>
</searchParam>
<searchTime>1</searchTime>
<countTime>1</countTime>
<windowFirstResult>0</windowFirstResult>
<requestedWindowSize>1000</requestedWindowSize>
<actualWindowSize>0</actualWindowSize>
<windowHasPrevious>false</windowHasPrevious>
<windowHasNext>false</windowHasNext>
<stillCounting>false</stillCounting>
<numberOfHits>0</numberOfHits>
<numberOfHitsRetrieved>0</numberOfHitsRetrieved>
<stoppedCountingHits>false</stoppedCountingHits>
<stoppedRetrievingHits>false</stoppedRetrievingHits>
<numberOfDocs>0</numberOfDocs>
<numberOfDocsRetrieved>0</numberOfDocsRetrieved>
<docFields>
<pidField>sectionId</pidField>
<titleField>sectionId</titleField>
</docFields>
<metadataFieldDisplayNames>
<docId>Doc id</docId>
<order>Order</order>
</metadataFieldDisplayNames>
</summary>
<hits>
</hits>
<docInfos>
</docInfos>
</blacklabResponse>
Is this a supported use case?
We don't really use numeric search fields ourselves (other than for a few bookkeeping fields we don't search on), but they should work of course, so this is a bug. I'll try to have a look at it soon, but unfortunately I can't say when I'll have time.
In the meantime, do you absolutely need this field to be numeric? You could try indexing it as a regular field. If you need sorting and range searching to work right, you could add leading zeros to all values. A bit of a pain, but at least that way you should avoid this bug.
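For anyone trying the leading-zeros workaround, here is a minimal sketch of the idea in Python. It is not BlackLab code: `pad_order`, `padded_range_filter`, and the width of 6 are hypothetical names and values chosen for illustration, and it assumes all values are non-negative integers that fit in the chosen width. The point is that fixed-width strings sort lexicographically in the same order as the numbers they represent, so a plain term range query behaves like a numeric one.

```python
def pad_order(value: int, width: int = 6) -> str:
    """Left-pad a non-negative integer with zeros to a fixed width (hypothetical helper)."""
    if value < 0:
        raise ValueError("zero-padding only orders non-negative values correctly")
    text = str(value)
    if len(text) > width:
        raise ValueError(f"{value} does not fit in {width} digits")
    return text.zfill(width)


def padded_range_filter(field: str, low: int, high: int, width: int = 6) -> str:
    """Build a Lucene-style term range filter over zero-padded values."""
    return f"{field}:[{pad_order(low, width)} TO {pad_order(high, width)}]"


if __name__ == "__main__":
    raw = [9, 100, 20, 3]
    # Unpadded strings sort lexicographically, so "100" lands before "20".
    print(sorted(str(v) for v in raw))
    # Padded strings sort in the same order as the underlying numbers.
    print(sorted(pad_order(v) for v in raw))
    print(padded_range_filter("order", 10, 20))
```

The same padding would have to be applied to the values before indexing and to the bounds of every filter query, otherwise the two will not line up.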
Thanks @jan-niestadt - just wanted to confirm that I'm not missing something obvious.
I resorted to a slightly different approach. We do need numeric range queries, but only on relatively short ranges, so for the time being I'm indexing the values as non-numeric and expanding the range queries into OR clauses. Hacky but effective. If we need longer ranges or sorting, I'll switch to your suggested workaround.
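Roughly, the expansion looks like the sketch below. This is an illustration rather than the exact code in use: `expand_range_filter` and its `max_terms` guard are hypothetical, and it assumes the filter is parsed with Lucene's classic query syntax, where `field:(a OR b)` applies the field to every term in the group.

```python
def expand_range_filter(field: str, low: int, high: int, max_terms: int = 1000) -> str:
    """Rewrite an inclusive integer range as an OR of exact terms (hypothetical helper)."""
    if low > high:
        raise ValueError("low must not exceed high")
    count = high - low + 1
    # Guard against very long ranges; the limit here is an arbitrary assumption.
    if count > max_terms:
        raise ValueError(f"range expands to {count} terms, more than {max_terms}")
    terms = " OR ".join(str(v) for v in range(low, high + 1))
    return f"{field}:({terms})"


if __name__ == "__main__":
    # Stands in for the range filter order:[10 TO 13] on a non-numeric field.
    print(expand_range_filter("order", 10, 13))
```

This only stays practical for short ranges, since every value in the range becomes its own clause in the query.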
Thanks for the quick turnaround! I'll keep monitoring this issue for updates, and let you know if I decide to try tackling it myself (to avoid duplicating the effort). I definitely won't attempt a fix until you can at least repro and confirm.
This took way too long. I've looked into it now and I can see that numeric fields are completely broken. The field type is read from the configuration, but is lost during indexing because DocIndexer doesn't seem to take type into account. So the index metadata doesn't even list the field as numeric anymore.
The first step will be to fix that, then figure out why filtering on a numeric range wasn't working even before this broke.
Testing on the current dev version, the only problem left appeared to be that regular term (range) queries were produced for all field types, even numeric ones. I've now subclassed QueryParser to address this and numeric fields seem to work (at least in the integrated index format).
Awesome change, thanks @jan-niestadt!