Filtering on Numeric Metadata

Open emadelwany opened this issue 5 years ago • 3 comments

I'm trying to filter documents by a numeric metadata field (called order in this instance). I'm using the configuration file to specify that the field is numeric:

# Embedded metadata in document
metadata:

  # What element contains the metadata (relative to documentPath)
  containerPath: metadata

  # What metadata fields do we have?
  fields:

  - name: order
    valuePath: "@order"
    type: numeric

I see this reflected correctly in the generated indexmetadata.yml file:

metadataFields:
    docId:
      displayName: "Doc id"
      uiType: ""
      description: ""
      type: "tokenized"
      analyzer: "DEFAULT"
      unknownValue: "unknown"
      unknownCondition: "NEVER"
      valueListComplete: true
      values:
        "258c8566-cc48-452e-b726-f8f02c3d0bc3": 109
        b1b8a57e-cb3d-4aef-9fad-f29416fe3f74: 114
        "006b534d-bc5e-42ba-81df-4a8a5478a214": 114
        f039853c-68a0-4748-b96e-6b45404c3ecf: 180
        "171ada1f-f90f-439a-baaf-f87da8514766": 240
        "6e81856f-5e0d-4ddd-bdbb-aeaade126e45": 19
        "47759b81-baa6-4923-83ad-fb10a39d1018": 182
        "4c81517d-a01f-4d48-a712-42863e31a315": 942
        "5315dcf3-a8dc-4809-a200-eef6bab67228": 46
        ae793287-e365-4504-9efc-48cd2e8dce9a: 238
        b8152417-4933-410b-8e95-698aedc4285c: 46
        "0c34c8da-8b0c-411d-aa87-00b613bd2065": 154
        d7219914-89fd-443c-81b3-46ab19d84f32: 86
        "6218843c-c51d-42a5-93e0-d0477ee21906": 221
        "2d38164c-abe4-4a98-a32f-68abc9e221b0": 62
        be488f30-a8c1-46db-b0ac-818431f038e6: 114
        "4b8b357d-9d38-48b3-abf8-5825ea0e9d61": 13
        bf47fa96-139b-4746-b3f4-858593ef621b: 1
        "2b085f71-e343-4b47-8e97-b3375f7a85ef": 1
        eed934f8-e9d6-48a7-8a35-0b9324d815ec: 46
        dda297a7-99f7-489c-a30f-fbbd50cc0fee: 156
        "6ebafd9d-07f4-496d-9c2f-31bdd7c422e1": 6
      displayValues: {}
      displayOrder: []
    order:
      displayName: "Order"
      uiType: ""
      description: ""
      type: "numeric"
      analyzer: "DEFAULT"
      unknownValue: "unknown"
      unknownCondition: "NEVER"
      valueListComplete: false
      values:
        "44": 17
        "45": 17
        "46": 14
        "47": 14
        "48": 14
         ......
      displayValues: {}
      displayOrder: []

However, when I add a filter query, for example a point query:

order:0

or a range query:

order:[10 TO 20]

I do not get back any hits, even though, on inspecting the documents in the index, I can see that there are documents that meet the filter criteria (I can confirm this by switching the filter to another metadata field for the same document).

I always get back 0 hits:

<?xml version="1.0" encoding="utf-8" ?>
<blacklabResponse>
    <summary>
        <searchParam>
            <filter>order:[0 TO 200]</filter>
            <first>0</first>
            <indexname>docuser:e9ff4e7f-ade4-4655-9c81-9720b643f70e_IVBM</indexname>
            <number>1000</number>
            <patt>&quot;&quot;</patt>
            <usecontent>orig</usecontent>
        </searchParam>
        <searchTime>1</searchTime>
        <countTime>1</countTime>
        <windowFirstResult>0</windowFirstResult>
        <requestedWindowSize>1000</requestedWindowSize>
        <actualWindowSize>0</actualWindowSize>
        <windowHasPrevious>false</windowHasPrevious>
        <windowHasNext>false</windowHasNext>
        <stillCounting>false</stillCounting>
        <numberOfHits>0</numberOfHits>
        <numberOfHitsRetrieved>0</numberOfHitsRetrieved>
        <stoppedCountingHits>false</stoppedCountingHits>
        <stoppedRetrievingHits>false</stoppedRetrievingHits>
        <numberOfDocs>0</numberOfDocs>
        <numberOfDocsRetrieved>0</numberOfDocsRetrieved>
        <docFields>
            <pidField>sectionId</pidField>
            <titleField>sectionId</titleField>
        </docFields>
        <metadataFieldDisplayNames>
            <docId>Doc id</docId>
            <order>Order</order>
        </metadataFieldDisplayNames>
    </summary>
    <hits>
  </hits>
    <docInfos>
  </docInfos>
</blacklabResponse>

Is this a supported use case?

emadelwany avatar Nov 26 '19 01:11 emadelwany

We don't really use numeric search fields ourselves (other than for a few bookkeeping fields we don't search on), but they should work of course, so this is a bug. I'll try to have a look at it soon, but unfortunately I can't say when I'll have time.

In the meantime, do you absolutely need this field to be numeric? You could try indexing it as a regular field. If you need sorting and range searching to work right, you could add leading zeros to all values. A bit of a pain, but at least that way you should avoid this bug.
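To illustrate the leading-zeros idea, here is a minimal Python sketch of a preprocessing step that pads values before indexing and builds a matching range filter. The helper names (`pad_number`, `padded_range_filter`) and the width of 6 digits are assumptions for illustration only, not part of BlackLab; pick a width that fits your largest value.

```python
def pad_number(value: int, width: int = 6) -> str:
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("zero-padding only preserves order for non-negative values")
    text = str(value)
    if len(text) > width:
        raise ValueError(f"{value} does not fit in {width} digits")
    return text.zfill(width)


def padded_range_filter(field: str, low: int, high: int, width: int = 6) -> str:
    """Build a Lucene-style inclusive range filter over zero-padded values."""
    return f"{field}:[{pad_number(low, width)} TO {pad_number(high, width)}]"


values = [9, 10, 100, 20]
# Plain strings sort lexicographically, which is not numeric order.
assert sorted(map(str, values)) == ["10", "100", "20", "9"]
# Padded strings sort in numeric order.
assert sorted(pad_number(v) for v in values) == ["000009", "000010", "000020", "000100"]

print(padded_range_filter("order", 10, 20))  # order:[000010 TO 000020]
```

The same padding has to be applied both to the values written into the input documents and to the bounds used in the filter, otherwise the range will not line up with the indexed terms.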

jan-niestadt avatar Nov 27 '19 09:11 jan-niestadt

Thanks @jan-niestadt - just wanted to confirm that I'm not missing something obvious.

I resorted to a slightly different approach. We do need numeric range queries, but only over relatively short ranges, so for the time being I'm indexing the values as non-numeric and expanding the range queries into OR clauses. Hacky but effective. If we need longer ranges or sorting, I'll switch to your suggested workaround.
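For anyone wanting to do the same, the expansion can be sketched roughly like this in Python. The function name `expand_range_to_or` and the `max_terms` guard are hypothetical and not taken from my actual code; the guard is there because Lucene limits the number of boolean clauses in a query (1024 by default), so this only suits short ranges.

```python
def expand_range_to_or(field: str, low: int, high: int, max_terms: int = 1024) -> str:
    """Rewrite an inclusive integer range as an OR of exact-match terms."""
    if low > high:
        raise ValueError("low must not exceed high")
    count = high - low + 1
    if count > max_terms:
        raise ValueError(f"range expands to {count} terms, limit is {max_terms}")
    # Each value becomes one term; the field prefix applies to the whole group.
    terms = " OR ".join(str(n) for n in range(low, high + 1))
    return f"{field}:({terms})"


print(expand_range_to_or("order", 10, 13))  # order:(10 OR 11 OR 12 OR 13)
```

The resulting string goes into the filter parameter in place of `order:[10 TO 13]`.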

Thanks for the quick turnaround! I'll keep monitoring this issue for updates, and let you know if I decide to try tackling it myself (to avoid duplicating the effort). I definitely won't attempt a fix until you can at least repro and confirm.

emadelwany avatar Nov 27 '19 17:11 emadelwany

This took way too long. I've looked into it now and I can see that numeric fields are completely broken. The field type is read from the configuration, but is lost during indexing because DocIndexer doesn't seem to take type into account. So the index metadata doesn't even list the field as numeric anymore.

The first step will be to fix that, then figure out why filtering on a numeric range wasn't working even before this broke.

jan-niestadt avatar Jul 13 '22 08:07 jan-niestadt

Testing on the current dev version, the only problem left appeared to be that regular term (range) queries were produced for all field types, even numeric ones. I've now subclassed QueryParser to address this and numeric fields seem to work (at least in the integrated index format).

jan-niestadt avatar Jun 01 '23 12:06 jan-niestadt

Awesome change, thanks @jan-niestadt!

emadelwany avatar Jun 01 '23 16:06 emadelwany