Querying documents that contain a specific key in JSON field

Open Nickersoft opened this issue 2 years ago • 5 comments

Hey folks,

Was going to maybe file this as a bug, but I'm also new to Tantivy, so I don't want to rule out that maybe I'm just doing this wrong lol. I currently am trying to index a number of documents and their translations using the following schema:

    schema_builder.add_u64_field("id", FAST | INDEXED);
    schema_builder.add_text_field("text", TEXT | STORED);
    schema_builder.add_text_field("language", STRING);
    schema_builder.add_u64_field("length", FAST | INDEXED);
    schema_builder.add_json_field("translations", STORED | INDEXED);

The translations field is a JSON object mapping language codes (such as fra or spa) to their non-Tantivy document IDs. I want to be able to search for documents and filter out ones that don't have a translation in a specific language (i.e. don't have a key for that language in the translations object). I'm currently doing this via the following query:

    let query_str = format!(
        "{} AND language:eng AND translations.fra:+",
        query
    );

Which should search for all documents containing my query that are English but have French translations. However, Tantivy still returns documents that don't contain a fra key in the translations object. If I replace the + with an actual ID, it'll return a single document that has that specific ID under its fra key. Yet I can't seem to just generally query for documents that just have something under that key.
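To make the intended semantics concrete, here is a minimal sketch in plain Rust that does not use Tantivy at all. The `Doc` struct, the sample data, and the helper functions are hypothetical stand-ins that mirror the schema above. The sketch only contrasts the "key exists" filter I'm after with the "exact value" match that already works, and it leaves out the full-text part of the query.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for one indexed document (not a Tantivy type).
struct Doc {
    id: u64,
    language: &'static str,
    // language code -> external (non-Tantivy) document ID
    translations: HashMap<&'static str, u64>,
}

// "Key exists" filter: documents in `language` that have any value under `target`.
fn with_translation<'a>(docs: &'a [Doc], language: &str, target: &str) -> Vec<&'a Doc> {
    docs.iter()
        .filter(|d| d.language == language && d.translations.contains_key(target))
        .collect()
}

// "Exact value" filter: documents whose `target` key holds exactly `id`.
fn with_translation_id<'a>(docs: &'a [Doc], target: &str, id: u64) -> Vec<&'a Doc> {
    docs.iter()
        .filter(|d| d.translations.get(target) == Some(&id))
        .collect()
}

// Made-up sample data, for illustration only.
fn sample_docs() -> Vec<Doc> {
    vec![
        Doc { id: 1, language: "eng", translations: HashMap::from([("fra", 101), ("spa", 201)]) },
        Doc { id: 2, language: "eng", translations: HashMap::from([("spa", 202)]) },
        Doc { id: 3, language: "deu", translations: HashMap::from([("fra", 103)]) },
    ]
}

fn main() {
    let docs = sample_docs();
    // Only document 1 is English and has a `fra` key.
    let ids: Vec<u64> = with_translation(&docs, "eng", "fra").iter().map(|d| d.id).collect();
    println!("eng docs with a fra translation: {:?}", ids);
    // Exact match on the value stored under `fra`.
    let exact: Vec<u64> = with_translation_id(&docs, "fra", 103).iter().map(|d| d.id).collect();
    println!("docs whose fra translation is 103: {:?}", exact);
}
```

The first helper is the behaviour I expected from `translations.fra:+`. The second is what I get today when I replace the `+` with an actual ID.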

Is there something I'm overlooking, or is this a misbehavior on Tantivy's part? My repo is open-source, so I can put together repro steps if needed. Hopefully this all makes sense.

Thanks!

Nickersoft avatar Jan 30 '23 05:01 Nickersoft

Related to https://github.com/quickwit-oss/tantivy/issues/1833

PSeitz avatar Jan 31 '23 07:01 PSeitz

What does + mean here, did you mean translations.fra:*?

I think * is not supported on a field value. It would make sense though.

PSeitz avatar Jan 31 '23 07:01 PSeitz

Ah yes, I tried both + and * and had luck with neither lol. I thought maybe it was like regex and * would match empty fields, and + could match fields where there is at least 1 character as a value. So is there currently no workaround as to how I could achieve this kind of query or data structure? I've been racking my brain haha.

Nickersoft avatar Jan 31 '23 19:01 Nickersoft

You could test the ExistsQuery from that PR. It may need some adjustment for JSON fields, though.

PSeitz avatar Feb 02 '23 06:02 PSeitz

Given the fact that there is no movement on that PR, and I'm not really sure what the code is doing tbh (had trouble finding docs on how the Weight struct + scoring actually works), are there any workarounds or alternatives? This issue is currently blocking all progress on the project I'm working on 😅

Nickersoft avatar Apr 16 '23 23:04 Nickersoft