
Dynamic Schema

Open PSeitz opened this issue 9 months ago • 0 comments

Problem Outline

  1. tantivy root fields can't be indexed without explicit configuration, e.g. there's no "index everything" mode. The workaround is to copy them into a separate JSON field.
  2. Nested fields can't be configured differently from their parent for JSON fields. The workaround is to copy them into a separate field. (https://github.com/quickwit-oss/quickwit/issues/3896)

That's a lot of copying, and the copied data is also difficult to untangle when searching. E.g. there's special code behind a feature flag for the quickwit _dynamic field.

Core Issues:

  1. Configuration
  2. Indexing Performance (affected by copying).
  3. Untangle copied data on retrieval

Requirements

  1. Allow configurations for any nested field
  2. Enable schemaless indexing of root level fields
  3. Ensure quick access to a value's current config without compromising indexing performance
  4. Avoid overhead of field name->config resolution where possible

Proposal

The proposed changes affect schema definition, indexing and the Index data-structures.

Schema API

The schema allows configuring a root (catch-all) config. Instead of configuring fields, we define configurations on paths. Each configuration gets a Configuration ID that can be referenced.

Example JSON

{
    "a": "b", // will be indexed with root config
    "severity": "INFO",
    "attributes": {
        "price": 100, "tags": ["blue"]
    }
}

let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.set_dynamic_schema(JsonObjectOptions{fast: true, indexed: true}); // All fields will be fast and indexed
schema_builder.add_json_field("attributes", DISABLED); // ignore (also nested fields). Do we need this feature?
schema_builder.add_u64_field("attributes.price", FAST | INDEXED); // overrule nested `attributes.price` to be fast and indexed
schema_builder.add_text_field("severity", STRING);
let schema = schema_builder.build();

let attr_config_id = schema.get_config_id("attributes.price"); // FAST | INDEXED | U64
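
To make the path-based lookup concrete, here is a minimal sketch of how a path could resolve to a Configuration ID. Everything in it is hypothetical and not part of tantivy: `PathSchema`, `PathConfig`, `ConfigId` and `add_path` are illustration names, and the fallback order (exact path, then closest configured ancestor, then root config) is an assumption about how the proposal might behave. It also assumes keys contain no literal dots.

```rust
use std::collections::HashMap;

/// Hypothetical handle to a registered configuration (not a tantivy type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ConfigId(pub u32);

/// Hypothetical per-path options, loosely modelled on the flags in the proposal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PathConfig {
    pub fast: bool,
    pub indexed: bool,
    pub disabled: bool,
}

/// Hypothetical schema that stores configurations per dotted path.
pub struct PathSchema {
    configs: Vec<PathConfig>,
    by_path: HashMap<String, ConfigId>,
    root: Option<ConfigId>,
}

impl PathSchema {
    pub fn new() -> Self {
        PathSchema { configs: Vec::new(), by_path: HashMap::new(), root: None }
    }

    fn register(&mut self, config: PathConfig) -> ConfigId {
        let id = ConfigId(self.configs.len() as u32);
        self.configs.push(config);
        id
    }

    /// Sets the root (catch-all) config used when no path matches.
    pub fn set_dynamic_schema(&mut self, config: PathConfig) -> ConfigId {
        let id = self.register(config);
        self.root = Some(id);
        id
    }

    /// Registers a config for an exact dotted path such as "attributes.price".
    pub fn add_path(&mut self, path: &str, config: PathConfig) -> ConfigId {
        let id = self.register(config);
        self.by_path.insert(path.to_string(), id);
        id
    }

    /// Resolves a path: exact match first, then the closest configured
    /// ancestor, then the root config (None if no root config is set).
    pub fn get_config_id(&self, path: &str) -> Option<ConfigId> {
        let mut candidate = path;
        loop {
            if let Some(id) = self.by_path.get(candidate) {
                return Some(*id);
            }
            match candidate.rfind('.') {
                Some(pos) => candidate = &candidate[..pos],
                None => return self.root,
            }
        }
    }

    pub fn config(&self, id: ConfigId) -> &PathConfig {
        &self.configs[id.0 as usize]
    }
}
```

With this sketch, `attributes.tags` would inherit the disabled `attributes` config, `attributes.price` would get its own config, and `a` would fall back to the root config.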

Document API

Documents would return (Path, Config ID, Value) pairs.

fn iter_fields_and_values(&self, schema: &Schema) -> impl Iterator<Item = (&str, ConfigId, Value)>

Returning a Config ID would keep the limitation that we only iterate over configured fields. The problem is that the API gets very ugly with regard to the Object and Array iterator types on Value. To mitigate this, we can switch to accepting only flattened paths, so nested structures would be flattened. We flatten them anyway while indexing.

We can provide an Iterator helper, which does the flattening and lookup from the schema to fetch the config id.

fn iter_fields_and_values(&self, path_writer: &mut JsonPathWriter, schema: &Schema) -> impl Iterator<Item = (&str, ConfigId, LeafValue)> {
    // get_iter() yields (&str, Value) pairs.
    let vals_iter = self.get_iter();
    // Flattens nested values and looks up each path's config ID in the schema.
    FlattenValueIter(vals_iter, path_writer, schema) // -> impl Iterator<Item = (&str, ConfigId, LeafValue)>
}
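
As an illustration of what such a flattening helper might do, here is a minimal, eager sketch (it fills a `Vec` rather than returning a lazy iterator). The `Value` and `LeafValue` enums are hypothetical stand-ins, the config ID is a plain `u32`, and the schema lookup is passed in as a closure. Skipping paths for which the lookup returns `None` is an assumption that mirrors "we only iterate over configured fields".

```rust
/// Hypothetical minimal value tree (stands in for a JSON value type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    U64(u64),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Leaf values only: what remains after flattening.
#[derive(Debug, Clone, PartialEq)]
pub enum LeafValue {
    Str(String),
    U64(u64),
}

/// Emits a leaf if the lookup knows a config ID for the current path.
fn push_leaf(
    path: &str,
    leaf: LeafValue,
    lookup: &dyn Fn(&str) -> Option<u32>,
    out: &mut Vec<(String, u32, LeafValue)>,
) {
    if let Some(config_id) = lookup(path) {
        out.push((path.to_string(), config_id, leaf));
    }
}

/// Recursively flattens `value` into (dotted path, config id, leaf) triples.
/// `path` is a reusable buffer, similar in spirit to a path writer.
/// Array elements share the path of the array itself.
pub fn flatten(
    value: &Value,
    path: &mut String,
    lookup: &dyn Fn(&str) -> Option<u32>,
    out: &mut Vec<(String, u32, LeafValue)>,
) {
    match value {
        Value::Str(s) => push_leaf(path.as_str(), LeafValue::Str(s.clone()), lookup, out),
        Value::U64(n) => push_leaf(path.as_str(), LeafValue::U64(*n), lookup, out),
        Value::Array(items) => {
            for item in items {
                flatten(item, path, lookup, out);
            }
        }
        Value::Object(fields) => {
            for (key, child) in fields {
                let len_before = path.len();
                if !path.is_empty() {
                    path.push('.');
                }
                path.push_str(key);
                flatten(child, path, lookup, out);
                // Restore the buffer so siblings start from the parent path.
                path.truncate(len_before);
            }
        }
    }
}
```

Reusing one path buffer avoids allocating a new string per nesting level; only emitted leaves pay for a copy of the path in this sketch.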

Doc Store

If we don't want to store flattened values in the docstore, we would need another method on Document, to fetch the filtered Value tree. Here we could provide another iterator helper StoredFieldsIter, which only returns stored values.

fn iter_stored_fields(&self, path_writer: &mut JsonPathWriter, schema: &Schema) -> impl Iterator<Item = (&str, Value)> {
    // get_iter() yields (&str, Value) pairs.
    let vals_iter = self.get_iter();
    StoredFieldsIter(vals_iter, path_writer, schema) // -> impl Iterator<Item = (&str, Value)>
}
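
One way to picture the "filtered Value tree" is a function that prunes every leaf whose path is not stored. The sketch below is an assumption about the intended behaviour, not the proposed `StoredFieldsIter` itself: `Value` is a hypothetical stand-in type, `is_stored` replaces the schema lookup, and dropping objects and arrays that end up empty is a choice made here for illustration.

```rust
/// Hypothetical minimal value tree (stands in for a JSON value type).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    U64(u64),
    Array(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Returns a copy of `value` that keeps only the leaves whose dotted path
/// satisfies `is_stored`. Objects and arrays left empty are dropped.
pub fn filter_stored(
    value: &Value,
    path: &mut String,
    is_stored: &dyn Fn(&str) -> bool,
) -> Option<Value> {
    match value {
        Value::Str(_) | Value::U64(_) => {
            if is_stored(path.as_str()) {
                Some(value.clone())
            } else {
                None
            }
        }
        Value::Array(items) => {
            let mut kept = Vec::new();
            for item in items {
                if let Some(v) = filter_stored(item, path, is_stored) {
                    kept.push(v);
                }
            }
            if kept.is_empty() { None } else { Some(Value::Array(kept)) }
        }
        Value::Object(fields) => {
            let mut kept = Vec::new();
            for (key, child) in fields {
                let len_before = path.len();
                if !path.is_empty() {
                    path.push('.');
                }
                path.push_str(key);
                if let Some(v) = filter_stored(child, path, is_stored) {
                    kept.push((key.clone(), v));
                }
                // Restore the buffer so siblings start from the parent path.
                path.truncate(len_before);
            }
            if kept.is_empty() { None } else { Some(Value::Object(kept)) }
        }
    }
}
```

The result keeps the nested shape of the original document, which is the difference from the flattened form used for indexing.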

Downsides

This moves some of the burden of correctness to the Document, which may be user provided. E.g. a user-provided document could return wrong field name + config combinations, leading to unexpected behaviour. The provided iterator helpers mostly mitigate this.

Further Considerations

Avoid repeatedly storing long nested field names in the term hashmap during indexing. We can replace the path with an unordered ID and replace it during serialization.

fn iter_fields_and_values(&self, path_writer: &mut JsonPathWriter, schema: &Schema, path_to_unordered_id: &mut PathToUnorderedID) -> impl Iterator<Item = (PathID, ConfigId, Value)>

PathToUnorderedID would be prepopulated with the statically defined fields from the schema. This would be really close to the current API (Field, Value), and would cause no performance regression for statically defined schemas.
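
The idea can be sketched as a small interner. The code below is a hypothetical illustration, not an existing tantivy type: IDs are handed out in first-seen order while indexing, and a remapping table to lexicographic order is computed once at serialization time. The method names `with_static_paths`, `get_or_insert` and `ordered_mapping` are made up for this example.

```rust
use std::collections::HashMap;

/// Hypothetical interner mapping a flattened path to a small integer ID
/// assigned in first-seen order (hence "unordered").
#[derive(Default)]
pub struct PathToUnorderedId {
    ids: HashMap<String, u32>,
}

impl PathToUnorderedId {
    /// Pre-populates the map with the statically configured paths,
    /// so they get stable IDs before any document is seen.
    pub fn with_static_paths(paths: &[&str]) -> Self {
        let mut map = Self::default();
        for path in paths {
            map.get_or_insert(path);
        }
        map
    }

    /// Returns the ID for `path`, assigning the next free ID if it is new.
    pub fn get_or_insert(&mut self, path: &str) -> u32 {
        if let Some(id) = self.ids.get(path) {
            return *id;
        }
        let id = self.ids.len() as u32;
        self.ids.insert(path.to_string(), id);
        id
    }

    /// Returns the paths in lexicographic order, plus a table that maps
    /// each unordered ID to its position in that order.
    pub fn ordered_mapping(&self) -> (Vec<String>, Vec<u32>) {
        let mut sorted: Vec<(&String, u32)> =
            self.ids.iter().map(|(path, id)| (path, *id)).collect();
        sorted.sort();
        let mut unordered_to_ordered = vec![0u32; sorted.len()];
        let mut paths = Vec::with_capacity(sorted.len());
        for (rank, (path, unordered_id)) in sorted.into_iter().enumerate() {
            unordered_to_ordered[unordered_id as usize] = rank as u32;
            paths.push(path.clone());
        }
        (paths, unordered_to_ordered)
    }
}
```

During indexing, terms would carry only the small unordered ID; the remapping table is applied when the term dictionary is written.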

(Related Issue: https://github.com/quickwit-oss/tantivy/issues/2015)

Datastructures

FastFields

Unchanged. We already store flattened fast field names.

DocStore

Unchanged, except we store flattened data.

Inverted Index

If the path is configured exactly, we will create an index for that path.

There will be a global catch-all index for all other fields, including nested ones.

Example:

{
    "product_name": "droopy t-shirt",
    "brand": "Droopy Apparel Co.",
    "attributes": {
        "color": ["red", "green", "white"],
        "size": 50
    }
}

let mut schema_builder = tantivy::schema::SchemaBuilder::new();
schema_builder.set_dynamic_schema(JsonObjectOptions{fast: true, indexed: true});
schema_builder.add_text_field("product_name", STRING);
schema_builder.build();

--> Inverted Indices (ignoring tokenization)
Global (Path + Value)
"brand" -> "Droopy Apparel Co."
"attributes.color" -> "red"
"attributes.color" -> "green"
"attributes.color" -> "white"
"attributes.size" -> 50
product_name  (Value)
"droopy t-shirt"
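
The routing shown above can be sketched with in-memory maps. This is a hypothetical illustration under simplifying assumptions (no tokenization, terms as plain strings, postings as `Vec<u32>`); `InvertedIndices` and its methods are not tantivy APIs.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical in-memory stand-in for the proposed index layout.
#[derive(Default)]
pub struct InvertedIndices {
    /// Paths that are configured exactly and get their own index.
    pub dedicated_paths: HashSet<String>,
    /// Dedicated indices: path -> term -> doc ids.
    pub dedicated: BTreeMap<String, BTreeMap<String, Vec<u32>>>,
    /// Global catch-all index keyed by (path, term).
    pub global: BTreeMap<(String, String), Vec<u32>>,
}

impl InvertedIndices {
    pub fn new(dedicated_paths: &[&str]) -> Self {
        InvertedIndices {
            dedicated_paths: dedicated_paths.iter().map(|p| p.to_string()).collect(),
            ..Default::default()
        }
    }

    /// Routes a term to the dedicated index of its path if one is
    /// configured, otherwise to the global (path + value) index.
    pub fn add_term(&mut self, doc_id: u32, path: &str, term: &str) {
        if self.dedicated_paths.contains(path) {
            self.dedicated
                .entry(path.to_string())
                .or_default()
                .entry(term.to_string())
                .or_default()
                .push(doc_id);
        } else {
            self.global
                .entry((path.to_string(), term.to_string()))
                .or_default()
                .push(doc_id);
        }
    }

    /// Looks up the doc ids for a (path, term) pair in the matching index.
    pub fn search(&self, path: &str, term: &str) -> &[u32] {
        let postings = if self.dedicated_paths.contains(path) {
            self.dedicated.get(path).and_then(|terms| terms.get(term))
        } else {
            self.global.get(&(path.to_string(), term.to_string()))
        };
        postings.map(Vec::as_slice).unwrap_or_default()
    }
}
```

Keying the global index by (path, term) keeps terms of the same path adjacent in sorted order, which is one reason a path prefix is a natural choice for a shared term dictionary.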

Further Considerations (Inverted Index)

It probably makes sense to keep the data unified, so we could put all the data in a global index. The main problem I see for quickwit would be the requirement to load the whole index without additional metadata. When partitioning the space on field names and creating multiple term dicts, this could work. Needs more evaluation.

Related issues:
https://github.com/quickwit-oss/quickwit/issues/3896
https://github.com/quickwit-oss/quickwit/issues/3607
https://github.com/quickwit-oss/tantivy/issues/2015

PSeitz, Oct 12 '23 16:10