PSeitz

Results 117 issues of PSeitz

Currently the layout and its behaviour (serialization etc.) for Document are provided by tantivy. Users of tantivy have to convert their structure into the tantivy Document. An alternative approach would...

For compatibility reasons, we always create a multi-value fast field index when creating a fast field on a string field. When the field is effectively single-valued, we may...

Currently the Date type is filtered as a valid fast field type, but this may be overly strict.

Exponential Unrolled List read_to_end in expull may consume a lot of memory. Since it is used by the posting list record, it contains all docids (+ optional positions, term frequencies) for one term,...

# Datasets For the [fast field codecs](https://github.com/quickwit-oss/tantivy/tree/main/fastfield_codecs) we need to have good datasets to test them. Ideally these are datasets which we would expect to be indexed in a search...

The histogram performance comparison between the generic solution from (https://github.com/quickwit-oss/tantivy/tree/main/src/aggregation) and the specialized histogram (https://github.com/quickwit-oss/tantivy/blob/main/src/collector/histogram_collector.rs) suggests there is some headroom for improvement. The bench collects 1_000_000 docs into 10_000 buckets....

https://docs.rs/tantivy/latest/tantivy/postings/struct.InvertedIndexSerializer.html links to https://fulmicoton.gitbooks.io/tantivy-doc/content/inverted-index.html, which contains outdated information, e.g. simdcomp. The site looks nice, but overall there are too many different resources; we should probably deprecate some and concentrate...

# Strategy to find the best encoder When serializing a fast field, there are in principle multiple possible encoders to choose from. It's important to find the best one in...
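
A minimal sketch of one possible selection strategy, assuming "best" means the smallest estimated serialized size. The two estimators, their names, and the header sizes (9 and 17 bytes) are hypothetical and chosen for illustration; they are not tantivy's actual codecs or API.

```rust
/// Number of bits needed to represent `x` (0 needs 0 bits).
fn bits_needed(x: u64) -> u64 {
    u64::from(64 - x.leading_zeros())
}

/// Bytes needed to store `num_values` values of `bits` bits each.
fn packed_bytes(num_values: usize, bits: u64) -> u64 {
    (num_values as u64 * bits + 7) / 8
}

/// Hypothetical estimator: store `value - min` with a fixed bit width.
/// Assumed header: 8 bytes for the minimum plus 1 byte for the bit width.
fn estimate_bitpacked(values: &[u64]) -> Option<u64> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    Some(9 + packed_bytes(values.len(), bits_needed(max - min)))
}

/// Hypothetical estimator: predict each value on the line from the first
/// to the last value and store only the deviation from that prediction.
/// Assumed header: 16 bytes for first and last plus 1 byte for the bit width.
fn estimate_linear(values: &[u64]) -> Option<u64> {
    if values.len() < 2 {
        return None; // a line needs two points
    }
    let first = values[0] as i128;
    let last = values[values.len() - 1] as i128;
    let steps = (values.len() - 1) as i128;
    let (lo, hi) = values
        .iter()
        .enumerate()
        .map(|(i, &v)| v as i128 - (first + (last - first) * i as i128 / steps))
        .fold((i128::MAX, i128::MIN), |(lo, hi), d| (lo.min(d), hi.max(d)));
    // Give up if the deviation range does not fit into 64 bits.
    let range = u64::try_from(hi - lo).ok()?;
    Some(17 + packed_bytes(values.len(), bits_needed(range)))
}

/// Ask every applicable estimator for a size and keep the smallest.
fn pick_codec(values: &[u64]) -> Option<(&'static str, u64)> {
    let candidates: [(&'static str, fn(&[u64]) -> Option<u64>); 2] = [
        ("bitpacked", estimate_bitpacked),
        ("linear", estimate_linear),
    ];
    candidates
        .iter()
        .filter_map(|(name, estimate)| estimate(values).map(|size| (*name, size)))
        .min_by_key(|&(_, size)| size)
}
```

In this sketch an evenly spaced column favours the linear estimator, while irregular values fall back to plain bitpacking.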

# Blocked term dictionaries Currently a term dictionary in a field is stored as a fst, which contains _all_ terms in their lexicographic order. In some scenarios it's better to...

We should consider moving to a capability-based API. Instead of checking against `INDEX_FORMAT_VERSION`, the current tantivy version would have a list of capabilities and the index is stored...
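
A rough sketch of what a capability check could look like, assuming the index records the set of capabilities it requires and the library compares that against the set it supports. The `Capability` variants and both functions are hypothetical names for illustration, not part of tantivy.

```rust
use std::collections::BTreeSet;

/// Hypothetical capability names, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Capability {
    BlockedTermDict,
    MultiValueFastField,
    DateFastField,
}

/// Capabilities this (hypothetical) library build knows how to read.
fn supported() -> BTreeSet<Capability> {
    BTreeSet::from([Capability::MultiValueFastField, Capability::DateFastField])
}

/// Capabilities the index requires but this build lacks.
/// An empty result means the index can be opened.
fn missing(required: &BTreeSet<Capability>) -> Vec<Capability> {
    let have = supported();
    let lacking: Vec<Capability> = required.difference(&have).copied().collect();
    lacking
}
```

Compared with a single version number, a set difference like this can report exactly which feature is unsupported instead of rejecting the whole index.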