Luke Gallagher
Add support for the [Common Index File Format](https://github.com/osirrc/ciff) (CIFF). This will likely depend on the more flexible field-indexing options described in #11.
When features are extracted, a file should be created that maps feature IDs to feature names. Currently not all features are extracted in unison, so a first step would be...
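As a rough illustration of the mapping file described above, the following sketch writes one `id<TAB>name` row per feature. The tab-separated layout, the 1-based IDs, the function name `write_feature_map`, and the feature names are all assumptions made for this example; none of them come from the project.

```python
import csv
import io


def write_feature_map(feature_names, out):
    """Write one `id<TAB>name` row per feature; IDs are 1-based positions."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for fid, name in enumerate(feature_names, start=1):
        writer.writerow([fid, name])


# Hypothetical feature names, for illustration only.
buf = io.StringIO()
write_feature_map(["bm25", "lm_dirichlet", "sdm"], buf)
feature_map_text = buf.getvalue()
```

Writing the map from the same ordered list that drives extraction would keep IDs and names from drifting apart, which is one reason to extract all features in a single pass first.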
A basic program to perform first stage retrieval. Efficiency is a non-goal. First stage efficiency can be found in software like [PISA](https://github.com/pisa-engine/pisa). The win here would be to use any...
Implement the sequential dependence (SD) part of [A Markov Random Field Model for Term Dependencies](https://ciir.cs.umass.edu/pubfiles/ir-387.pdf). The implementation is based on the one within Indri. **Statistics counting.** Relevant parts for finding...
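For orientation, the SD model scores three kinds of query features: single terms, adjacent term pairs matched in order, and adjacent term pairs matched within an unordered window. The sketch below only builds the corresponding Indri-style query string; it does not do the statistics counting this issue is about. The function name `sd_indri_query` is hypothetical, and the weights (0.85, 0.10, 0.05) and window size (8) are commonly used defaults assumed here, not values taken from this project.

```python
def sd_indri_query(terms, w_t=0.85, w_o=0.10, w_u=0.05, window=8):
    """Build an Indri-style sequential dependence query for `terms`.

    Weights and window size are assumed defaults, not project settings.
    """
    # Adjacent query term pairs: (t1, t2), (t2, t3), ...
    pairs = list(zip(terms, terms[1:]))
    if not pairs:
        # Zero or one term: there are no dependencies to model.
        return "#combine(" + " ".join(terms) + ")"
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in pairs)
    unordered = " ".join(f"#uw{window}({a} {b})" for a, b in pairs)
    return (
        f"#weight( {w_t} #combine({unigrams}) "
        f"{w_o} #combine({ordered}) "
        f"{w_u} #combine({unordered}) )"
    )


query = sd_indri_query(["markov", "random", "field"])
```

Each `#1(...)` and `#uwN(...)` operator corresponds to a collection statistic that has to be counted at extraction time.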
For tracking a number of issues related to pre-retrieval, document-independent features.

* [ ] `preret_csv` assumes unigram and bigram files are text
* [ ] `preret_csv` assumes user must...
It is not clear how much extra CPU time is spent on repeated calls to decompress a document. See the following inner loop in `extract_features`: https://github.com/rmit-ir/tesserae/blob/c565cda55765e8491cb184439d8fbb296aba5d4a/src/extract_features.cpp#L580 The `Document` class should take a...
Originally from [Büttcher et al](https://plg.uwaterloo.ca/~claclark/sigir2006_term_proximity.pdf). Modifications exist to make it usable for dynamic pruning scenarios:

* [Schenkel et al, SPIRE 07](http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf)
* [Broschart et al, TOIS 12](https://dl.acm.org/doi/abs/10.1145/2094072.2094077)

Other variations: *...
An internal version number representation is required so that the programs that use the index can identify whether or not they are compatible with the Tesserae index they are attempting...
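One possible shape for such a check, sketched under stated assumptions: the index begins with a small fixed header holding a magic tag and a major/minor version, and a reader accepts an index only when the major version matches and the minor version is not newer than what it supports. The header layout, the `TSRE` tag, the version numbers, and the compatibility rule are all hypothetical and are not part of the current index format.

```python
import struct

MAGIC = b"TSRE"            # hypothetical magic tag
SUPPORTED_VERSION = (1, 2)  # hypothetical (major, minor) this reader supports
_HEADER = struct.Struct("<4sHH")  # magic, major, minor (little-endian)


def write_header(major, minor):
    """Serialize the assumed index header."""
    return _HEADER.pack(MAGIC, major, minor)


def is_compatible(header_bytes, supported=SUPPORTED_VERSION):
    """Return True if a reader at `supported` can open this index."""
    if len(header_bytes) < _HEADER.size:
        raise ValueError("header too short")
    magic, major, minor = _HEADER.unpack(header_bytes[:_HEADER.size])
    if magic != MAGIC:
        raise ValueError("not an index file")
    # Same major version, and no minor features newer than the reader knows.
    return major == supported[0] and minor <= supported[1]
```

A program would call `is_compatible` before reading anything else from the index and exit with a clear message on failure.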
The current process for creating an index is rudimentary. A better alternative might be a single binary program that performs all of the details of constructing the index...
Currently the extracted fields are hard-coded and make assumptions about the dataset. Configuration of the fields to be processed at index time is needed.