typesense
[Feature request] multiple snippets from various parts of the document
We already discussed this in the slack channel but I'm creating this feature request in case others are interested as well.
Use case: In a long document there are 10 occurrences of "keyword" and I'd like to display all 10 snippets. Right now typesense returns just one snippet (the best matched one) for each document.
It is possible to implement this on the client side (by splitting the long document into paragraphs/sentences) but it would be useful to have typesense support this functionality via a flag.
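For anyone doing this on the client side in the meantime, here is a minimal sketch of pulling one snippet per occurrence out of a long text. The function name `keyword_snippets` and the `context` parameter are made up for illustration; nothing here is part of Typesense, and it still needs the whole field on the client.

```python
import re

def keyword_snippets(text, keyword, context=5):
    """Return one snippet per occurrence of `keyword`,
    with up to `context` whitespace-separated tokens on each side."""
    tokens = text.split()
    target = keyword.lower()
    snippets = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation so "keyword," still matches.
        if re.sub(r"^\W+|\W+$", "", token).lower() == target:
            start = max(0, i - context)
            end = min(len(tokens), i + context + 1)
            snippets.append(" ".join(tokens[start:end]))
    return snippets
```

This matches whole tokens only (no stemming, typo tolerance or prefix matching), so it will not line up exactly with what the search engine considers a hit.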
+1 For large fields, it is not really feasible to transfer the entire field over the wire just for the purpose of generating snippets.
Lucene, Solr and Elasticsearch all have some version of this. Shouldn't it be fairly straightforward to port?
One workaround for this would be to index the document as paragraphs within a string array field. This will return multiple matching snippets in the highlight. Would that help?
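A rough sketch of what that workaround could look like when preparing records, assuming paragraphs are separated by blank lines. The collection name, the field names (`title`, `paragraphs`) and the helper `to_record` are hypothetical; only the `string[]` field type comes from the suggestion above.

```python
import re

# Hypothetical schema: one record per document, paragraphs in a string array.
schema = {
    "name": "docs",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "paragraphs", "type": "string[]"},
    ],
}

def to_record(doc_id, title, body):
    """Split `body` on blank lines and store the pieces in one array field."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return {"id": doc_id, "title": title, "paragraphs": paragraphs}
```

The record stays one-per-document, but the split only works when the text actually contains blank-line paragraph breaks.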
While that helps, it has a few shortcomings. a. In documents that have no paragraphs, it's not obvious where to split. b. Because of (a), it is not possible to ensure a fixed snippet size (e.g. retrieving the necessary n tokens before/after the query match might fail because the split happened earlier). c. It introduces an extra layer of abstraction (which paragraphs belong to which document), whereas one record per document is straightforward.
Overall it would be cleaner/faster if this were supported natively via a multiple_matches_per_record or similar flag.
+1 I'm currently handling snippets client-side, but this approach is crude and, more importantly, it's a major performance bottleneck.
As others have noted it's not always possible to use a string array when the indexed docs aren't easily divisible into smaller sections.
The fully highlighted fields can be very large, and often there are only a few hits across a large piece of text, which results in a lot of waste.
An added flag to choose the number of snippets returned, keeping the current default of 1, would make for much simpler and faster search across large docs where highlight context is important.
Just a bump for this.. Edit: I think this issue is about the same thing: https://github.com/typesense/typesense/issues/527
+1
Also running into this, would be very helpful to have multiple snippets returned from a single text field.
We're still very interested in seeing this.
@bnfd I agree that this is a requirement for some use cases, and definitely want to get to the right solution.
You've made some good points against splitting of a large document into multiple parts. However, one significant downside of storing a large document on disk and retrieving it for highlighting is that it's very I/O bound and slow.
I'm wondering if we can do something to mitigate the disadvantages of the splitting approach. For example, if we split the documents into overlapping chunks automatically, would that help in addressing your concerns?
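To make the overlapping-chunks idea concrete, here is a small sketch of one way it could work on whitespace tokens. This is only an illustration of the idea, not how Typesense does or would implement it; `overlapping_chunks`, `size` and `overlap` are made-up names.

```python
def overlapping_chunks(text, size=50, overlap=10):
    """Split `text` into chunks of up to `size` tokens, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        # Stop once a chunk has reached the end of the text.
        if start + size >= len(tokens):
            break
    return chunks
```

With `overlap` at least as large as the desired snippet context, a match near a chunk boundary still has its surrounding tokens available in the neighbouring chunk, which speaks to point (b) above.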
Yes, that would help.
@kishorenc Do you have an ETA for this?