skywalking
[Index] Support full text searching
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
We need to introduce full-text search so that endpoints can be found when their number exceeds millions.
BanyanDB already supports an inverted index. We should add a flag to IndexRule that enables text segmentation. https://github.com/blevesearch/segment is a potential candidate.
We should also introduce two new binary operations for query criteria. Their names should be `Match` and `MatchNot`, and both accept a text-matching expression.
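As a rough illustration of how the two proposed operations could behave, here is a minimal Go sketch. Only the operation names come from the proposal above (the second one is written here as `MatchNot`); the `BinaryOp` type, the `segment` helper, and the "every term must be present" semantics are assumptions made for the example, not BanyanDB's actual API or the behaviour of blevesearch/segment.

```go
package main

import (
	"fmt"
	"strings"
)

// BinaryOp is a hypothetical enum; only the two operation names come from the proposal.
type BinaryOp int

const (
	Match BinaryOp = iota
	MatchNot
)

// segment is a stand-in for real text segmentation: it lowercases the input
// and splits on anything that is not an ASCII letter or digit.
func segment(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9')
	})
}

// Eval reports whether value satisfies "value <op> expr".
// Assumed semantics: Match holds when every term of expr occurs among the
// terms of value; MatchNot is its negation.
func Eval(op BinaryOp, value, expr string) bool {
	terms := make(map[string]struct{})
	for _, t := range segment(value) {
		terms[t] = struct{}{}
	}
	matched := true
	for _, t := range segment(expr) {
		if _, ok := terms[t]; !ok {
			matched = false
			break
		}
	}
	if op == MatchNot {
		return !matched
	}
	return matched
}

func main() {
	endpoint := "GET::/root/product/order"
	fmt.Println(Eval(Match, endpoint, "product"))
	fmt.Println(Eval(MatchNot, endpoint, "product"))
	fmt.Println(Eval(Match, endpoint, "payment"))
}
```

In a real implementation the segmented terms would be looked up in the inverted index rather than recomputed per value; the sketch only shows the intended truth table.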
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
After some investigation, https://github.com/blugelabs/bluge looks like an ideal option to support this requirement. Bluge supports several analyzers and data types that cover our use cases. Furthermore, we could replace the current primitive inverted index with it. This alternative is more lightweight and leverages mmap to achieve high throughput.
However, its writer is a batch-style interface, so we have to adapt the current index pipeline to support batch mode. We have found that the index pipeline is a bottleneck in the write path: a single goroutine blocks many writing goroutines. The new pipeline implementation would introduce a worker pool and batch mode to improve performance.
@wu-sheng @wankai123 Would you please provide some use cases to verify this feature? I want to add them to the UTs.
What kinds of cases do you need? What text should be supported for text searching?
Typically, there are the following cases:
- `GET::/root/product/order` should be searchable through `root`, `product`, `order`, `/root/product`, `/product/order`, `GET`, or `GET::/root`.
- `org.apache.skywalking.test.service.OrderService.order` (an example of Dubbo or gRPC) should be searchable by `org`, `org.apache`, `org.apach....OrderService`, `OrderService.order`, etc.
`.`, `/`, whitespace, `,`, `:`, new line, `;`, `'`, and `"` should be considered typical stop words. The list includes some characters that are useful only in the log search case. But I remember BanyanDB does not support text-based queries on log raw text, right? If so, usually only `.`, `/`, whitespace, `,`, and `:` matter.
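To make the endpoint cases concrete, here is a small Go sketch that splits names on the five separators that matter for endpoints and treats a query as a match when its tokens appear as a contiguous run in the endpoint's tokens. `Tokenize` and `MatchesPhrase` are hypothetical helpers, and the contiguous-run (phrase) semantics is an assumption about how queries such as `/root/product` should behave. It is not how Bluge or BanyanDB evaluates queries, and it does not cover the elided `org.apach....OrderService` form.

```go
package main

import (
	"fmt"
	"strings"
)

// isSeparator covers the separators relevant to endpoint names:
// '.', '/', ',', ':' and whitespace.
func isSeparator(r rune) bool {
	switch r {
	case '.', '/', ',', ':', ' ', '\t', '\n', '\r':
		return true
	}
	return false
}

// Tokenize splits s on the separators and drops empty tokens.
func Tokenize(s string) []string {
	return strings.FieldsFunc(s, isSeparator)
}

// MatchesPhrase reports whether the tokens of query occur as a contiguous
// run within the tokens of endpoint. Matching is case-sensitive here.
func MatchesPhrase(endpoint, query string) bool {
	doc, q := Tokenize(endpoint), Tokenize(query)
	if len(q) == 0 {
		return false
	}
	for i := 0; i+len(q) <= len(doc); i++ {
		found := true
		for j := range q {
			if doc[i+j] != q[j] {
				found = false
				break
			}
		}
		if found {
			return true
		}
	}
	return false
}

func main() {
	endpoint := "GET::/root/product/order"
	queries := []string{"root", "/root/product", "/product/order", "GET", "GET::/root", "/root/order"}
	for _, q := range queries {
		fmt.Printf("%-16s %v\n", q, MatchesPhrase(endpoint, q))
	}
}
```

With these rules every query from the first case matches, while `/root/order` does not because `root` and `order` are not adjacent in the endpoint.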
> But I remember BanyanDB is not supporting text-based query for log rawtext, right?
We could create an index rule with a simple analyzer on the log raw text (its data type should be string instead of binary) to make it searchable. The simple analyzer breaks text at any non-letter character.
If you want to add this capability, that is good. @lujiajing1126 would need to adjust the banyand storage implementation to declare that the log query supports text search; otherwise, the UI would keep blocking the query from end users, as it does right now.