
[Index] Support full text searching

Open hanahmily opened this issue 2 years ago • 5 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

We need to introduce full-text search so that endpoints can be found when their number runs into the millions.

BanyanDB already supports an inverted index. We should add a flag to IndexRule to enable text segmentation. https://github.com/blevesearch/segment is a potential candidate.

We should also introduce two new binary operations to the query criteria. Their names should be Match and MatchNot, and each accepts a text-matching expression.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

hanahmily avatar Aug 03 '22 03:08 hanahmily

After some investigation, https://github.com/blugelabs/bluge looks like an ideal option to support this requirement. Bluge supports several analyzers and data types that cover our use cases. Furthermore, we could replace the current primitive inverted index with it. Bluge is also more lightweight and leverages mmap to achieve high throughput.

However, its writer is a batch-style interface, so we have to tweak the current index pipeline to support batch mode. We have found that the index pipeline is a bottleneck in the write path: a single goroutine blocks many writing goroutines. The new pipeline implementation would introduce a worker pool and batch mode to improve performance.

hanahmily avatar Aug 22 '22 23:08 hanahmily

@wu-sheng @wankai123 Would you please provide some use cases to verify this feature? I want to add them to the UTs.

hanahmily avatar Aug 25 '22 08:08 hanahmily

What kinds of cases do you need? What text should be supported for text searching?

Typically, an endpoint such as GET::/root/product/order should be searchable by root, product, order, /root/product, /product/order, GET, or GET::/root.

org.apache.skywalking.test.service.OrderService.order (an example of Dubbo or gRPC) should be searchable by org, org.apache, org.apach....OrderService, OrderService.order, etc.

., /, whitespace, ,, :, new line, ;, ', " should be considered as typical stop words, i.e. the characters at which text is split. The list includes some chars which are useful in the log search case only. But I remember BanyanDB is not supporting text-based query for log rawtext, right? If so, usually only ., /, whitespace, ,, : matter.

wu-sheng avatar Aug 25 '22 08:08 wu-sheng

But I remember BanyanDB is not supporting text-based query for log rawtext, right?

We could create an index rule with a simple analyzer on the log raw text (its data type should be string instead of binary) to make it searchable. The simple analyzer breaks text at any non-letter character.

hanahmily avatar Aug 25 '22 08:08 hanahmily

If you want to add this capability, that is good. @lujiajing1126 would need to adjust the banyand storage implementation to declare that the log query supports text search; otherwise, the UI would keep blocking the query from end users, as it does right now.

wu-sheng avatar Aug 25 '22 09:08 wu-sheng