Search taking a lot of time using clg binary

Open bb-rajakarthik opened this issue 1 year ago • 3 comments

Bug

We are using CLP for compressing logs generated by our Kubernetes cluster which are in JSON format. A sample log is given below:

{ "log_time": "2023-08-29T13:55:09.477456Z", "stream": "stdout", "time": "2023-08-29T13:55:09.477456564Z", "@timestamp": "2023-08-29T19:25:09.477+05:30", "@version": "1", "message": " Method: POST;Root=1-64edf8bd-5c762a676349ee71616bb687 , Request Body : {"orders":[{"order_type":"normal","external_reference_id":"69426","items":[{"offset_in_minutes":"721","quantity":"1","external_product_id":"225090"}]}]}", "level": "INFO", "level_value": 20000, "request_id": "6f3f3651-a22b-42a0-b5fe-412d2167c5ca", "kubernetes_docker_id": "caa9102a169a1495e5790cb2c17cb21d0a279ffc50d802d413938870ba59c7c0", }

When we use the clg binary to search the generated archive, each query takes a long time to process (around 25-30s on average). We only search by request_id and namespace name, as shown below. This is not feasible for us: we expect a search of one archive to take about 4-5s.

For example, to search for the request_id in the above log, we generally use the following queries:

  1. info6f3f3651-a22b-42a0-b5fe-412d2167c5ca*
  2. 6f3f3651-a22b-42a0-b5fe-412d2167c5ca

Our archives are about 40MB each on average. The sizes of the files and folders inside an archive are as follows:

  1. var.dict: 18.2MB
  2. /s: 16.8MB
  3. var.segindex: 2.6MB
  4. metadata.db: 1.9MB
  5. logtype.dict: 736KB
  6. logtype.segindex: 36KB
  7. metadata: 4KB
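As a rough sanity check on the size breakdown above, here is a small sketch (plain Python, not part of CLP) that totals the reported component sizes and prints each component's share of the archive. The conversion of 1 MB = 1024 KB is an assumption; the sizes themselves are the ones listed above.

```python
# Component sizes as reported above, in MB.
# ASSUMPTION: 1 MB = 1024 KB for the entries that were reported in KB.
components_mb = {
    "var.dict": 18.2,
    "/s": 16.8,
    "var.segindex": 2.6,
    "metadata.db": 1.9,
    "logtype.dict": 736 / 1024,
    "logtype.segindex": 36 / 1024,
    "metadata": 4 / 1024,
}

total_mb = sum(components_mb.values())


def share(name: str) -> float:
    """Return the percentage of the archive taken up by one component."""
    return 100.0 * components_mb[name] / total_mb


# Print components from largest to smallest with their share of the total.
for name, size in sorted(components_mb.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {size:8.3f} MB  {share(name):5.1f}%")
print(f"{'total':18s} {total_mb:8.3f} MB")
```

By these numbers the variable dictionary (var.dict) is the largest single component, at roughly 45% of the archive, while the log type dictionary is under 2%.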

Is this the correct way to write search queries, in the sense that the search will use the log type and other dictionaries to go through the archives efficiently? As mentioned earlier, searching a single archive can take more than 30s, which is not feasible for us. We expect the search time to be around 4-5s per archive, not more. Please let me know if I am doing something wrong, for example if the search queries are inefficient.
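To make the timings above easier to reproduce, here is a minimal sketch of a harness for measuring per-query wall-clock time. It is not part of CLP. The helper names `time_command` and `build_clg_command` are hypothetical, and the argument order used for clg (archive directory first, then the wildcard query) is an assumption that should be checked against the help output of the clg build in use.

```python
import subprocess
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], runs: int = 3) -> List[float]:
    """Run cmd several times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded so that terminal I/O does not distort the timing.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


def build_clg_command(clg_path: str, archive_dir: str, query: str) -> List[str]:
    """Build the argument list for one search.

    ASSUMPTION: clg is invoked as `clg <archive-dir> <wildcard-query>`.
    Adjust this if the binary expects different arguments.
    """
    return [clg_path, archive_dir, query]
```

For example, `time_command(build_clg_command("./clg", "archives/", "6f3f3651-a22b-42a0-b5fe-412d2167c5ca"))` would return three durations for the second query above. Here `./clg` and `archives/` are placeholder paths. Running each query more than once helps separate cold-cache from warm-cache timings.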

CLP version

3a20c0d2bb831de7fa267d57d187dab8c3f092c1

Environment

Ubuntu 20.04; EC2 instance type: m5.8xlarge

Reproduction steps

NA

bb-rajakarthik · Aug 29 '23 14:08