
Ingesting high volume of log data

Open mr-karan opened this issue 2 years ago • 4 comments

I've been stress-testing and experimenting with Parseable, locally on my machine.

I have the following setup:

`docker-compose.yml`:

```yaml
version: '3.8'
services:
  parseable:
    image: parseable/parseable:latest
    command: parseable local-store
    ports:
      - "8000:8000"
    env_file:
      - parseable-env
    volumes:
      - /parseable/staging:/staging
      - /parseable/data:/data
```

`parseable-env`:

```
P_ADDR=0.0.0.0:8000
P_USERNAME=admin
P_PASSWORD=admin
P_STAGING_DIR=/staging
P_FS_DIR=/data
```
Vector config:

```toml
[api]
enabled = true
address = "0.0.0.0:8686"

[sources.demo]
type = "demo_logs"
format = "json"
interval = 0

[transforms.msg_parser]
type = "remap"
inputs = ["demo"]
source = '''
. = parse_json!(.message)
'''

[sinks.parseable]
type = "http"
method = "post"
compression = "gzip"
inputs = ["msg_parser"]
uri = "http://localhost:8000/api/v1/ingest"

[sinks.parseable.batch]
max_bytes = 10485760
max_events = 1000
timeout_secs = 10

[sinks.parseable.encoding]
codec = "json"

[sinks.parseable.auth]
strategy = "basic"
user = "admin"
password = "admin"

[sinks.parseable.request.headers]
X-P-Stream = "vectordemo"

[sinks.parseable.healthcheck]
enabled = true
path = "http://localhost:8000/api/v1/liveness"
```
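For reference, the per-batch work that the `msg_parser` transform and the `parseable` sink do can be sketched with just the Python standard library (this is an illustrative sketch, not how Vector is implemented; the stream name, credentials, and endpoint are the ones from the config above):

```python
import base64
import gzip
import json
import urllib.request

def build_ingest_request(raw_events,
                         url="http://localhost:8000/api/v1/ingest",
                         stream="vectordemo", user="admin", password="admin"):
    """Mimic the pipeline above: parse each event's `message` field as JSON
    (the msg_parser remap), then build one gzip-compressed batch POST
    (the parseable HTTP sink)."""
    # transforms.msg_parser: . = parse_json!(.message)
    parsed = [json.loads(e["message"]) for e in raw_events]
    # sinks.parseable: codec = "json", compression = "gzip"
    body = gzip.compress(json.dumps(parsed).encode())
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Content-Type": "application/json",
        "Content-Encoding": "gzip",        # compression = "gzip"
        "X-P-Stream": stream,              # routes the batch to this stream
        "Authorization": f"Basic {auth}",  # strategy = "basic"
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# With a running Parseable instance:
# urllib.request.urlopen(build_ingest_request([{"message": '{"status": "403"}'}]))
```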

With the above Vector pipeline, I generated a fair bit of high throughput on the ingest API: about 20mn events in under 5 minutes, i.e. roughly 67k events/sec :rocket: (ingestion is super fast :))

These are dummy logs. A sample is attached:

```json
[
  {
    "bytes": 48822,
    "datetime": "13/Sep/2023:19:26:00",
    "host": "236.17.206.248",
    "method": "DELETE",
    "p_metadata": "",
    "p_tags": "",
    "p_timestamp": "2023-09-13T13:56:00.036",
    "protocol": "HTTP/1.1",
    "referer": "https://for.org/secret-info/open-sesame",
    "request": "/controller/setup",
    "status": "403",
    "user-identifier": "shaneIxD"
  }
]
```

I have a couple of questions (please let me know if you want me to create separate issues):

  1. An error related to reading these temp files:

```
vl-demo-parseable-1  | [2023-09-13T13:42:07Z ERROR datafusion::physical_plan::sorts::sort] Failure while reading spill file: NamedTempFile("/tmp/.tmpgiQ7HE/.tmpTrOXFk"). Error: Execution error: channel closed
```

With the above setup, I am able to reproduce it consistently. Let me know if any more logs/config is needed from my end.

  2. The pagination count is always "200", and I don't see any way to view the oldest log in the system. Is this some hardcoded limit?

  3. Are `p_metadata` and `p_tags` mandatory fields?

Of course, the optimal strategy is to use a separate server for this kind of benchmark testing, but I've also been experimenting with the API/UI, hence running these tests locally. Is there anything else I should keep in mind when ingesting logs at high throughput?

mr-karan avatar Sep 13 '23 14:09 mr-karan

Thanks @mr-karan . This is very cool :)

We'll get back shortly

nitisht avatar Sep 13 '23 14:09 nitisht

@mr-karan

```
vl-demo-parseable-1 | [2023-09-13T13:42:07Z ERROR datafusion::physical_plan::sorts::sort] Failure while reading spill file: NamedTempFile("/tmp/.tmpgiQ7HE/.tmpTrOXFk"). Error: Execution error: channel closed
```

This seems to be a DataFusion issue. I will try to replicate it and report upstream, but do you remember when this error occurred? Were you using the logs page or the query page?

> I don't see any way to see the oldest log in the system. Is this some hardcoded limit?

Currently the logs page is latest-first. So even if the timeframe is large, there isn't a clear way to navigate to the oldest logs other than manually shifting the time range further back. We will roll out a way around this limitation soon.
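In the meantime, one possible workaround is to fetch the oldest records directly over the query API with an ascending sort on `p_timestamp`. This is a sketch assuming a `/api/v1/query` endpoint that accepts a JSON body with `query`/`startTime`/`endTime` fields, as in Parseable's HTTP API; check the exact shape against the version you're running:

```python
import base64
import json
import urllib.request

def build_oldest_logs_query(stream="vectordemo",
                            url="http://localhost:8000/api/v1/query",
                            user="admin", password="admin", limit=100):
    """Build a POST asking for the oldest `limit` events in the stream,
    sorted ascending on the server-assigned p_timestamp column."""
    payload = {
        "query": f"SELECT * FROM {stream} ORDER BY p_timestamp ASC LIMIT {limit}",
        # Adjust the window to cover the start of your retention period.
        "startTime": "2023-09-13T00:00:00+00:00",
        "endTime": "2023-09-13T23:59:59+00:00",
    }
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {auth}"},
        method="POST",
    )

# resp = urllib.request.urlopen(build_oldest_logs_query())  # with Parseable running
```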

> Are `p_metadata` and `p_tags` mandatory fields?

No

trueleo avatar Sep 13 '23 15:09 trueleo

> This seems to be a DataFusion issue. I will try to replicate it and report upstream, but do you remember when this error occurred? Were you using the logs page or the query page?

It occurs when I query a large amount of data, specifically on the logs page.

mr-karan avatar Sep 13 '23 15:09 mr-karan

@mr-karan We released v0.7.0. https://github.com/parseablehq/parseable/releases/tag/v0.7.0

You can give it a try. It should fix the log page issues you were having.

trueleo avatar Sep 19 '23 15:09 trueleo

Closing this; we've released a distributed version that can be scaled horizontally as load increases and then scaled down to zero.

nitisht avatar Apr 20 '24 07:04 nitisht