Ingesting a high volume of log data
I've been stress-testing and experimenting with Parseable locally on my machine.
I have the following setup:
```yaml
version: '3.8'

services:
  parseable:
    image: parseable/parseable:latest
    command: parseable local-store
    ports:
      - "8000:8000"
    env_file:
      - parseable-env
    volumes:
      - /parseable/staging:/staging
      - /parseable/data:/data
```
Contents of `parseable-env`:

```env
P_ADDR=0.0.0.0:8000
P_USERNAME=admin
P_PASSWORD=admin
P_STAGING_DIR=/staging
P_FS_DIR=/data
```
And the Vector configuration:

```toml
[api]
enabled = true
address = "0.0.0.0:8686"

[sources.demo]
type = "demo_logs"
format = "json"
interval = 0

[transforms.msg_parser]
type = "remap"
inputs = ["demo"]
source = '''
. = parse_json!(.message)
'''

[sinks.parseable]
type = "http"
method = "post"
compression = "gzip"
inputs = ["msg_parser"]
uri = "http://localhost:8000/api/v1/ingest"

[sinks.parseable.batch]
max_bytes = 10485760
max_events = 1000
timeout_secs = 10

[sinks.parseable.encoding]
codec = "json"

[sinks.parseable.auth]
strategy = "basic"
user = "admin"
password = "admin"

[sinks.parseable.request.headers]
X-P-Stream = "vectordemo"

[sinks.parseable.healthcheck]
enabled = true
path = "http://localhost:8000/api/v1/liveness"
```
With the above Vector pipeline, I generated a fair bit of throughput on the ingest API: about 20 million events in under 5 minutes, i.e. upwards of ~65k events/second :rocket: (ingesting is super fast :))
These are dummy logs. A sample is attached:
```json
[
  {
    "bytes": 48822,
    "datetime": "13/Sep/2023:19:26:00",
    "host": "236.17.206.248",
    "method": "DELETE",
    "p_metadata": "",
    "p_tags": "",
    "p_timestamp": "2023-09-13T13:56:00.036",
    "protocol": "HTTP/1.1",
    "referer": "https://for.org/secret-info/open-sesame",
    "request": "/controller/setup",
    "status": "403",
    "user-identifier": "shaneIxD"
  }
]
```
I have a couple of questions (please let me know if you want me to create separate issues):
- An error related to reading temp files:

  ```
  vl-demo-parseable-1 | [2023-09-13T13:42:07Z ERROR datafusion::physical_plan::sorts::sort] Failure while reading spill file: NamedTempFile("/tmp/.tmpgiQ7HE/.tmpTrOXFk"). Error: Execution error: channel closed
  ```

  With the above setup, I am able to reproduce it consistently. Let me know if any more logs/config are needed from my end.
- The pagination count is always "200", and I don't see any way to view the oldest log in the system. Is this a hardcoded limit?
- Are `p_metadata` and `p_tags` mandatory fields?
Of course, the optimal strategy would be to use a separate server for this kind of benchmarking, but I've been experimenting with the API/UI as well, hence running these tests locally. Is there anything else I should keep in mind when ingesting logs at high throughput?
Thanks @mr-karan. This is very cool :)
We'll get back to you shortly.
@mr-karan

> vl-demo-parseable-1 | [2023-09-13T13:42:07Z ERROR datafusion::physical_plan::sorts::sort] Failure while reading spill file: NamedTempFile("/tmp/.tmpgiQ7HE/.tmpTrOXFk"). Error: Execution error: channel closed

This seems to be a DataFusion issue. I will try to replicate it and report upstream, but do you remember when this error occurred? Were you using the logs page or the query page?
> I don't see any way to view the oldest log in the system. Is this a hardcoded limit?

Currently the logs page is latest-first, so even if the time frame is large there isn't a clear way to navigate to the oldest logs other than manually changing the time frame to an older range. We will roll out a way around this limitation soon.
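In the meantime, a rough workaround is to query the stream directly and sort ascending on `p_timestamp`. A minimal sketch, assuming Parseable's SQL query endpoint at `/api/v1/query` and the `vectordemo` stream from your Vector config (verify the endpoint and body shape against the docs for your version):

```python
import requests

# Rough sketch: fetch the oldest events from the "vectordemo" stream by
# ordering on p_timestamp ascending. The /api/v1/query endpoint and request
# body shape are assumptions -- check the docs for the version you run.
PARSEABLE_URL = "http://localhost:8000"
AUTH = ("admin", "admin")  # same credentials as parseable-env

payload = {
    "query": "SELECT * FROM vectordemo ORDER BY p_timestamp ASC LIMIT 100",
    "startTime": "2023-09-13T00:00:00.000Z",  # widen the window as needed
    "endTime": "2023-09-14T00:00:00.000Z",
}

resp = requests.post(f"{PARSEABLE_URL}/api/v1/query", json=payload, auth=AUTH)
resp.raise_for_status()
for event in resp.json():
    print(event.get("p_timestamp"), event.get("method"), event.get("request"))
```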
> Are `p_metadata` and `p_tags` mandatory fields?

No.
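For instance, a direct ingest call that omits both fields should be accepted. Here's a minimal sketch reusing the endpoint, `X-P-Stream` header, and credentials from your Vector sink; the event body itself is just an illustrative dummy record:

```python
import requests

# Sketch of a direct ingest call without p_metadata / p_tags.
# Endpoint, stream header, and credentials come from the Vector sink config
# above; the event is a made-up record shaped like the sample log.
resp = requests.post(
    "http://localhost:8000/api/v1/ingest",
    headers={"X-P-Stream": "vectordemo"},
    auth=("admin", "admin"),
    json=[
        {
            "host": "236.17.206.248",
            "method": "DELETE",
            "request": "/controller/setup",
            "status": "403",
            "bytes": 48822,
        }
    ],
)
print(resp.status_code)  # expect a 2xx response on success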
> This seems to be a DataFusion issue. I will try to replicate it and report upstream, but do you remember when this error occurred? Were you using the logs page or the query page?

It occurs when I query a large amount of data, specifically in the logs page.
@mr-karan We released v0.7.0. https://github.com/parseablehq/parseable/releases/tag/v0.7.0
You can give it a try. It should fix the log page issues you were having.
Closing this. We're releasing a distributed version that can be scaled horizontally as load increases and then scaled down to zero.