"pthread_create failed: Resource temporarily unavailable" during Grafana dashboard load
tl;dr
Executing the Zed queries that populate a Grafana dashboard against an S3-based Zed lake service can crash zed serve. The opening text of the error dump:
```
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
```
Details
Repro is with Zed commit f497079.
This one is kind of complex to reproduce since it's effectively a production use case with many layers involved:
- Grafana
- The grafana-zed-datasource
- The zed-js client
- The Zed lake service
However, since it all ends in a zed serve crash that I wasn't trying to trigger, I'll start by describing the repro and sharing the dump details; if necessary, I can then try to craft a simpler test that exhibits the behaviors that trigger the crash.
The repro involves our "Autoperf" metrics for tracking performance. The metric data is ultimately stored in an S3 bucket, and I run a Zed service locally on my MacBook with the bucket as its lake directory.
```
$ zed -version
Version: v1.16.0-6-gf4970799
$ zed -lake s3://brimdataio-lakes-us-east-2/shasta.lake.brimdata.io serve
```
The config for my Zed datasource plugin for Grafana remains at its default setting of http://localhost:9867.
As shown in the attached video, when I access the dashboard, Grafana begins filling in the panels by executing the Zed queries that populate them. Grafana is in some ways a black box, but I've found that scrolling down in the large dashboard seems to trigger loading many panels at once (staying near the top might be a gentler test since it would load only the first few panels, but now that I've found the bug I'm intentionally being more torturous). Because of the WAN latency from my location in San Francisco to the S3 bucket in Ohio, I'm accustomed to the panels taking a long time to load.

What I've observed is that sometimes many of the later panels fail to load and are shown with a red "X". As a user would likely do at this point, I refresh to try loading the dashboard again, hoping it will work on the second try. However, as shown here, after another long delay a large error dump appears on the zed serve console as it exits.
https://github.com/brimdata/zed/assets/5934157/669d6c6d-1416-4855-932f-9d4f6d02b46c
The complete error dump is large, so I've attached it here: zed-serve-log.txt
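For what it's worth, the opening line of the dump is the Go runtime (via cgo) reporting that the OS refused to create another thread, which typically means the process hit a per-process or runtime thread limit. As a minimal sketch, entirely separate from the Zed codebase, the following program forces one OS thread per goroutine and eventually fails with this same class of error (whether you see the OS-level pthread_create failure or Go's own 10000-thread cap message first depends on which limit is lower):

```go
// Sketch: exhaust OS threads by pinning each goroutine to its own
// thread and blocking. runtime.LockOSThread dedicates the current OS
// thread to the calling goroutine, so blocking in select{} keeps that
// thread occupied forever. Run with a low thread limit to see a
// thread-creation failure quickly.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	for i := 0; ; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this goroutine
			select {}              // block forever, holding the thread
		}()
		if i%1000 == 0 {
			fmt.Printf("spawned %d thread-pinned goroutines\n", i)
		}
	}
}
```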
This issue does not seem to repro if I copy the lake contents to my MacBook and serve the lake off my local SSD instead of S3.
While I can't make sense of the error dump, given what I know of the setup, I have some theories. I first saw this crash right after I added three panels to plot metrics for a newly added test. Given the high latency between my MacBook and the S3 bucket, my guess is that too many query requests are "in flight" at once and some resource is being exhausted.
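If it would help triage, a crude way to exercise that theory without Grafana might be a standalone load generator that fires a burst of concurrent queries at the service. A sketch along those lines follows; the /query endpoint, payload shape, pool name, and concurrency count are my assumptions and would need adjusting to match the actual lake API:

```go
// Sketch of a Grafana-free load generator: fire a burst of concurrent
// queries at the lake service, similar to a dashboard loading many
// panels at once. Endpoint, payload, and pool name are assumptions.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const concurrency = 100 // hypothetical "panel count"; raise to stress harder
	payload := []byte(`{"query": "from mypool | count()"}`) // hypothetical pool/query

	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := http.Post("http://localhost:9867/query",
				"application/json", bytes.NewReader(payload))
			if err != nil {
				fmt.Printf("query %d failed: %v\n", n, err)
				return
			}
			defer resp.Body.Close()
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			fmt.Printf("query %d: %s\n", n, resp.Status)
		}(i)
	}
	wg.Wait()
}
```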
I recognize that this setup is arguably abnormal. That said, if I'm right that the problem is too much concurrent work, there may still be something legitimate to learn here: the same kind of crash could presumably occur if many users converged on a lake service and each simultaneously initiated a smaller number of queries in parallel.
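And if concurrency does turn out to be the culprit, a conventional server-side mitigation is to bound the number of queries executing at once, e.g., with a counting semaphore, so that a burst queues rather than exhausting threads or file descriptors. This is only a sketch of the general technique, not a claim about how zed serve is actually structured:

```go
// Sketch of bounding concurrent query execution with a weighted
// semaphore (golang.org/x/sync/semaphore). A burst of requests waits
// for a slot instead of all running at once. Names are illustrative.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/semaphore"
)

var querySem = semaphore.NewWeighted(16) // at most 16 queries in flight (illustrative)

func runQuery(ctx context.Context, q string) error {
	if err := querySem.Acquire(ctx, 1); err != nil {
		return err // context canceled while waiting for a slot
	}
	defer querySem.Release(1)
	time.Sleep(100 * time.Millisecond) // stand-in for actual query execution
	return nil
}

func main() {
	ctx := context.Background()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if err := runQuery(ctx, "from mypool | count()"); err != nil {
				fmt.Printf("query %d: %v\n", n, err)
			}
		}(i)
	}
	wg.Wait()
}
```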