astro-sdk
Add benchmarking config
Description
What is the current behavior?
- We don't have the complete benchmarking config on the main branch.
- As per the earlier discussion, we need to run the benchmarks in parallel.
- Because of a memory leak, we need to increase the instance size in the benchmarking config - https://github.com/astronomer/astro-sdk/issues/1058
related: #955
What is the new behavior?
- Added the complete grid config
- Added multiple configs to be used in parallel jobs
- Increased the machine size to n2-standard-16 (see the sketch after this list)
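For context, a minimal sketch of how the larger machine size could be provisioned on the GKE side. The pool name and zone are placeholders, and the cluster name is inferred from the GKE node name in the eviction report further down this thread; this is not necessarily the exact command used:

```sh
# Hypothetical: provision a node pool with the larger machine type for
# the benchmark jobs. Pool name and zone are placeholders; the cluster
# name is inferred from the node name in the pod description below.
gcloud container node-pools create benchmark-pool \
    --cluster astro-sdk-benchmark \
    --zone us-central1-a \
    --machine-type n2-standard-16 \
    --num-nodes 1
```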
Does this introduce a breaking change?
Nope
Checklist
- [ ] Created tests that fail without the change (if possible)
- [ ] Extended the README / documentation, if necessary
Codecov Report
Base: 94.17% // Head: 94.17% // No change to project coverage :thumbsup:
Coverage data is based on head (c130b81) compared to base (47c9554). Patch has no changes to coverable lines.
Additional details and impacted files
@@ Coverage Diff @@
## main #1012 +/- ##
=======================================
Coverage 94.17% 94.17%
=======================================
Files 13 13
Lines 498 498
Branches 50 50
=======================================
Hits 469 469
Misses 20 20
Partials 9 9
The PR description also just says that "we are doing something" and not "why do we need this".
When running end to end, we observe that the pods are killed due to an OOMKilled issue:
kubectl describe pod -n benchmark benchmark-207ebe0-txn9h
W1031 18:55:22.893163 28770 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Name: benchmark-207ebe0-txn9h
Namespace: benchmark
Priority: 0
Service Account: default
Node: gke-astro-sdk-benchmark-4d102245-cmwm/10.128.0.30
Start Time: Mon, 31 Oct 2022 17:51:12 +0530
Labels: controller-uid=83ca59cd-5d4e-4ad8-9317-db6cb7c68b1c
job-name=benchmark-207ebe0
Annotations: <none>
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Container benchmark was using 27698652Ki, which exceeds its request of 16Gi.
IP: 10.84.5.10
IPs:
IP: 10.84.5.10
Controlled By: Job/benchmark-207ebe0
Containers:
benchmark:
Container ID: containerd://aaabe8ad80555104ca37b724961068c2fd9889e069dc5619d0f5a55b175ce73d
Image: gcr.io/astronomer-dag-authoring/benchmark:207ebe0
Image ID: gcr.io/astronomer-dag-authoring/benchmark@sha256:381642d3f6051beeac015b7f3e81422d37f7177f5b2fc2071c2723a04e0c46ca
Port: <none>
Host Port: <none>
Command:
./run.sh
State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Mon, 31 Oct 2022 17:51:52 +0530
Finished: Mon, 31 Oct 2022 18:45:35 +0530
Ready: False
Restart Count: 0
Limits:
memory: 32Gi
Requests:
memory: 16Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fjkff (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-fjkff:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 10m kubelet The node was low on resource: memory. Container benchmark was using 27698652Ki, which exceeds its request of 16Gi.
Normal Killing 10m kubelet Stopping container benchmark
Warning ExceededGracePeriod 10m kubelet Container runtime did not kill the pod within specified grace period.
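The eviction is consistent with the numbers above: the container requests 16Gi but was using 27698652Ki (~26.4Gi), and under node memory pressure the kubelet evicts Burstable pods running above their request first. A minimal sketch of the corresponding spec change, assuming a plain batch Job; only the names, namespace, image, and command are taken from the describe output above, everything else is a placeholder:

```sh
# Hypothetical re-creation of the benchmark Job with a memory request
# that covers the observed peak (~26.4Gi), so the scheduler reserves
# enough memory and the pod is not evicted under node pressure.
kubectl apply -n benchmark -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-207ebe0
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: gcr.io/astronomer-dag-authoring/benchmark:207ebe0
          command: ["./run.sh"]
          resources:
            requests:
              memory: 28Gi   # was 16Gi, below the observed ~26.4Gi peak
            limits:
              memory: 32Gi   # unchanged
EOF
```

Raising the request, not just the limit, is what matters here: node-pressure eviction targets pods whose usage exceeds their request when the node runs low on memory.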
On further investigation, I was not able to reproduce the memory leak locally; all the processes free up memory as expected. Tested with (see the sketch after this list):
- 400MB parquet file - Postgres
- 1GB parquet file - Postgres
- 5GB parquet file - Postgres
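A minimal sketch of the kind of local check used, assuming a hypothetical run_benchmark.py entry point and dataset flags; the "Maximum resident set size" line is printed by GNU time's -v flag:

```sh
# Hypothetical local check: run one benchmark case per dataset size and
# record the peak RSS. Script name and flags are placeholders; the
# "Maximum resident set size" line comes from GNU time (-v).
for size in 400mb 1gb 5gb; do
    /usr/bin/time -v python run_benchmark.py \
        --dataset "${size}.parquet" --database postgres 2>&1 \
      | grep "Maximum resident set size"
done
```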
Merging into main, since the failing tests are either already fixed in main or flaky.