
Add benchmarking config

Open utkarsharma2 opened this issue 1 year ago • 2 comments

Description

What is the current behavior?

  1. We don't have the complete benchmarking config added to the main branch.
  2. As per the earlier discussion, we need to run benchmarks in parallel.
  3. Because of a memory leak, we need to increase the instance size in the config - https://github.com/astronomer/astro-sdk/issues/1058

related: #955

What is the new behavior?

  1. Added the complete grid config (a rough sketch of what such a grid could look like is shown below).
  2. Added multiple configs to be used in parallel jobs.
  3. Increased the machine size to n2-standard-16.
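
For illustration only, here is a minimal sketch of how a benchmark grid could be generated and split across parallel jobs. The dataset names, database list, job count, and output file names are hypothetical placeholders, not the actual config merged in this PR:

```python
# Hypothetical sketch: build a benchmark "grid" (dataset x database) and
# split it into chunks so each chunk can be handed to a separate parallel job.
import itertools
import json

datasets = ["ten_kb", "hundred_kb", "ten_mb", "one_gb", "five_gb"]  # assumed names
databases = ["postgres", "snowflake", "bigquery", "redshift"]       # assumed names

grid = [
    {"dataset": dataset, "database": database}
    for dataset, database in itertools.product(datasets, databases)
]

def split_into_jobs(configs, num_jobs):
    """Round-robin the grid entries into num_jobs buckets."""
    return [configs[i::num_jobs] for i in range(num_jobs)]

if __name__ == "__main__":
    for i, job_configs in enumerate(split_into_jobs(grid, num_jobs=4)):
        # Each file would then be passed to one parallel benchmark job.
        with open(f"benchmark_config_{i}.json", "w") as f:
            json.dump(job_configs, f, indent=2)
```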

Does this introduce a breaking change?

Nope

Checklist

  • [ ] Created tests that fail without the change (if possible)
  • [ ] Extended the README / documentation, if necessary

utkarsharma2 avatar Oct 06 '22 15:10 utkarsharma2

Codecov Report

Base: 94.17% // Head: 94.17% // No change to project coverage :thumbsup:

Coverage data is based on head (c130b81) compared to base (47c9554). Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1012   +/-   ##
=======================================
  Coverage   94.17%   94.17%           
=======================================
  Files          13       13           
  Lines         498      498           
  Branches       50       50           
=======================================
  Hits          469      469           
  Misses         20       20           
  Partials        9        9           


:umbrella: View full report at Codecov.

codecov[bot] avatar Oct 06 '22 15:10 codecov[bot]

The PR description also just says that "we are doing something" and not "why we need this".

kaxil avatar Oct 13 '22 16:10 kaxil

When running end to end, I'm observing that the pods are killed due to an OOMKilled issue:

kubectl describe pod -n benchmark benchmark-207ebe0-txn9h
W1031 18:55:22.893163   28770 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Name:             benchmark-207ebe0-txn9h
Namespace:        benchmark
Priority:         0
Service Account:  default
Node:             gke-astro-sdk-benchmark-4d102245-cmwm/10.128.0.30
Start Time:       Mon, 31 Oct 2022 17:51:12 +0530
Labels:           controller-uid=83ca59cd-5d4e-4ad8-9317-db6cb7c68b1c
                  job-name=benchmark-207ebe0
Annotations:      <none>
Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: memory. Container benchmark was using 27698652Ki, which exceeds its request of 16Gi.
IP:               10.84.5.10
IPs:
  IP:           10.84.5.10
Controlled By:  Job/benchmark-207ebe0
Containers:
  benchmark:
    Container ID:  containerd://aaabe8ad80555104ca37b724961068c2fd9889e069dc5619d0f5a55b175ce73d
    Image:         gcr.io/astronomer-dag-authoring/benchmark:207ebe0
    Image ID:      gcr.io/astronomer-dag-authoring/benchmark@sha256:381642d3f6051beeac015b7f3e81422d37f7177f5b2fc2071c2723a04e0c46ca
    Port:          <none>
    Host Port:     <none>
    Command:
      ./run.sh
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 31 Oct 2022 17:51:52 +0530
      Finished:     Mon, 31 Oct 2022 18:45:35 +0530
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  32Gi
    Requests:
      memory:     16Gi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-fjkff (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-fjkff:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age   From     Message
  ----     ------               ---   ----     -------
  Warning  Evicted              10m   kubelet  The node was low on resource: memory. Container benchmark was using 27698652Ki, which exceeds its request of 16Gi.
  Normal   Killing              10m   kubelet  Stopping container benchmark
  Warning  ExceededGracePeriod  10m   kubelet  Container runtime did not kill the pod within specified grace period.
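
The same details (requests/limits and the OOMKilled termination reason) can also be pulled programmatically. A minimal sketch using the official `kubernetes` Python client, assuming the package is installed and a working kubeconfig for the GKE cluster:

```python
# Hypothetical sketch: inspect the benchmark pod's resources and last state
# with the official `kubernetes` Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig with access to the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="benchmark-207ebe0-txn9h", namespace="benchmark")

# Print the declared memory requests/limits for each container.
for container in pod.spec.containers:
    print(container.name, "requests:", container.resources.requests,
          "limits:", container.resources.limits)

# Print the termination reason (e.g. OOMKilled) and exit code, if any.
for status in pod.status.container_statuses or []:
    terminated = status.state.terminated or status.last_state.terminated
    if terminated:
        print(status.name, "terminated:", terminated.reason,
              "exit code:", terminated.exit_code)
```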

utkarsharma2 avatar Oct 31 '22 13:10 utkarsharma2

On further investigation, I'm not able to reproduce the memory leak locally; all the processes free up memory as expected (memory profiles below; a rough sketch of the kind of check used is included after the list).

  1. 400 MB parquet file - Postgres (screenshot: 2022-10-31, 7:28:34 PM)
  2. 1 GB parquet file - Postgres (screenshot: 2022-10-31, 7:29:54 PM)
  3. 5 GB parquet file - Postgres (screenshot: 2022-10-31, 7:30:57 PM)
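
For reference, a minimal sketch of this kind of local check, assuming pandas, SQLAlchemy, psycopg2, and psutil are available; the file path and connection string are placeholders, not the actual benchmark harness:

```python
# Hypothetical sketch: sample the process RSS while a parquet file is loaded
# into Postgres, to check whether memory is released afterwards.
import os

import pandas as pd
import psutil
from sqlalchemy import create_engine

def rss_mb() -> float:
    """Resident set size of the current process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/benchmark")

print(f"before load: {rss_mb():.0f} MB")
df = pd.read_parquet("/tmp/sample.parquet")  # e.g. the 400 MB / 1 GB / 5 GB files
print(f"after read_parquet: {rss_mb():.0f} MB")

df.to_sql("benchmark_table", engine, if_exists="replace", index=False, chunksize=10_000)
print(f"after to_sql: {rss_mb():.0f} MB")

del df  # drop the DataFrame reference; RSS may not fall back fully right away
print(f"after del: {rss_mb():.0f} MB")
```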

utkarsharma2 avatar Oct 31 '22 14:10 utkarsharma2

Merging into main, since the failing tests are either already fixed in main or flaky.

utkarsharma2 avatar Nov 04 '22 08:11 utkarsharma2