astro-sdk
Streamline Benchmarking Process and add it to CI
**Please describe the feature you'd like to see**

Currently, we cannot run benchmarking processes for various reasons:
- Datasets do not work for all the databases
- Datasets are not available in every file type and size
**Describe the solution you'd like**

Depends on:
- [x] Change benchmark tool to run in the cloud (GCP) - https://github.com/astronomer/astro-sdk/issues/432 - @pankajastro
- [x] Add datasets - https://github.com/astronomer/astro-sdk/issues/574 - required - @sunank200
- [x] #883 - required - @sunank200
- [x] https://github.com/astronomer/astro-sdk/issues/884 - required - @utkarsharma2
- [x] https://github.com/astronomer/astro-sdk/issues/885 - required - @sunank200
- [x] https://github.com/astronomer/astro-sdk/issues/886 - required - @pankajastro
- [x] https://github.com/astronomer/astro-sdk/issues/887 - required - @utkarsharma2
- [x] https://github.com/astronomer/astro-sdk/issues/888 - required - @utkarsharma2
- [x] https://github.com/astronomer/astro-sdk/issues/914 - required - @pankajastro
- [ ] https://github.com/astronomer/astro-sdk/issues/1058 - required - @utkarsharma2
- [ ] Automate the generation of benchmarking results and posting them to results.md - good to have
- [ ] In case of failure of the benchmark script, we should post a message to the Slack channel, preferably adding an @author tag. - good to have
- [ ] In case of success of the benchmark script, we can add the results of the CI job to the PR that is used to run it. - good to have
- [x] Benchmark script fails to run on GKE with images built on Mac M1 - https://github.com/astronomer/astro-sdk/issues/834 - good to have - @pankajastro
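The Slack-notification item above could start from a small helper like the sketch below. This is an assumption about the eventual shape, not existing project code: the job name, run URL, and author handle are placeholders, and mentioning the author only pings if a Slack member ID is used.

```python
import json
import urllib.request


def build_failure_message(job_name: str, run_url: str, author: str) -> dict:
    """Build a Slack incoming-webhook payload for a failed benchmark run.

    Note: a plain GitHub handle renders as text in Slack; an actual ping
    requires a Slack member ID mention such as "<@U123ABC>".
    """
    return {
        "text": (
            f":red_circle: Benchmark job `{job_name}` failed.\n"
            f"Run: {run_url}\n"
            f"cc {author}"
        )
    }


def notify_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The CI job would call `notify_slack` only on failure, with the webhook URL injected as a secret.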
Note - we also need to run the operators against real datasets, perhaps on a weekly basis: the load_file operator for the different databases.
- Run multiple load_file tasks per file size. - good to have
  - This will affect memory usage.
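The "multiple load_file tasks per file size" idea, with its memory impact, could be measured along these lines. This is a minimal sketch: the `load_fn` placeholder and the file-size labels are assumptions, standing in for real Astro SDK load_file runs against actual datasets.

```python
import time
import tracemalloc
from statistics import mean


def benchmark(load_fn, runs: int = 3) -> dict:
    """Run `load_fn` several times, recording wall time and peak memory."""
    durations, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        load_fn()
        durations.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
        peaks.append(peak)
        tracemalloc.stop()
    return {"mean_s": mean(durations), "peak_bytes": max(peaks)}


# Hypothetical per-file-size matrix; real runs would load actual datasets
# of each size instead of this placeholder allocation.
file_sizes = ["ten_kb", "hundred_kb", "ten_mb"]
results = {
    size: benchmark(lambda size=size: bytearray(1024))
    for size in file_sizes
}
```

Running the sizes sequentially, as here, avoids concurrent tasks skewing each other's memory readings; a concurrent variant would need the isolation concerns raised later in this thread addressed first.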
**Are there any alternatives to this feature?**

Open for suggestions.
**Acceptance Criteria**
- [ ] All the required tickets should be implemented
@utkarsharma2 following our meeting earlier today:
- Synthetic vs. real datasets: synthetic datasets certainly add simplicity to troubleshooting and running things. I'm fine with using this type of dataset as long as we also have a ticket to make sure the Astro SDK remains able to run `load_file` for all supported databases against the real datasets we have at the moment. I'd expect this to run at the same frequency as the benchmark, including before releases.

  My concern with not having real datasets is that we have little feedback from our user base, and these datasets have already contributed to us having a more reliable code base. They run without issues for Snowflake and Postgres - and I'd love to see them running smoothly for BQ and Redshift.

  Was it considered to use an explicit schema for them?
- We already have a ticket to cover the CI job: #443
- Also covered by: #443
- This should be implemented carefully to avoid the results of concurrent tests affecting one another, particularly for Postgres - which was not set up to be highly scalable and is running in a container. It is also important to make sure that the current calculations of memory and disk usage remain unaffected.
- At the moment, the containers are already labelled with the Git hash. The only thing that has to be changed is the K8s Job definition.
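Pointing the K8s Job at the hash-tagged image could be done by rendering the Job manifest from the current Git hash, roughly as below. The registry path and job/container names here are placeholders, not the project's actual values.

```python
import subprocess


def current_git_hash() -> str:
    """Short hash of HEAD, matching the tag already applied to the containers."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()


def benchmark_job_manifest(image_repo: str, git_hash: str) -> dict:
    """Minimal K8s Job manifest (as a dict) pinned to the hash-tagged image."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"benchmark-{git_hash}"},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "benchmark", "image": f"{image_repo}:{git_hash}"}
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
    }
```

Including the hash in the Job name also keeps runs for different commits from colliding in the cluster.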