Explore profiling tools for pipeline tracking and metrics gathering;
Research possible frameworks and tools for profiling our CI and the overall Nebari deployment, to gain insight into how Nebari currently performs.
This task can be worked on in parallel, and we expect to compare notes and the possible implications of each framework:
Possible options so far (can be extended further):
- OpenTelemetry (in code)
- Python profiling
- GitHub custom profiler actions (OTel)
- ...
So, we will start exploring some tools and brainstorming ideas; keeping source code changes to a minimum and maintainability in mind would be nice (so that our future selves will be happy :smile:).
Generate an outline of our findings:
- What is it?
- Value and Benefits
- Implications (pros vs. cons)
This issue is the first step towards addressing #2413. If we know what stages/services/resources are taking the longest time to deploy and destroy, we can identify current bottlenecks and work on solutions to improve our CI feedback time.
At a high level, Nebari uses Terraform under the hood to deploy all the infrastructure required to run a Kubernetes cluster, and then uses the Helm provider to deploy and configure different services inside the cluster (e.g., Keycloak, JupyterHub, Dask). At the moment, we only have a rough idea of how long the complete deployment and destruction steps take. Ideally, we should be able to get detailed information about each component involved.
Here are some relevant considerations before deciding what approach we will implement:
- Should this run inside our CI workflows or should we extend it so users can get detailed information on their deployment/destruction duration?
- What kind of data granularity are we striving for? Is a per-stage breakdown enough or do we want to be able to identify up to individual resources?
- Do we want to store profiling information over time? Will it be available just for a particular run, or will it be part of some kind of internal one-time exercise?
- Is there any intersection with other efforts on improving Nebari (e.g., implementing observability)?
Here are some alternatives I see for implementing this:
- Use an observability framework to manually instrument the code deploying/destroying the stages and get detailed traces. We can use OpenTelemetry's Python SDK for this (see the sketch after this list).
- Use a Python profiler. We could use cProfile to wrap the `subprocess` calls we make when deploying/destroying (I don't think the internal logic of rendering files and passing variables between stages is worth profiling, as it probably accounts for a very small percentage of the complete execution time).
- Use a tool that can parse the Terraform logs to get insights. It seems like https://github.com/datarootsio/tf-profile might be useful for this.
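To make the first alternative more concrete, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK (opentelemetry-sdk package). The `deploy_stage` helper and the stage names are hypothetical stand-ins for the real Nebari deploy entry points, and spans are simply printed to the console:

```python
# Minimal sketch of the OpenTelemetry alternative, assuming the opentelemetry-sdk
# package. `deploy_stage` and the stage names are hypothetical stand-ins for the
# real Nebari deploy entry points; spans are just printed to the console here.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("nebari.deploy")


def deploy_stage(stage_name: str) -> None:
    # One span per stage, so the exported trace gives a per-stage duration breakdown.
    with tracer.start_as_current_span(f"deploy/{stage_name}"):
        ...  # run the actual `terraform apply` subprocess for this stage here


for stage in ["01-terraform-state", "02-infrastructure", "07-kubernetes-services"]:
    deploy_stage(stage)
```

The same spans could later be sent to an OTLP-compatible backend instead of the console if we ever decide to keep traces over time.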
That is a wonderful summary, @marcelovilla, and I completely agree with the alternative approaches you came up with. I'll also bring up some considerations regarding your questions, at least in my opinion:
Should this run inside our CI workflows or should we extend it so users can get detailed information on their deployment/destruction duration?
I would like us to start with things that are easily attachable to our code (such as plugins or extensions) and have the first work focus on CI. I found this article about Pyroscope that may help us generate meaningful profiling data without interfering with the code itself: https://pyroscope.io/blog/ci-profiling-with-pyroscope/
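For reference, a minimal configuration sketch assuming the pyroscope-io Python package and a Pyroscope server reachable from the CI runner at the address below (both assumptions; the blog post above covers the CI-specific setup in more detail):

```python
# Minimal sketch, assuming the pyroscope-io package and a Pyroscope server
# reachable from the CI runner at the address below (both assumptions).
import pyroscope

pyroscope.configure(
    application_name="nebari.ci.deploy",     # how profiles show up in the Pyroscope UI
    server_address="http://localhost:4040",  # server started as part of the CI job
    tags={"workflow": "local-integration"},  # e.g. tag profiles per CI workflow
)

# From here on, the agent samples the running Python process in the background,
# so the deploy/destroy code itself does not need to change.
```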
What kind of data granularity are we striving for? Is a per-stage breakdown enough or do we want to be able to identify up to individual resources?
That's an excellent question, and I don't think we will have an answer for it right away. I would instead replace it with: what do we expect to get from these tests? Is there any level of information that is valuable for us right now? If yes, how can we measure it?
Do we want to store profiling information over time? Will it be just available for a particular run, or will it be part of some kind of internal one-time exercise?
The only reason to store it that comes to mind would be reporting, which would be nice for presentations... but until we get a good grasp of how to interpret such data, this looks like a low priority to me.
Also, I was expecting simple reporting, for example the execution time of each stage in a final report after the deploy. That has its own value and helps a lot when describing how long users should expect to wait.
@viniciusdc and I met to further discuss this and see what the next steps might be.
We both agree that the fewer code base changes and dependencies we introduce, at least for now, the better. With this in mind, we decided to leverage the fact that Terraform can produce plain-text output with information about the duration of each stage's creation/destruction. We'll work on making sure we keep these files in a temporary folder when deploying/destroying Nebari so we can parse them (using tf-profile or our own custom parser) at the end of the Terraform apply process. We'll implement this logic inside a custom plugin using Nebari's extension system. This will allow us to keep both things apart, without introducing changes to Nebari's code base. We still need to decide whether this will run inside our local integration workflow, or even in the cloud provider deployment workflows.
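To illustrate the custom-parser fallback, here is a rough sketch that scans a saved `terraform apply`/`terraform destroy` log for "... complete after <duration>" lines and prints the slowest resources. The assumption that the log is plain text without color codes is mine; tf-profile would give richer statistics out of the box.

```python
# Rough sketch of the custom-parser fallback: scan a saved Terraform log for
# "<resource>: Creation/Destruction complete after <duration>" lines and print
# the slowest resources. Assumes a plain-text log without color codes.
import re
import sys
from pathlib import Path

LINE_RE = re.compile(
    r"^(?P<resource>\S+): (?:Creation|Destruction|Modifications) complete after (?P<duration>\S+)"
)


def parse_duration(text: str) -> float:
    """Convert Terraform durations like '1m30s' or '45s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match:
        return 0.0
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def slowest_resources(log_path: Path, top: int = 10) -> list[tuple[str, float]]:
    durations = []
    for line in log_path.read_text().splitlines():
        if m := LINE_RE.match(line.strip()):
            durations.append((m["resource"], parse_duration(m["duration"])))
    return sorted(durations, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    for resource, seconds in slowest_resources(Path(sys.argv[1])):
        print(f"{seconds:>7.0f}s  {resource}")
```

Running this against the saved log of a full deploy would already give the per-resource breakdown discussed above.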
Action items:
- [ ] Make sure we're exporting Terraform logs to a temporary folder after each stage
- [ ] Use `tf-profile` or custom logic to parse the logs and get meaningful insights from them
- [ ] Implement this logic in a custom plugin using Nebari's extension system (see the rough skeleton below)
- [ ] Incorporate the use of the custom plugin into one or multiple of our CI workflows
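As a starting point for the plugin item, a very rough skeleton of what it could look like, assuming Nebari's pluggy-based extension system exposes a `hookimpl` marker and a `nebari_subcommand` hook in `nebari.hookspecs` (treat these names, the CLI command, and the log directory as assumptions; the actual hooks used in the POC may differ):

```python
# Very rough skeleton of the custom plugin, assuming Nebari's pluggy-based
# extension system exposes `hookimpl` and a `nebari_subcommand` hook in
# `nebari.hookspecs` (names and signatures are assumptions).
from pathlib import Path

from nebari.hookspecs import hookimpl


@hookimpl
def nebari_subcommand(cli):
    @cli.command(name="tf-profile")
    def tf_profile(log_dir: Path = Path("/tmp/nebari-tf-logs")):
        """Summarize stage/resource durations from saved Terraform logs."""
        for log_file in sorted(log_dir.glob("*.log")):
            # Reuse the parsing logic sketched above (or shell out to tf-profile)
            # and print a per-stage duration table here.
            print(f"would parse {log_file} ...")
```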
Plugin work repo (POC): https://github.com/nebari-dev/nebari-tf-profile-plugin