[Question][DORA] Ignoring Automated VCS Changes
Question
Hi folks,
We’re using DevLake to track DORA metrics, and we’ve noticed that automated commits and PRs (e.g., from Renovate or Dependabot) are skewing our metrics — especially lead time and deployment frequency. These automated changes usually have close to zero code time and very low pickup/review time, since engineers often merge them as soon as the test pipelines pass. In some cases, they are even automatically merged when all the pipelines succeed.
Is there already a way in DevLake to exclude commits or PRs from specific authors (like bots) when calculating DORA metrics? If not, we’d be interested in contributing a solution, as we believe this could be useful for others as well.
Thanks in advance!
Hello @rodrigoluizs! I'm Dosu, here to help you with bugs, questions, or even becoming a contributor. While you're waiting for a human maintainer, feel free to ask anything!
I found a similar closed issue that might be relevant to your question. In the issue titled "[Feature][Dashboard] Robot removal," a solution was implemented to address the exclusion of bot-generated data. This issue was closed by a pull request that added bot removal patterns to the PR metric documentation [1].
If you're interested in contributing a solution to exclude commits or PRs from specific authors like bots, you can follow the feature request process outlined in the feature request template. This involves searching for existing issues, providing a detailed description of your proposed feature, and indicating your willingness to submit a pull request [2].
Not that I know of. The idea sounds great to me; feel free to go ahead, and let me know if you need any help.
Thanks @klesh ! I went ahead and did a bit of analysis on how this could be implemented.
Below is a draft proposal with possible approaches, pros/cons, and a few open questions for feedback. Let me know what you think — happy to iterate on it or prepare an RFC based on where you’d see the most value!
Excluding Automated Bot Changes from DORA Metrics
1. Filtering Approaches
I'm thinking about two main approaches to apply bot filtering in the queries used by Grafana dashboards:
Option 1: Filter dynamically by author name pattern (e.g., %bot%)
SQL example:
```sql
SELECT a.full_name, prm.*
FROM devlake.project_pr_metrics prm
INNER JOIN devlake.pull_requests pr ON prm.id = pr.id
INNER JOIN devlake.accounts a ON pr.author_name = a.full_name
WHERE a.full_name NOT LIKE '%bot%';
```
Pros:
- No schema changes needed
- Quick to implement
Cons:
- Pattern matching is error-prone (some bots might not match)
- Makes queries heavier and harder to maintain
- Hard to manually override for special cases
Option 2: Introduce a flag to pre-classify bots and filter on it
Add a boolean `is_bot` field to the `accounts` table and propagate it into `project_pr_metrics` as a new `is_authored_by_bot` field.
SQL example:
```sql
SELECT *
FROM devlake.project_pr_metrics
WHERE NOT is_authored_by_bot;
```
Pros:
- Queries stay clean and fast
- Easier to manually control (allow exceptions, etc.)
- More future-proof if bot naming conventions change
Cons:
- Requires schema changes and changes to data ingestion
2. How to Populate the New Flag
In case we decide to go for a new flag, we see two options for how the is_bot flag could be populated:
Option 1: Automatic Detection
- When saving accounts, automatically set `is_bot = true` if:
  - `full_name` or `user_name` contains typical bot patterns like `bot`, `renovate`, `dependabot`, etc.
Pros:
- No manual maintenance needed
- Works for most typical bots
Cons:
- Might miss or misclassify accounts if naming conventions vary
  - This could be mitigated by allowing the bot name patterns to be defined in the Git integration configs
Option 2: Manual Control
- Provide an API endpoint (or reuse an existing admin tool) that allows us to manually set or update `is_bot` on accounts.
Pros:
- Full flexibility for maintainers
- Correct classification possible even for unusual bot names
Cons:
- Requires manual effort to maintain the bot list
Note:
Both approaches could potentially be combined if preferred — automatic detection for most accounts, with manual overrides where needed.
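To make the combined approach concrete, here is a minimal Go sketch (DevLake's backend is written in Go). The pattern list, the `isBot` helper, and the override map are all hypothetical names for illustration, not existing DevLake code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// defaultBotPatterns mirrors the kind of default pattern list the proposal
// suggests making configurable; the values here are only examples.
var defaultBotPatterns = []string{`(?i)\bbot\b`, `(?i)renovate`, `(?i)dependabot`}

// isBot reports whether an account name should be classified as a bot.
// Manual overrides win over pattern matching, implementing the
// "automatic detection + manual override" combination from the note above.
func isBot(name string, patterns []string, overrides map[string]bool) bool {
	if v, ok := overrides[strings.ToLower(name)]; ok {
		return v
	}
	for _, p := range patterns {
		if regexp.MustCompile(p).MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	// A human account with a bot-like name, manually marked as not a bot.
	overrides := map[string]bool{"deploy-bot": false}
	fmt.Println(isBot("renovate[bot]", defaultBotPatterns, overrides)) // true
	fmt.Println(isBot("alice", defaultBotPatterns, overrides))         // false
	fmt.Println(isBot("deploy-bot", defaultBotPatterns, overrides))    // false (override)
}
```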
3. Grafana Dashboard Behavior
We also considered two options for displaying the filtered DORA metrics:
Option 1: Add a new variable to existing dashboards
- New Grafana variable: `include_bots`
  - Values: `true`, `false`
  - Default: `true` (bots included) to preserve current behavior and avoid surprising users
- Queries would check the variable:
  `WHERE (${include_bots} = 'true' OR is_authored_by_bot = FALSE)`
Pros:
- No need to maintain separate dashboards
- Easy for users to switch views
- Low maintenance overhead
Cons:
- Slightly more complex queries
- Might be overlooked without proper UI hint
Option 2: Create separate dashboards excluding bots
- Clone existing DORA dashboards
- Hardcode the filtering (`WHERE is_authored_by_bot = FALSE`)
- Label clearly, e.g., “DORA Metrics (Human-only Changes)”
Pros:
- Very clear and easy for users to find
- No need for users to toggle anything
Cons:
- Need to maintain two sets of dashboards
- Possible risk of dashboard drift over time
Summary
| Decision Point | Option 1 | Option 2 |
|---|---|---|
| Filtering Method | Author name pattern matching | Pre-calculated `is_bot` flag |
| Bot Detection | Automatic detection | Manual control (or both combined) |
| Grafana Dashboards | Add a variable to existing dashboards | Create new dashboards for human-only metrics |
Open Questions for Maintainers
- What’s the preferred approach for filtering out automated changes — would dynamic filtering (e.g., by name pattern), a flag-based approach, or something else work best within DevLake’s model?
- For identifying bots, do you see more value in automatic detection, manual control, a combination of both, or would you suggest a different mechanism?
- What would be the best way to surface filtered metrics in Grafana — a dashboard variable, a separate dashboard, or another method?
- If this sounds good, should I prepare a full RFC?
Apologies for the delayed response. Thanks for the well-written proposal; all options are clear and well-defined. I believe that the is_bot flag is a good approach and offers long-term flexibility. I suggest combining automated detection (initial identification) with manual overrides for optimal flexibility. Exposing the bot name pattern via an environment variable would provide further customization. Regarding the dashboards, I don't have a strong preference at the moment. Perhaps @Startrekzky can offer some insights?
Hi @rodrigoluizs ,
- For the filtering method, I don't have a preference between the two, and I have two further questions regarding Option 2:
  - Which tables will the `is_bot` flag be added to? Only the `accounts` table?
  - In which cases will `is_bot` take effect? I'm not worried about the dashboard queries, but about the plugins' internal processing logic; for instance, the calculation in the DORA plugin that generates the `project_pr_metrics` table might also take the bot PRs or commits into account. If so, the plugin's processing logic needs to be updated as well after the `is_bot` flag is introduced.
- For bot detection, we could use environment variables with default values to achieve both auto + manual control
- For Grafana dashboards, if `is_bot` is added, updating the SQL in the existing dashboards would be my choice.
Cc. @klesh
Thanks for the feedback and questions, @Startrekzky and @klesh!
> Which tables will the `is_bot` flag be added to? Only the `accounts` table?
Yes, the `is_bot` flag would be added only to the `accounts` table. In addition, my plan was to add a new column to `project_pr_metrics` called `is_authored_by_bot`, since I believe that name better reflects the context of the pull request entity and makes the intent clearer when querying.
> In which cases will `is_bot` take effect? I'm not worried about the dashboard queries, but about the plugins' internal processing logic; for instance, the calculation in the DORA plugin that generates the `project_pr_metrics` table might also take the bot PRs or commits into account. If so, the plugin's processing logic needs to be updated as well after the `is_bot` flag is introduced.
You’re absolutely right: for this to work reliably, the DORA plugin’s processing logic that populates `project_pr_metrics` would also need to be updated to propagate the `is_bot` value from the author account into the `is_authored_by_bot` field during metric calculation. That way, downstream queries (like the Grafana dashboards) can filter without needing to join back to the `accounts` table.
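To illustrate that propagation step, here is a minimal Go sketch. The struct fields and the `propagateBotFlag` helper are illustrative stand-ins, not the actual DevLake domain models or plugin code:

```go
package main

import "fmt"

// Minimal stand-ins for the real DevLake domain models
// (field sets are illustrative, not the actual schema).
type Account struct {
	FullName string
	IsBot    bool
}

type ProjectPrMetric struct {
	PrID            string
	AuthorName      string
	IsAuthoredByBot bool
}

// propagateBotFlag copies accounts.is_bot onto each PR metric row as
// is_authored_by_bot, so dashboards can filter without re-joining accounts.
func propagateBotFlag(metrics []ProjectPrMetric, accounts map[string]Account) []ProjectPrMetric {
	for i, m := range metrics {
		if acc, ok := accounts[m.AuthorName]; ok {
			metrics[i].IsAuthoredByBot = acc.IsBot
		}
	}
	return metrics
}

func main() {
	accounts := map[string]Account{
		"renovate[bot]": {FullName: "renovate[bot]", IsBot: true},
		"alice":         {FullName: "alice", IsBot: false},
	}
	metrics := []ProjectPrMetric{
		{PrID: "1", AuthorName: "renovate[bot]"},
		{PrID: "2", AuthorName: "alice"},
	}
	for _, m := range propagateBotFlag(metrics, accounts) {
		fmt.Println(m.PrID, m.IsAuthoredByBot) // "1 true", then "2 false"
	}
}
```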
> For bot detection, we could use environment variables with default values to achieve both auto + manual control
Using environment variables to control the bot name patterns sounds like a great way to support both automatic detection and manual overrides — I’ll incorporate that into the plan as well.
> For Grafana dashboards, if `is_bot` is added, updating the SQL in the existing dashboards would be my choice.
That was also my preferred approach — nice to hear that you agree!
Just a small note: my intention was to filter on the new `is_authored_by_bot` column in `project_pr_metrics`.
Follow-up
Based on your input, my current understanding of the preferred direction is:
- Filtering approach: Use Option 2, introducing a flag (`is_bot` in `accounts`, and `is_authored_by_bot` in `project_pr_metrics`)
- Bot detection: Combine automatic detection with manual override, using an environment variable to define bot name patterns
- Dashboard behavior:
  - Update the existing dashboards to support filtering based on `is_authored_by_bot`
  - I’d like your feedback on the idea of introducing an `include_bots` variable to control whether bot-authored changes are filtered in the queries or not. The idea here is to avoid introducing a breaking change and to keep the DORA metrics the same for users who do not explicitly opt in to this new feature.
Does this align with how you both see it?
Do you agree with the proposed column names, `is_bot` for the `accounts` table and `is_authored_by_bot` for `project_pr_metrics`?
Additionally, I’d appreciate some clarification on how new environment variables can be introduced in DevLake, as I’m not very familiar with that part of the project yet.
Just want to make sure we’re on the same page before moving forward with an RFC or implementation.
Sounds solid to me.
I am fine either way, with or without the `include_bots` variable.
I agree with the column names.
> Additionally, I’d appreciate some clarification on how new environment variables can be introduced in DevLake, as I’m not very familiar with that part of the project yet.
Add it to the `env.example` file with a comment describing the variable, then update https://github.com/apache/incubator-devlake/blob/main/backend/core/config/config_viper.go#L102 if you want to give it a default value.
@rodrigoluizs Looks great to me.
Based on this approach, you might need to touch every Git-related plugin (`gitlab`, `github`, `bitbucket`) to populate the newly added fields. @klesh is our lead developer; please let him know when you have any questions later, here or in Slack.
Thank you for the proposal.
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.
Implementation is still in progress. Please keep the issue open.
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.
Implementation is still in progress. Please keep the issue open.
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.