incubator-devlake
Allow scraping metrics to be throttled via UI
Description
👋 While scraping metrics from a large GitLab project (hundreds of thousands of pipelines, tens of thousands of merge requests), the triggered pipeline that collects the metrics through the GitLab API increased load to the point that users noticed degraded performance in the UI.
The logs were showing ~19-20 of these calls per second:
GET https://<gitlab_endpoint>/api/v4/projects/<project_id>/merge_requests/<merged_request_id>/notes?system=false&per_page=1&page=0
Proposed solution
Note: I did check for anything existing related to throttling but couldn't find it, even in the GitLab documentation.
It would be nice to be able to rate limit the number of API calls made per unit of time, or to set a float value to "wait" between API calls, in an effort to reduce the load. If this could be configured via the config-ui, that would be fantastic.
Has the Feature been Requested Before?
I couldn't find any via searching similar keywords. Feel free to close if this request is a duplicate.
Describe alternatives you've considered
An alternative feature would be to configure the number of pipelines, merge requests, etc. that are queried (as written this is scoped to GitLab, but it can be applied to other integrated services). For example, if I could configure only the last 10_000 merge requests and 25_000 pipelines to be scraped, that would reduce how long the scraping runs and provide more recent data for querying.
This is a great request. We should do this feature! Do you want to make a pull request? If not, I can get it into our company pipeline in the coming weeks!
This issue mentions two ways to implement rate limiting:
- Directly limit the number of API calls per unit of time (e.g. one minute). This way we don't need to change much code: we only need to pass one more param from config-ui and use it to calculate the interval that time.sleep waits between calls. Alternatively, just send a param that gives the interval between two API calls directly.
- Set a number of the most recent merge requests to be queried: I checked both the GitLab and GitHub APIs. They don't have params that cap the number of entries, but they have updated_after/created_after (GitLab) and since (GitHub), which limit the number of entries by time.
The API links and param descriptions are below:
- github issue api link: https://docs.github.com/en/github-ae@latest/rest/reference/commits
  - since: Only show notifications updated after the given time. This is a timestamp in ISO 8601 format: YYYY-MM-DDTHH:MM:SSZ
- gitlab merge request api link: https://docs.gitlab.com/ee/api/merge_requests.html
  - created_after: Return merge requests created on or after the given time. Expected in ISO 8601 format (2019-03-15T08:00:00Z)
  - updated_after: Return merge requests updated on or after the given time. Expected in ISO 8601 format (2019-03-15T08:00:00Z)
@yumengwang03 Please take a look at this. The setting is named API_REQUESTS_PER_HOUR in the backend, where it acts as a global default. We can also support a connection-level value that takes priority over it. Please add this to your connection page/dialog.
We should also consider creating a settings panel for Global Settings, including:
API_TIMEOUT=10s
API_RETRY=3
API_REQUESTS_PER_HOUR=10000
# Debug Info Warn Error
LOGGING_LEVEL=
DB_LOGGING_LEVEL=
Plan of attack: Add a setting to the connection page @e2corporation
There should be a design first.
@yumengwang03 FYI, we figured that the perfect place for this setting to sit is the connection.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
Hi @e2corporation, can we add this setting to the connection editing page in this iteration?
API swagger issue: https://github.com/apache/incubator-devlake/issues/2449 @e2corporation Can you add a numeric input for rateLimitPerHour to the connection edit page?