Frequent 429 errors from ClearlyDefined
We're getting periodic (and increasingly frequent) reports of failures to get a response from ClearlyDefined.
Many projects have integrated Dash License Tool calls into pull request workflows, for example. My (possibly naive) hypothesis is that each individual project is making a reasonable number of calls, but since these are running on shared infrastructure, during periods of high activity, the ClearlyDefined server is getting hammered in aggregate and is throttling. We're also likely competing for resources from other workflows in unrelated GitHub projects that use ClearlyDefined.
Some thoughts on how to deal with this...
- Implement rate limiting. My concern with this is that doing this will increase build times while the workflow is just sitting waiting. Also, if my hypothesis is correct, there's no way of knowing if it would even work: it's possible that we could still get squeezed out by competing processes and wind up in an endless cycle of waiting.
- Strongly encourage projects to configure their frequently executed workflows to treat this as a warning not a blocker.
What do you think, @HannesWell ?
The limits are:

| Endpoint | Method | Limit/Window |
|---|---|---|
| /definitions | POST | 250/min |
| /curations | POST | 250/min |
| /notices | POST | 250/min |

All other endpoints are limited to a maximum of 2,000 requests per minute.
https://docs.clearlydefined.io/docs/get-involved/using-data
and they are only able to track the IP.
They also say:
> You can check the x-ratelimit-limit and x-ratelimit-remaining response headers to track your usage and reset window.
And when I test the API, I see that the rate limit is higher:
GET (which we do not use):
x-ratelimit-limit: 5000
x-ratelimit-remaining: 4951
x-ratelimit-reset: 1742738168
POST:
x-ratelimit-limit: 0
x-ratelimit-remaining: 0
x-ratelimit-reset: 1742739261
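If someone wants to check their own window, the headers can be printed with something like this (the coordinate is the same one used in the curl experiment further down):

```bash
# Print only the rate-limit headers for a sample GET request
curl -sD - -o /dev/null \
  "https://api.clearlydefined.io/definitions/pypi/pypi/-/starlette/0.47.2" \
  | grep -i '^x-ratelimit'
```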
I also opened an issue, https://github.com/clearlydefined/service/issues/1296, to get a better understanding of how this should work.
> Many projects have integrated Dash License Tool calls into pull request workflows, for example. My (possibly naive) hypothesis is that each individual project is making a reasonable number of calls, but since these are running on shared infrastructure, during periods of high activity, the ClearlyDefined server is getting hammered in aggregate and is throttling. We're also likely competing for resources from other workflows in unrelated GitHub projects that use ClearlyDefined.
I assume this is not a bad guess. But maybe ClearlyDefined could be asked to handle 'big players' like GitHub less strictly. IIRC their IPs are well known, and there is even a GitHub API for their IPs.
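For example (if I remember the endpoint correctly, it's the REST `meta` endpoint):

```bash
# GitHub publishes its IP ranges (including the Actions runners)
# via the REST "meta" endpoint; jq is used here just to trim the output.
curl -s https://api.github.com/meta | jq '.actions[:5]'
```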
Besides that, I don't know a better solution than waiting longer based on an educated guess, as Stefan suggests.
Another option would be for Dash to become a server component, with all CI jobs firing against this server, which would then handle the rate limits.
This server could also look for new snapshots and releases of Eclipse Foundation projects and check them.
handle rate limit in httpClientService #450
IMHO, respecting rate limiting will buy us a little time, perhaps, to sort out a better solution. The problem is that multiple processes running on the same infrastructure are hammering on the API. Each process pausing for a bit while the others each take their turn and pause, and then everything waking up AT THE SAME TIME might prevent a small number of failures, but I'm thinking that it might actually amplify the problem.
I believe that we get a bigger win by just not using the tool as often. There is no reason to run the Eclipse Dash License Tool on every pull request. If project teams just run it weekly, then the likelihood of running into problems becomes much smaller.
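For example, switching a workflow to a weekly schedule is just a few lines (the cron expression below is only an illustration):

```yaml
# Run the license check once a week instead of on every pull request
on:
  schedule:
    - cron: '0 4 * * 1'  # Mondays at 04:00 UTC
  workflow_dispatch:     # still allow manual runs when dependencies change
```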
> Another option would be for Dash to become a server component, with all CI jobs firing against this server, which would then handle the rate limits.
Probably. But, I don't have resources to build -- and, more importantly, maintain and run -- a custom server-based application.
That the Eclipse Dash License Tool runs on the client was a key design feature. If we're going to deploy anything on the server, it'll be something like ORT (which is still on my list of things that we'd like to do).
> This server could also look for new snapshots and releases of Eclipse Foundation projects and check them.
Move it to the server AND make it actually go looking for work to do? Nope. This is out of scope for the Eclipse Dash License Tool. Perhaps another project could leverage Dash.
If we're going to do this, then we'd deploy ORT.
Just to list all the options on my mind. Not saying that all of them are good, and I know there are cons, some of which you described before.
- Set up a default proxy through which the communication is done.
- Do not use Dash to collect data; instead run a component that takes dependencies from the GitHub Insights dependencies endpoint and centralizes the requests.
I am pretty happy that Dash runs on each PR...
> I am pretty happy that Dash runs on each PR...
Why? Even when you're not changing any dependencies?
I understand that there is some non-zero chance that our assessments could change the results, but every pr... really?
It's just convenient; I would forget to start it every x PRs, or every time I should think about it. And if I run it on a schedule, I would not see when it fails. It also lets others see (especially in the JS/TS world) what stack of software a little change brings in.
> I believe that we get a bigger win by just not using the tool as often.
FMPOV, adding dependencies needs to be checked ahead of time. Running dash-license is then the verification of that check. In some cases updates may also pull in new indirect dependencies, but that is more a question of using only well-managed projects as dependencies. So I would guess that, as a quick solution, a job that runs dash-license once a week will do very well. If someone has doubts about a new or updated dependency, it's also possible to run it locally.
https://github.com/clearlydefined/service/issues/1296#issuecomment-2843861459
After my issue about the (erroneous) calculation of the rate limits, they seem to have changed something. Looks very promising.
Let's see what happens when they deploy from https://dev-api.clearlydefined.io/ to https://api.clearlydefined.io/.
If someone also wants to test: https://dev-api.clearlydefined.io/api-docs/#/definitions/post_definitions
It occurs to me that having the bots authenticate against ClearlyDefined might solve the problem.
https://github.com/clearlydefined/service/issues/1296#issuecomment-2855424861
Sure, but I am also sure that the x-ratelimit that was calculated was wrong by a factor of more than 20.
So it would be interesting to see what happens when they deploy the changes to production.
AFAICT, ClearlyDefined does not appear to formally provide any support for higher rate limits when you authenticate.
I decided to try it with my GitHub OAuth token anyway and had some disappointing results.
$ curl "https://api.clearlydefined.io/definitions/pypi/pypi/-/starlette/0.47.2" -H "Authorization: token $GITHUB_OAUTH" -v
My first attempt resulted in an HTTP 401 (unauthorised), which suggests that they do look at the token.
Making the call a second time results in an HTTP 200 and a payload. This just seems weird.
I'm concerned that there may be no path forward.
This is standard behavior because HTTP Basic Authentication is defined as a challenge-response mechanism. The client first sends a request without an Authorization header → the server replies with 401 Unauthorized and a WWW-Authenticate: Basic header, indicating that authentication is required. Only then does the client resend the request with credentials. This flow is specified in the HTTP standard (RFC 7617).
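As a sketch, this round trip can be reproduced with curl: with `--anyauth` it deliberately sends the first request without credentials, waits for the 401 challenge, and only then retries with an Authorization header (whether ClearlyDefined actually negotiates Basic auth here is a separate question):

```bash
# Two requests on the wire: first a 401 with a WWW-Authenticate challenge,
# then the authenticated retry. $TOKEN and the user name are placeholders.
curl -v --anyauth --user "someuser:$TOKEN" \
  "https://api.clearlydefined.io/definitions/pypi/pypi/-/starlette/0.47.2"
```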
It seems to be getting out of hand. Our license check jobs are constantly failing now and breaking the build status of most of our build jobs in LSP4E and WildWebDeveloper. Please find a solution for this.
Maybe implement a retry with exponential backoff, or support caching license check results in the GHA cache between runs in case the hashes of pom.xml/package.json files etc. haven't changed.
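A minimal sketch of such a retry wrapper, assuming `run-license-check.sh` stands in for whatever invokes the tool; the jitter is there so that parallel jobs don't all wake up at the same time:

```bash
#!/usr/bin/env bash
# Retry with exponential backoff and jitter around a placeholder command.
max_attempts=5
delay=10
for attempt in $(seq 1 "$max_attempts"); do
  if ./run-license-check.sh; then
    exit 0
  fi
  sleep_for=$(( delay + RANDOM % delay ))  # add jitter to the wait
  echo "Attempt $attempt failed; retrying in ${sleep_for}s..."
  sleep "$sleep_for"
  delay=$(( delay * 2 ))
done
echo "License check failed after $max_attempts attempts" >&2
exit 1
```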
@sebthom Meet me half-way. Running a licence check on every build is not a requirement. The Dash License Tool running successfully on every build is not a requirement.
What is the requirement? License check jobs were already configured when I joined the TM4E/LSP4E and WWD projects. They run for every push to master and each push on each PR.
@sebthom the IP Policy requires that all third-party content be vetted before it is included in or referenced by a release of the software.
My strong preference is that the tool only be run when the dependencies change.
The quick fix is to configure the tool to not fail the build when it fails (-Ddash.fail=false). This will make your build succeed, but doesn't actually solve the fundamental problem.
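For a Maven build, that looks something like this (using the license-tool plugin's `license-check` goal; adapt to your setup):

```bash
# Run the check, but let the build succeed even when vetting fails
mvn org.eclipse.dash:license-tool-plugin:license-check -Ddash.fail=false
```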
I believe that the fundamental problem is that we have too many automated builds calling the service from the same set of IP addresses. Any attempt to back off (exponentially or otherwise) might improve matters, but not as much as you think (if my working hypothesis is correct and a great many builds are calling out all of the time, then waiting any amount of time should only help if you get lucky timing-wise). But waiting will 100% increase build times.
Running the tool frequently also increases the risk that you'll create unnecessary review work for the IP Team by creating review requests for content that doesn't end up actually being included in a release (e.g., you may use multiple versions of a component while you're developing a feature, but only one of those versions ends up referenced by a release).
I've started experimenting with some local caching options that I may be able to make work with GHA cache (or at least serve as an example that somebody more knowledgeable than I am in such matters can use as a basis for a contribution to the project).
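A rough sketch of what that could look like in a workflow; the `.dash-cache` directory and the cache key are invented here, and the tool would need to learn to read and write such a cache:

```yaml
# Restore/save a hypothetical local result cache between workflow runs
- uses: actions/cache@v4
  with:
    path: .dash-cache
    key: dash-${{ hashFiles('**/pom.xml', '**/package-lock.json') }}
```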
I completely understand that committers want to be able to fire-and-forget all of this IP stuff... but we actually need committers to be mindful of what they're including in their builds. The tool was originally designed to be run manually by committers from their own workstations.
Thanks for the context. From a maintainer's perspective, deferring license checks until release is at odds with how PR‑driven development works today. We need IP clearance at PR time to merge safely. If we merge without clearance and wait until release, we can easily accumulate dozens of unvetted components, and IP reviews sometimes take weeks, which puts release dates at risk. So PR‑time checks remain essential for us today.
On the "fire‑and‑forget" point: with many contributors across multiple repos, manual steps do not scale. When the central service is flaky or rate‑limited, the lost time multiplies across everyone. The only sustainable path is to make the tool resilient, easy to use and cheap to run when it actually needs to run.
I will continue looking for ways to reduce the load we place on the licensing service. Right now we run license checks on every commit; I'll evaluate changing this so the checks trigger reliably only when dependency manifests/lockfiles change (pom.xml, *.target, package.json, etc.).
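Something along these lines, with the paths list adapted per repository:

```yaml
# Trigger the license check only when dependency manifests change
on:
  pull_request:
    paths:
      - '**/pom.xml'
      - '**/*.target'
      - '**/package.json'
      - '**/package-lock.json'
```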
However, since we're running on GitHub-hosted runners with shared egress IPs, and ClearlyDefined limits are per IP, jobs from other Eclipse projects and jobs from non-Eclipse projects use the same quota, which may explain the frequent 429s. So even if we reduce our calls, we will still be impacted by rate limits triggered by others. I currently have license check workflow runs that, even after retries, never succeed. I feel that the Eclipse Foundation really needs to consider providing a more reliable solution, perhaps a caching proxy that wraps around the ClearlyDefined API.
Maybe we should create a dependency file from the dash plugin and only check the items that are new compared to the file?
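A minimal sketch, assuming `deps.txt` is the freshly generated dependency list and `baseline.txt` is the committed list of already-vetted IDs:

```bash
# List only the dependencies that are new relative to the committed baseline
sort -u deps.txt > /tmp/deps.sorted
sort -u baseline.txt > /tmp/baseline.sorted
comm -23 /tmp/deps.sorted /tmp/baseline.sorted > new-deps.txt
if [ -s new-deps.txt ]; then
  echo "New dependencies to vet:"
  cat new-deps.txt
  # ...run the Dash License Tool against new-deps.txt only...
else
  echo "No new dependencies; skipping the ClearlyDefined queries."
fi
```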
> My strong preference is that the tool only be run when the dependencies change.
I obviously agree with Wayne. That's the way to go. In many cases it may even be possible to run the tool locally.
> We need IP clearance at PR time to merge safely.
There may be different understandings of "clearance". AFAIK, before a new dependency is used at all, someone should/must check its license status, ideally before the PR, because otherwise the work may be in vain. That may be done by running this tool locally or just by having a look at the dependency's license information. Passing that is the first step, and you may then use this tool (again locally) to create some review tickets if ClearlyDefined has a different status for that dependency. Then wait, and usually after a short time, all will be "green". That may fail in rare cases and cause some work to revert a PR, when the dependency finally fails to provide a proper license state. But in recent years that has never happened.
> deferring license checks until release is at odds with how PR‑driven development works today.
So, I hope your contribution guide is explicit about the required licenses for third-party stuff?
Maybe, just to better understand the situation:
How many PRs did you receive this year that added new third-party dependencies whose license state hadn't been considered before the PR?
I just want to +1 @sebthom. We are also running Dash on every PR whenever it is created or changed. Just running it after many changes may have accumulated is not useful. It should not be the responsibility of the projects to run this manually or re-invent clever caching/diff mechanisms. While I understand the Dash maintainer perspective here, from a user perspective it turns what could be a "simple useful" tool into an "annoying process thing to get right and maintain".
Ideally (I know, I know, somebody would need to code it) the tool could maybe just talk to an Eclipse server, and that would talk to ClearlyDefined. (Also, I am not sure of the mechanics: wouldn't the ClearlyDefined entries also "migrate" to the Eclipse curation DB over time? And what happens with a high ClearlyDefined confidence score? Is it still put into the Eclipse curation DB, or is it ignored there, so you always need to hit ClearlyDefined for those components again?)
FWIW, I ended up restricting the license check to only run on changes to the target file (i.e., only when dependencies actually change), which is similar to what Wayne suggested and which seems to be running fine so far.
We only update 3rd party components perhaps a dozen times per year, so running the check on every PR turns out to be really excessive. Which is also why I'm not sure whether "just" moving the burden to the EF really solves the problem, because you still have the high number of requests that need to be processed.
So one service throwing a 429 would simply be replaced with a different service throwing a 429...
"annoying process thing to get right and maintain".
The process before, where the project team needed to create CQ tickets, was annoying.
> perhaps a dozen times per year,
In my experience, that's the case for a lot of projects. There are many more PRs that don't change the dependencies. That is why I also asked about the numbers. I don't think that someone will put time into an "improvement" when the users are not even willing to provide their numbers.
> So one service throwing a 429 would simply be replaced with a different service throwing a 429...
True enough, but at least that would be in Eclipse's hands, so the user group is smaller, we would see what is acceptable, and if it gets out of hand it would be easy to identify the "worst offenders".
> The process before, where the project team needed to create CQ tickets, was annoying.
Also agreed, but should IPzilla (or whatever it was named) really be the benchmark here?
> > perhaps a dozen times per year,
> In my experience, that's the case for a lot of projects. There are many more PRs that don't change the dependencies. That is why I also asked about the numbers.
Understandable, and unfortunately I cannot provide any numbers at the moment. However, I think the main issue would remain: even if a given project is really considerate and has a nice automation that only scans 3 new dependencies 3 times a year, if not everyone else is equally "well-behaved", even the good guys might run into the 429s, ultimately requiring manual work.
What I can say for our project is that on main we are living on the bleeding edge, so in case someone updates cargo.lock there will be many "new" dependencies (i.e., new patch levels).
In the end, somewhere, manually or automated, some logic is needed like "Did I already ask about this dep, and was it OK?", and I do believe this should be implemented only once, to get it right. This might be a more complex Eclipse service, or maybe something completely different; e.g., maybe somebody already has a "less dumb" GitHub Action that is able to do that.
Couldn't the Dash license check cache the already-validated ClearlyDefined entries so that ClearlyDefined is not re-queried when a dependency was already approved? Projects have a lot of project-specific things to deal with on their end; requiring them to build a cache/diff of dependencies themselves to compensate for the fact that the Dash License Tool doesn't do this doesn't seem like a service to them.
I acknowledge your frustrations. I share them. Given infinite resources, I'd have had this problem solved by now.
FWIW, I agree that the IPZilla experience should not be our benchmark.
I'm extremely proud of the service that we've been able to build around the Eclipse Dash License Tool and the IPLab backend with extremely limited resources. I am appreciative of the amount of value that we've been able to get out of ClearlyDefined up to this point. But we're clearly at a crossroads.
Here's what we are doing...
We're engaged with the folks at ClearlyDefined and are trying to identify opportunities for us to help them make the service more reliable. We've gained significant value from them over the years, but have only been able to make limited contributions back. We need to do more.
I'm investigating options to have our service make the calls to ClearlyDefined and cache the results. There are some challenges with this approach that I'll document if we decide to pursue it.
We're investigating options to keep a local cache of all ClearlyDefined data with our IPLab data.
I'm also investigating options to have the Eclipse Dash License Tool cache ClearlyDefined data locally. A challenge with this option is that a meaningful local cache takes on different forms on individual workstations, GitHub actions, and Jenkins jobs.
@mickaelistria has provided us a data point for another solution that we may consider. But I respectfully request that you check with EMO before you take it upon yourself to creatively explore opportunities that could impact others.