Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim
Description
The goal of this PR is to refactor the Discovery domain so that discoverVertexHandler does not time out while making multiple GET requests for claims on the identity provider.
This will fix the issue listed below. Currently, what is failing the discovery of multiple claims is purely the fact that Discovery.verifyIdentityClaims is not an atomic operation: the method only returns once all ClaimIds and their associated Claims have been retrieved from the identity provider.
In the case of GitHub, where n is the number of claims, the requests made for the discovery of a given identity will be ceiling(n / 10) + n. The ceiling(n / 10) requests come from the fact that GitHub will only show 10 Gists per page, meaning that we have to paginate to get all ClaimIds. For example, 25 claims require ceiling(25 / 10) + 25 = 28 requests.
Given that the number of requests, and therefore the time taken, scales linearly with the number of claims, discoverVertexHandler times out when processing more than 1 Claim.
Task Aggregation
Upon realizing that debugging many, many, many tasks might be a pain, I've arrived at a more painless solution.
Instead of scheduling a task for each request made, we can simply refresh the ctx to emulate this behaviour. Refreshing a ctx will not reset the timer if the timer has already hit its deadline, which makes it very easy to emulate per-request deadlines. Furthermore, since the task is rescheduled when it is lazily cancelled by the task manager stopping, we can just parameterise our progress before the task manager shuts down.
From here on out, please keep this in mind when I talk about creating new tasks, as I am now doing this instead.
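As a minimal sketch of this refresh-between-requests idea (all names here are illustrative assumptions, not the actual Polykey or js-contexts API):

```ts
// Sketch only: emulating a per-request deadline by refreshing a timer between
// requests instead of scheduling one task per request.
type Ctx = {
  deadline: number; // epoch ms at which the current step times out
};

const STEP_TIMEOUT = 15_000; // assumed per-request timeout

// Refreshing only extends the deadline if it has not already been reached,
// mirroring "refreshing a ctx will not reset the timer once it has expired".
function refresh(ctx: Ctx): void {
  if (Date.now() < ctx.deadline) {
    ctx.deadline = Date.now() + STEP_TIMEOUT;
  }
}

async function processClaims(
  ctx: Ctx,
  claimIds: Array<string>,
  fetchClaim: (claimId: string) => Promise<unknown>,
): Promise<void> {
  for (const claimId of claimIds) {
    if (Date.now() >= ctx.deadline) {
      throw new Error('Deadline exceeded');
    }
    await fetchClaim(claimId); // one GET request per claim
    refresh(ctx); // the deadline now applies to the next request only
  }
}
```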
Claim Pagination
Given that paginating across a gist search to find all ClaimIds takes ceiling(n / 10) requests, we could run into a timeout on this task. This would happen with the current implementation:
In order to apply our timeout deadline to each request individually, we can instead have the task that processes the first page schedule a task to process the second page once it is done:
In order to do this we need a way to paginate through each page of Claims that is:
- the least common denominator of all APIs
- is serializable
The way we can do this is by returning a generic opaque string that represents the PaginationToken to be handed to the next task to execute its query. This type will have no meaning other than being what should be provided to the next function to get the next page.
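A rough sketch of such a token, assuming a branded string type (the names `Opaque` and `ProviderPaginationToken` are assumptions used for illustration):

```ts
// Sketch only: an opaque, serializable pagination token. The brand prevents
// callers from constructing or interpreting the token; only the issuing
// provider knows what it encodes.
type Opaque<K extends string, T> = T & { readonly __brand: K };

type ProviderPaginationToken = Opaque<'ProviderPaginationToken', string>;

// A GitHub provider might encode a page index, a Twitter provider a cursor or
// timestamp; the next task just hands the string back unchanged.
function githubNextPageToken(pageIndex: number): ProviderPaginationToken {
  return String(pageIndex) as ProviderPaginationToken;
}
```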
Claim Requests
Each task that processes a page can represent a pipeline that retrieves the contents of every claim on that particular page. The directed graph of the order in which tasks are scheduled can be represented as so:
This way, we can streamline the meaning of Discovery.discoverVertexTimeoutTime to be the timeout of each individual request in the Discovery task pipelines. By binding the timeout to a specific expectation, we can expect fewer unexpected errors such as the one found in the issue.
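To make the pipeline concrete, here is a hedged sketch (hypothetical names throughout); per the note under Task Aggregation, each "task" boundary is emulated by refreshing the ctx deadline rather than scheduling a literal new task:

```ts
// Sketch of the per-page pipeline: each step performs a single request under
// its own deadline, and the pagination token chains one page to the next.
type PageResult = {
  claimIds: Array<string>;
  nextToken?: string; // opaque pagination token for the following page
};

async function discoverIdentityClaims(
  identityId: string,
  refreshDeadline: () => void,
  getClaimIdsPage: (id: string, token?: string) => Promise<PageResult>,
  requestClaim: (id: string, claimId: string) => Promise<unknown>,
): Promise<void> {
  let token: string | undefined;
  do {
    refreshDeadline(); // deadline applies to this page request only
    const page = await getClaimIdsPage(identityId, token);
    for (const claimId of page.claimIds) {
      refreshDeadline(); // deadline applies to this claim request only
      await requestClaim(identityId, claimId);
    }
    token = page.nextToken;
  } while (token != null);
}
```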
Identities Provider Abstract Class Interface
Because we now have to account for pagination as well as getting claim ids in order to schedule tasks for requesting each claim, we need a getClaimIds async generator method on the Provider interface.
We will also need to add paginated versions of the getClaims and getClaimIds methods that also return a paginationToken. The reason for this is so that the task processing each request triggered by a call to a paginated get... method can resume if the task is stopped. This will be a bit messy, as we cannot follow our existing pagination conventions, considering that different providers use different pagination methods: GitHub uses a page index, but Twitter uses tokens and timestamps.
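A rough sketch of what these additions could look like (signatures and type names are assumptions, not the final API):

```ts
// Sketch only: hypothetical additions to the identities Provider abstract class.
// ProviderPaginationToken is an opaque string whose contents are provider-specific
// (a page index for GitHub, a cursor/timestamp for Twitter).
type IdentityId = string;
type ProviderIdentityClaimId = string;
type ProviderPaginationToken = string;
type IdentitySignedClaim = unknown;

abstract class Provider {
  // Iterate over all claim ids; may make many requests internally.
  public abstract getClaimIds(
    identityId: IdentityId,
  ): AsyncGenerator<ProviderIdentityClaimId>;

  // Paginated variants: one request per call, returning the token needed to
  // resume from the next page if the task is stopped and rescheduled.
  public abstract getClaimIdsPage(
    identityId: IdentityId,
    paginationToken?: ProviderPaginationToken,
  ): Promise<{
    claimIds: Array<ProviderIdentityClaimId>;
    paginationToken?: ProviderPaginationToken;
  }>;

  public abstract getClaimsPage(
    identityId: IdentityId,
    paginationToken?: ProviderPaginationToken,
  ): Promise<{
    claims: Array<IdentitySignedClaim>;
    paginationToken?: ProviderPaginationToken;
  }>;
}
```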
Issues Fixed
- Fixes https://github.com/MatrixAI/Polykey-CLI/issues/181
Tasks
- [x] 1. Update `Identities.Provider` abstract class to include paginated API options
- [x] 2. Change `Discovery.verifyIdentityClaims` so that we can schedule a task for requesting and processing each claim separately.
- [ ] 3. Implementing progress saving to the `Discovery` handler
- [ ] 4. Implement flag that will prefer usage of `getClaimPages` directly in the case where data can be aggregated more efficiently
Final checklist
- [ ] Domain specific tests
- [ ] Full tests
- [ ] Updated inline-comment documentation
- [ ] Lint fixed
- [ ] Squash and rebased
- [ ] Sanity check the final build
This is really well explained, good work!
Nice! I want to point out that we do have some notion of "time destructuring". https://github.com/MatrixAI/MatrixAI-Graph/issues/42
https://github.com/MatrixAI/MatrixAI-Graph/issues/43 - this is the right one.
Regarding your ideas on parallelising the request process. Just beware of a couple of things:
- Third party IdPs may change their UI - so it's best to rely on things that are unlikely to change a lot (as in, prefer things like URL patterns over specific buttons/text located in specific HTML elements)
- So if you want the count of gists... that may not always be the most reliable way of doing this.
- Beware of "scraping" detectors, you really need to balance the throughput with reliability.
- At the end of the day, I do believe in step-through timing, things need to run in the background with minimal CPU usage... etc - inch by inch is a better idea.
Hmm alright, I think I'll just stick with the step-through timing then.
Parallelism Draft
However, this is still kind of slow. We would ideally want the pages to be processed in parallel. GH search shows the total number of Gists returned for a specific query:
We can simply take this total as n and compute ceiling(n / 10) to figure out how many pages we need to request.
This will dramatically cut down on the amount of time required to get the ClaimIds.
Note that when there are no gist results, a different message is shown instead, so we should expect that the element showing the number of matching gists may not exist.
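A minimal sketch of that calculation (how the count is scraped is assumed and not shown here):

```ts
// Sketch only: derive how many gist pages to request in parallel from the
// total result count shown by GitHub's gist search. The count element may be
// absent when there are no matching gists, so treat a missing count as zero.
const GISTS_PER_PAGE = 10;

function pagesToRequest(totalGists: number | undefined): number {
  if (totalGists == null || totalGists <= 0) return 0;
  return Math.ceil(totalGists / GISTS_PER_PAGE);
}

// e.g. 25 matching gists => Math.ceil(25 / 10) = 3 page requests, issued concurrently
```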
According to the new implementation notes in the PR description, I have solved the issue of multiple claims. There are still some optimizations that need to be done tho (listed in tasks) :333
I get a bunch of warnings about dangling tasks being put into the queue each time I:
- run `polykey identities discover`
- run `polykey identities list` to see that the discovery background task has completed
- ctrl-c in the agent window
- the agent shuts down gracefully
- run `polykey agent start` again
I'm not exactly sure what these tasks are, given that the task manager is shutting down gracefully and that the discovery background task had already long completed before I shut down the agent. This problem only happens sometimes, so I'm not sure what it could be... I was consistently getting it, but after my latest commits, I am not.
I think in general tasks need to be checked for any weird errors. The latest staging version still comes up with things like this:
WARN:polykey.PolykeyAgent.task v0pi41vb17do01e9e5btfbaadtc:Failed - Reason: ErrorUtilsUndefinedBehaviour
THIS IS NOT A PROBLEM, DISREGARD THIS! THE ROOT CAUSE HAS BEEN IDENTIFIED IN THE FOLLOWING COMMENT
For discovery resumption there need to be changes in the code path. I have implemented the ProviderPaginationToken in order to support resumption of identity vertex discovery if the TaskManager is shut down. However, I have come to realize that the discovery of a Node Vertex and all connected Vertices is performed in a single Task. Therefore, passing a single ProviderPaginationToken around is not enough, as the discovery of a Node Vertex will potentially cause the discovery of multiple other Identity Vertices in one Task.
The solutions are either to:
- Split the discovery of each Vertex into its own Task. This might also be useful for streamlining the `queueDiscoveryByIdentity` and `queueDiscoveryByNode` methods, as it will allow us to avoid rescheduling discovery for an identity if it is already scheduled to be discovered by the discovery task of a neighbouring Vertex.
- Pass around a map of ProviderPaginationTokens instead for the claims discovery of each Identity Vertex (see the sketch after this list).
- Store ProviderPaginationTokens in the Database so that the same Task can be run multiple times, but will resume from the ProviderPaginationToken in the DB.
The main problem right now is that I've found that discovering a Node Vertex makes a Polykey agent request and iterate through the pages of claims on an identity just to obtain the providerIdentityClaimId used to index the claim on the Provider service. This is very inefficient, and has been a blocker, as it would require us to keep a PaginationToken for the Discovery resumption logic per Identity Claim on a given discovered node. Ideally we would want to be able to obtain the providerIdentityClaimId from the peer node directly.
There are 3 options for this that I have discussed with @tegefaulkes:
- Create the Claim locally, upload the Claim to the provider, get the `ProviderIdentityClaimId`, append it to the Claim locally, edit the corresponding remote claim on the provider, publish the claim. This will mean that the `ProviderIdentityClaimId` is stored identically both on the local claim in the SigChain as well as on the Provider. However, this would rely on the Provider having the ability to edit a Claim, which may not exist on a service like Twitter. Furthermore, the `ProviderIdentityClaimId` is redundant when stored on the Provider, as there is always some way to retrieve it from the Provider already.
- Create the Claim locally, publish the Claim to the provider, get the `ProviderIdentityClaimId`, append it to the Claim locally. This would solve the problem of depending on the Provider having functionality to edit a Claim. However, this would mean that the contents of the Claim deviate from the local copy in the SigChain.
- Create the Claim locally, publish the Claim to the provider, get the `ProviderIdentityClaimId`, store it locally in the database. When a peer node calls our `nodesClaimsGet` RPC handler, we respond not only with the `ClaimIdEncoded` and `SignedTokenEncoded`, but also the `ProviderIdentityClaimId`. However, the issue with this is that this information is not a part of the provenance history in the SigChain.
Edit:
It has been deemed that solution 2 is probably the way to go.
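For clarity, a hedged sketch of the solution 2 flow (the helper names are hypothetical, not existing methods):

```ts
// Sketch of solution 2: publish first, then append the ProviderIdentityClaimId
// to the local claim only, accepting that the published copy and the local
// SigChain copy differ by this one field.
type ProviderIdentityClaimId = string;

async function claimIdentity(
  createLocalClaim: () => Promise<object>,
  publishClaim: (claim: object) => Promise<ProviderIdentityClaimId>,
  appendToLocalClaim: (id: ProviderIdentityClaimId) => Promise<void>,
): Promise<void> {
  const claim = await createLocalClaim();                     // 1. create the Claim locally
  const providerIdentityClaimId = await publishClaim(claim);  // 2. publish to the provider
  await appendToLocalClaim(providerIdentityClaimId);          // 3. record the id locally
}
```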
@tegefaulkes roger, wanted to let you know that the PR can be handed off to you. The last part is just fixing #723 and then it should be ready to merge.
Sure thing, I'll get to it after addressing #720 and #651
Alright, I've fixed the main problem that was left. The main solution was to add a flag to scheduleDiscoveryForVertex called ignoreActive to ignore any active tasks when scheduling a new one. We only use this when rescheduling the task inside the task handler.
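A minimal sketch of the described flag (the signature is an assumption, not the actual method):

```ts
// Sketch only: scheduleDiscoveryForVertex with an ignoreActive flag. When
// rescheduling from inside the task handler, the currently running task would
// otherwise count as "active" and block the reschedule.
declare function findActiveDiscoveryTask(vertexId: string): Promise<unknown | null>;
declare function scheduleTask(handler: string, params: Array<unknown>): Promise<void>;

async function scheduleDiscoveryForVertex(
  vertexId: string,
  { ignoreActive = false }: { ignoreActive?: boolean } = {},
): Promise<void> {
  const active = await findActiveDiscoveryTask(vertexId);
  if (active != null && !ignoreActive) {
    return; // a discovery task for this vertex is already scheduled or running
  }
  await scheduleTask('discoverVertex', [vertexId]);
}
```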
Besides that I had to apply some small fixes here and there.
This should be ready for review now.
Everything seems good, merging now.