Draft: Claim Discovery Task Atomization to Apply Task Deadline Individually Across Each Claim
Description
The goal of this PR is to refactor the Discovery domain so that discoverVertexHandler does not time out while making multiple GET requests for claims on the identity provider.
This will fix the issue listed below. Currently, what is failing the discovery of multiple claims is purely the fact that Discovery.verifyIdentityClaims is not an atomic operation: the method only returns once all ClaimIds and their associated Claims have been retrieved from the identity provider.
In the case of GitHub, where n is the number of claims, the requests made for the discovery of a given identity will be ceiling(n / 10) + n. The ceiling(n / 10) requests come from the fact that GitHub will only show 10 Gists per page, meaning that we have to paginate to get all ClaimIds. For example, 25 claims require ceiling(25 / 10) + 25 = 28 requests.
Given that the number of requests, and therefore the time taken, scales linearly with the number of claims, discoverVertexHandler times out when processing more than 1 Claim.
Task Aggregation
Upon realizing that debugging many, many, many tasks might be a pain, I've arrived at a more painless solution.
Instead of scheduling a task for each request made, we can simply refresh the ctx to emulate this behaviour. Refreshing a ctx will not reset the timer if the timer has already hit its deadline, which makes it very easy to emulate per-request deadlines. Furthermore, since the task is rescheduled when it is lazily cancelled by the task manager stopping, we can just parameterise our progress before the task manager shuts down.
From here on out, please keep this in mind when I talk about creating new tasks, as I am now doing this instead.
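As a minimal sketch of this refresh-between-requests idea (all names here are illustrative assumptions, not the actual Polykey or js-contexts API):

```ts
// Sketch only: emulating a per-request deadline by refreshing a timer between
// requests instead of scheduling one task per request.
type Ctx = {
  deadline: number; // epoch ms at which the current step times out
};

const STEP_TIMEOUT = 15_000; // assumed per-request timeout

// Refreshing only extends the deadline if it has not already been reached,
// mirroring "refreshing a ctx will not reset the timer once it has expired".
function refresh(ctx: Ctx): void {
  if (Date.now() < ctx.deadline) {
    ctx.deadline = Date.now() + STEP_TIMEOUT;
  }
}

async function processClaims(
  ctx: Ctx,
  claimIds: Array<string>,
  fetchClaim: (claimId: string) => Promise<unknown>,
): Promise<void> {
  for (const claimId of claimIds) {
    if (Date.now() >= ctx.deadline) {
      throw new Error('Deadline exceeded');
    }
    await fetchClaim(claimId); // one GET request per claim
    refresh(ctx); // the deadline now applies to the next request only
  }
}
```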
Claim Pagination
Given that paginating across a gist search to find all ClaimIds takes ceiling(n / 10) requests, we could run into a timeout on this task. This would happen with the current implementation:
In order to apply our timeout deadline to each request individually, we can instead have the task that processes the first page schedule a task to process the second page once it is done:
In order to do this we need a way to paginate through each page of Claims that is:
- the least common denominator of all APIs
- is serializable
The way we can do this is by returning a generic opaque string that represents the PaginationToken to be handed to the next task to execute its query. This type will have no meaning other than being what should be provided to the next function to get the next page.
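A rough sketch of such a token, assuming a branded string type (the names `Opaque` and `ProviderPaginationToken` are assumptions used for illustration):

```ts
// Sketch only: an opaque, serializable pagination token. The brand prevents
// callers from constructing or interpreting the token; only the issuing
// provider knows what it encodes.
type Opaque<K extends string, T> = T & { readonly __brand: K };

type ProviderPaginationToken = Opaque<'ProviderPaginationToken', string>;

// A GitHub provider might encode a page index, a Twitter provider a cursor or
// timestamp; the next task just hands the string back unchanged.
function githubNextPageToken(pageIndex: number): ProviderPaginationToken {
  return String(pageIndex) as ProviderPaginationToken;
}
```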
Claim Requests
Each task that processes a page can represent a pipeline that retrieves the contents of every claim on that particular page. The directed graph of the order in which tasks are scheduled can be represented as so:
This way, we can streamline the meaning of Discovery.discoverVertexTimeoutTime to be the timeout of each individual request in the Discovery task pipelines. By binding the timeout to a specific expectation, we can expect fewer unexpected errors such as the one found in the issue.
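To make the pipeline concrete, here is a hedged sketch (hypothetical names throughout); per the note under Task Aggregation, each "task" boundary is emulated by refreshing the ctx deadline rather than scheduling a literal new task:

```ts
// Sketch of the per-page pipeline: each step performs a single request under
// its own deadline, and the pagination token chains one page to the next.
type PageResult = {
  claimIds: Array<string>;
  nextToken?: string; // opaque pagination token for the following page
};

async function discoverIdentityClaims(
  identityId: string,
  refreshDeadline: () => void,
  getClaimIdsPage: (id: string, token?: string) => Promise<PageResult>,
  requestClaim: (id: string, claimId: string) => Promise<unknown>,
): Promise<void> {
  let token: string | undefined;
  do {
    refreshDeadline(); // deadline applies to this page request only
    const page = await getClaimIdsPage(identityId, token);
    for (const claimId of page.claimIds) {
      refreshDeadline(); // deadline applies to this claim request only
      await requestClaim(identityId, claimId);
    }
    token = page.nextToken;
  } while (token != null);
}
```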
Identities Provider Abstract Class Interface
Because we now have to account for pagination as well as getting claim ids in order to schedule tasks for requesting each claim, we need a getClaimIds async generator method on the Provider interface.
We will also need to add paginated versions of the getClaims and getClaimIds methods that also return a paginationToken. The reason for this is so that the task processing each request triggered by a call to a paginated get... method can resume if the task is stopped. This will be a bit messy, as we cannot follow our existing pagination conventions, considering that different providers use different pagination methods: GitHub uses a page index, but Twitter uses tokens and timestamps.
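A rough sketch of what these additions could look like (signatures and type names are assumptions, not the final API):

```ts
// Sketch only: hypothetical additions to the identities Provider abstract class.
// ProviderPaginationToken is an opaque string whose contents are provider-specific
// (a page index for GitHub, a cursor/timestamp for Twitter).
type IdentityId = string;
type ProviderIdentityClaimId = string;
type ProviderPaginationToken = string;
type IdentitySignedClaim = unknown;

abstract class Provider {
  // Iterate over all claim ids; may make many requests internally.
  public abstract getClaimIds(
    identityId: IdentityId,
  ): AsyncGenerator<ProviderIdentityClaimId>;

  // Paginated variants: one request per call, returning the token needed to
  // resume from the next page if the task is stopped and rescheduled.
  public abstract getClaimIdsPage(
    identityId: IdentityId,
    paginationToken?: ProviderPaginationToken,
  ): Promise<{
    claimIds: Array<ProviderIdentityClaimId>;
    paginationToken?: ProviderPaginationToken;
  }>;

  public abstract getClaimsPage(
    identityId: IdentityId,
    paginationToken?: ProviderPaginationToken,
  ): Promise<{
    claims: Array<IdentitySignedClaim>;
    paginationToken?: ProviderPaginationToken;
  }>;
}
```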
Issues Fixed
- Fixes https://github.com/MatrixAI/Polykey-CLI/issues/181
Tasks
- [x] 1. Update `Identities.Provider` abstract class to include paginated API options
- [x] 2. Change `Discovery.verifyIdentityClaims` so that we can schedule a task for requesting and processing each claim separately.
- [ ] 3. Implementing progress saving to the `Discovery` handler
- [ ] 4. Implement flag that will prefer usage of `getClaimPages` directly in the case where data can be aggregated more efficiently
Final checklist
- [ ] Domain specific tests
- [ ] Full tests
- [ ] Updated inline-comment documentation
- [ ] Lint fixed
- [ ] Squash and rebased
- [ ] Sanity check the final build
This is really well explained, good work!
Nice! I want to point out that we do have some notion of "time destructuring". https://github.com/MatrixAI/MatrixAI-Graph/issues/42
https://github.com/MatrixAI/MatrixAI-Graph/issues/43 - this is the right one.
Regarding your ideas on parallelising the request process. Just beware of a couple of things:
- Third party IdPs may change their UI - so it's best to rely on things that are unlikely to change a lot (as in, prefer things like URL patterns over specific buttons/text located in specific HTML elements)
- So if you want the count of gists... that may not always be the most reliable way of doing this.
- Beware of "scraping" detectors, you really need to balance the throughput with reliability.
- At the end of the day, I do believe in step-through timing, things need to run in the background with minimal CPU usage... etc - inch by inch is a better idea.
Hmm alright, I think I'll just stick with the step-through timing then.
Parallelism Draft
However, this is still kind of slow. We would ideally want the pages to be processed in parallel. GH search shows the total number of Gists returned for a specific query:
We can simply take this total as n and compute ceiling(n / 10) to figure out how many pages we need to request.
This will dramatically cut down on the amount of time required to get the ClaimIds.
Note that when there are no gist results, a different message is shown instead, so we should expect that the element showing the number of matching gists may not exist.
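A minimal sketch of that calculation (how the count is scraped is assumed and not shown here):

```ts
// Sketch only: derive how many gist pages to request in parallel from the
// total result count shown by GitHub's gist search. The count element may be
// absent when there are no matching gists, so treat a missing count as zero.
const GISTS_PER_PAGE = 10;

function pagesToRequest(totalGists: number | undefined): number {
  if (totalGists == null || totalGists <= 0) return 0;
  return Math.ceil(totalGists / GISTS_PER_PAGE);
}

// e.g. 25 matching gists => Math.ceil(25 / 10) = 3 page requests, issued concurrently
```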
According to the new implementation notes in the PR description, I have solved the issue of multiple claims. There are still some optimizations that need to be done tho (listed in tasks) :333
I get a bunch of warnings about dangling tasks being put into the queue each time I:
- run `polykey identities discover`
- run `polykey identities list` to see that the discovery background task has completed
- ctrl-c in the agent window
- the agent shuts down gracefully
- run `polykey agent start` again
I'm not exactly sure what these tasks are, given that the task manager is shutting down gracefully and that the discovery background task had already long completed before I shut down the agent. This problem only happens sometimes, so I'm not sure what it could be... I was consistently getting it, but after my latest commits, I am not.
I think in general tasks need to be checked for any weird errors. The latest staging version still comes up with things like this:
WARN:polykey.PolykeyAgent.task v0pi41vb17do01e9e5btfbaadtc:Failed - Reason: ErrorUtilsUndefinedBehaviour
THIS IS NOT A PROBLEM, DISREGARD THIS! THE ROOT CAUSE HAS BEEN IDENTIFIED IN THE FOLLOWING COMMENT
For discovery resumption there need to be changes in the code path. I have implemented the ProviderPaginationToken in order to support resumption of identity vertex discovery if the TaskManager is shut down. However, I have come to realize that the discovery of a Node Vertex and all connected Vertices is performed in a single Task. Therefore, passing a single ProviderPaginationToken around is not enough, as the discovery of a Node Vertex will potentially cause the discovery of multiple other Identity Vertices in one Task.
The solutions are either to:
- Split the discovery of each Vertex into its own Task. This might also be useful for streamlining the `queueDiscoveryByIdentity` and `queueDiscoveryByNode` methods, as it will allow us to avoid rescheduling discovery for an identity if it is already scheduled to be discovered by the discovery task of a neighbouring Vertex.
- Pass around a map of ProviderPaginationTokens instead for the claims discovery of each Identity Vertex (see the sketch after this list).
- Store ProviderPaginationTokens in the Database so that the same Task can be run multiple times, but will resume from the ProviderPaginationToken in the DB.
The main problem right now is that I've found that discovering a Node Vertex makes a Polykey agent request and iterate through the pages of claims on an identity just to obtain the providerIdentityClaimId used to index the claim on the Provider service. This is very inefficient, and has been a blocker, as it would require us to keep a PaginationToken for the Discovery resumption logic per Identity Claim on a given discovered node. Ideally we would want to be able to obtain the providerIdentityClaimId from the peer node directly.
There are 3 options for this that I have discussed with @tegefaulkes:
- Create the Claim locally, upload the Claim to the provider, get the `ProviderIdentityClaimId`, append it to the Claim locally, edit the corresponding remote claim on the provider, publish the claim. This will mean that the `ProviderIdentityClaimId` is stored identically both on the local claim in the SigChain as well as on the Provider. However, this would rely on the Provider having the ability to edit a Claim, which may not exist on a service like Twitter. Furthermore, the `ProviderIdentityClaimId` is redundant when stored on the Provider, as there is always some way to retrieve it from the Provider already.
- Create the Claim locally, publish the Claim to the provider, get the `ProviderIdentityClaimId`, append it to the Claim locally. This would solve the problem of depending on the Provider having functionality to edit a Claim. However, this would mean that the contents of the Claim deviate from the local copy in the SigChain.
- Create the Claim locally, publish the Claim to the provider, get the `ProviderIdentityClaimId`, store it locally in the database. When a peer node calls our `nodesClaimsGet` RPC handler, we respond not only with the `ClaimIdEncoded` and `SignedTokenEncoded`, but also the `ProviderIdentityClaimId`. However, the issue with this is that this information is not a part of the provenance history in the SigChain.
Edit:
It has been deemed that solution 2 is probably the way to go.
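For clarity, a hedged sketch of the solution 2 flow (the helper names are hypothetical, not existing methods):

```ts
// Sketch of solution 2: publish first, then append the ProviderIdentityClaimId
// to the local claim only, accepting that the published copy and the local
// SigChain copy differ by this one field.
type ProviderIdentityClaimId = string;

async function claimIdentity(
  createLocalClaim: () => Promise<object>,
  publishClaim: (claim: object) => Promise<ProviderIdentityClaimId>,
  appendToLocalClaim: (id: ProviderIdentityClaimId) => Promise<void>,
): Promise<void> {
  const claim = await createLocalClaim();                     // 1. create the Claim locally
  const providerIdentityClaimId = await publishClaim(claim);  // 2. publish to the provider
  await appendToLocalClaim(providerIdentityClaimId);          // 3. record the id locally
}
```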
@tegefaulkes roger, wanted to let you know that the PR can be handed off to you. The last part is just fixing #723 and then it should be ready to merge.
Sure thing, I'll get to it after addressing #720 and #651
Alright, I've fixed the main problem that was left. The main solution was to add a flag to scheduleDiscoveryForVertex called ignoreActive to ignore any active tasks when scheduling a new one. We only use this when rescheduling the task inside the task handler.
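A minimal sketch of the described flag (the signature is an assumption, not the actual method):

```ts
// Sketch only: scheduleDiscoveryForVertex with an ignoreActive flag. When
// rescheduling from inside the task handler, the currently running task would
// otherwise count as "active" and block the reschedule.
declare function findActiveDiscoveryTask(vertexId: string): Promise<unknown | null>;
declare function scheduleTask(handler: string, params: Array<unknown>): Promise<void>;

async function scheduleDiscoveryForVertex(
  vertexId: string,
  { ignoreActive = false }: { ignoreActive?: boolean } = {},
): Promise<void> {
  const active = await findActiveDiscoveryTask(vertexId);
  if (active != null && !ignoreActive) {
    return; // a discovery task for this vertex is already scheduled or running
  }
  await scheduleTask('discoverVertex', [vertexId]);
}
```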
Besides that I had to apply some small fixes here and there.
This should be ready for review now.
Everything seems good, merging now.