borges
create producer that handles both new repositories and updates
A policy will be needed to schedule:
- New repositories from rovers.
- Updates of repositories based on commit frequency.
- Updates based on repository status (pending or failed).
- Requeuing of buried jobs.
- Better logging (to see what's being fetched on update).
- Etc.?
A few questions about this:
- Would this go into the current producer, or do we want another kind of producer (with a `--update` flag or something) that keeps working and scheduling jobs?
- Regarding commit frequency: Do we score commits lower as they age (give more score to recent commits)? Do we update the repo every X (X being the commit frequency)? I assume we want to add the frequency to the database every time we update the repository so it's precomputed.
- Why would we update repositories in pending status if they're already queued for processing?
- How often should we retry failed repositories? (Right now they are set as pending, I think, but buried.) So this kind of merges with the question about retrying buried jobs. Perhaps failed jobs should be marked as `error` instead of `pending`. Shall we keep track in the database of how many times a repository has errored consecutively and stop retrying those?
- If we mark the failed (and thus buried) ones as `error`, why bury them at all if we have a process that requeues them? We can just query for `error` repos and add them to the queue.
- What exactly is "add better logging"?
> Would this go into the current producer, or do we want another kind of producer (with a `--update` flag or something) that keeps working and scheduling jobs?
I've opened an issue to discuss the organization of borges commands, that is, whether we add `--update` as a parameter or create a new subcommand: https://github.com/src-d/borges/issues/244
> Regarding commit frequency: Do we score commits lower as they age (give more score to recent commits)? Do we update the repo every X (X being the commit frequency)? I assume we want to add the frequency to the database every time we update the repository so it's precomputed.
Maybe we can use the difference between the time of the newest reference and the last time the repository was updated. Stagnant repositories will have this time difference increasing all the time.
> Why would we update repositories in pending status if they're already queued for processing?
Updating pending repos doesn't make much sense to me either.
> How often should we retry failed repositories? (Right now they are set as pending, I think, but buried.) So this kind of merges with the question about retrying buried jobs. Perhaps failed jobs should be marked as `error` instead of `pending`. Shall we keep track in the database of how many times a repository has errored consecutively and stop retrying those?
I believe that having an error state and the cause of the error would be needed to know which repos can be updated or left for dead.
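That bookkeeping could look something like the sketch below (the struct fields and the retry cut-off are assumptions for illustration, not borges' actual schema): count consecutive failures, keep the last cause for triage, and stop requeuing once a threshold is hit.

```go
package main

// errorState sketches tracking consecutive failures per repository so
// the producer can stop requeuing repositories that always fail.
type errorState struct {
	ConsecutiveErrors int
	LastError         string // the cause, kept so failures can be triaged later
}

const maxRetries = 5 // hypothetical cut-off before we give up on a repo

// recordResult updates the counter after a job finishes (errMsg == ""
// means success) and reports whether the repository should be retried.
func recordResult(s *errorState, errMsg string) (retry bool) {
	if errMsg == "" {
		s.ConsecutiveErrors = 0
		s.LastError = ""
		return true
	}
	s.ConsecutiveErrors++
	s.LastError = errMsg
	return s.ConsecutiveErrors < maxRetries
}
```

A success resets the counter, so only repositories that fail repeatedly in a row are left for dead.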
> What exactly is "add better logging"?
I'm not sure what it means here, but it would be nice to have more information on repositories that had errors. It's hard to find problems when logs are lost and we don't know what happened.
Silly question: why do we want to distinguish new repos vs. updates? What's the benefit? Can we approach the problem as if there is always an update, and a new repo is just an update from scratch of a non-existing repo?
@kuba-- we need to treat them differently because the source is different. New repositories come from rovers (a repository from rovers is called a mention, because you might already have that repository fetched).
This producer, instead of getting data from rovers, generates new jobs from the database itself, from repositories that we already have.
In the end, basically all of the data generated by all the producers has the same format, and the borges consumer does exactly the same work for each of them.
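That design can be sketched as follows (a hypothetical simplification, not borges' actual types): both producers emit the same job value, so the consumer loop never needs to know where a job came from.

```go
package main

// Job is the unit of work every producer emits.
type Job struct {
	RepositoryID string // only the repository ID travels on the queue
}

// Producer is anything that yields Jobs, whatever its source: a
// mention-based producer fed by rovers, or a database-backed one
// that schedules updates of repositories we already have.
type Producer interface {
	Next() (Job, bool)
}

// sliceProducer is a toy Producer backed by a fixed slice of jobs,
// standing in for either real source.
type sliceProducer struct{ jobs []Job }

func (s *sliceProducer) Next() (Job, bool) {
	if len(s.jobs) == 0 {
		return Job{}, false
	}
	j := s.jobs[0]
	s.jobs = s.jobs[1:]
	return j, true
}

// consume runs the same loop regardless of which producer filled the
// queue, which is the point: the consumer never cares about the source.
func consume(p Producer, process func(Job) error) error {
	for {
		job, ok := p.Next()
		if !ok {
			return nil
		}
		if err := process(job); err != nil {
			return err
		}
	}
}
```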