borges

create producer that handles both new repositories and updates

Open smola opened this issue 8 years ago • 4 comments

A policy will be needed to schedule:

  • New repositories from rovers.
  • Update of repositories based on commit frequency
  • Update based on repository status (pending or failed)
  • Requeue buried jobs
  • Add better logging (to see what's being fetched on update)
  • Etc?

smola, Jun 27 '17 10:06

A few questions about this:

  • Would this go into the current producer, or do we want another kind of producer (with a --update or something) that keeps working and scheduling jobs?
  • Regarding commit frequency: Do we score commits lower as they age (give more score to recent commits)? Do we update the repo every X (X being the commit frequency)? I assume we want to add the frequency to the database every time we update the repository so it's precomputed.
  • Why would we update repositories in pending status if they're already queued for processing?
  • How often should we retry failed repositories? (Right now they are set as pending, I think, but buried.) This kind of merges with the point about retrying buried jobs. Perhaps failed jobs should be marked as error instead of pending. Shall we keep track in the database of how many times a repository has errored consecutively and stop retrying those?
  • If we mark the ones that are failed (and thus, buried) as error, why bury them at all if we have a process that requeues them? We can just query for errored repos and add them to the queue.
  • What exactly is "add better logging"?

erizocosmico, Apr 03 '18 10:04

> Would this go into the current producer, or do we want another kind of producer (with a --update or something) that keeps working and scheduling jobs?

I've opened an issue to discuss the organization of borges commands, that is, whether we add --update as a parameter or create a new subcommand: https://github.com/src-d/borges/issues/244

> Regarding commit frequency: Do we score commits lower as they age (give more score to recent commits)? Do we update the repo every X (X being the commit frequency)? I assume we want to add the frequency to the database every time we update the repository so it's precomputed.

Maybe we can use the difference between the time of the newest reference and the time of the last update. Stagnant repositories will have this difference increasing all the time.

> Why would we update repositories in pending status if they're already queued for processing?

Updating pending repos doesn't make much sense to me either.

> How often should we retry failed repositories? (Right now they are set as pending, I think, but buried.) This kind of merges with the point about retrying buried jobs. Perhaps failed jobs should be marked as error instead of pending. Shall we keep track in the database of how many times a repository has errored consecutively and stop retrying those?

I believe that having an error state and the cause of the error would be needed to know which repos can be updated and which should be left for dead.
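A minimal Go sketch of what such an error state could look like, combining the stored cause with the consecutive-failure counter suggested earlier in the thread. The `RepoState` type, the status names, and the retry limit of 3 are assumptions chosen for illustration, not the actual borges data model.

```go
package main

// Status is a hypothetical repository status; "error" is the state
// proposed in this discussion.
type Status string

const (
	StatusPending Status = "pending"
	StatusFetched Status = "fetched"
	StatusError   Status = "error"
)

// maxConsecutiveErrors is an arbitrary limit picked for this example.
const maxConsecutiveErrors = 3

// RepoState holds the bookkeeping needed to decide whether a failed
// repository is worth requeuing.
type RepoState struct {
	Status            Status
	ConsecutiveErrors int
	LastError         string // cause of the most recent failure
}

// RecordFailure marks the repository as errored and remembers the cause.
func (s *RepoState) RecordFailure(cause string) {
	s.Status = StatusError
	s.ConsecutiveErrors++
	s.LastError = cause
}

// RecordSuccess resets the failure bookkeeping after a successful fetch.
func (s *RepoState) RecordSuccess() {
	s.Status = StatusFetched
	s.ConsecutiveErrors = 0
	s.LastError = ""
}

// ShouldRetry reports whether an errored repository should be requeued
// or left for dead.
func (s RepoState) ShouldRetry() bool {
	return s.Status == StatusError && s.ConsecutiveErrors < maxConsecutiveErrors
}
```

With something like this, a requeue process could query for repositories in the error state and skip those that have reached the limit, instead of relying on buried jobs.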

> What exactly is "add better logging"?

I'm not sure what it means here, but it would be nice to have more information on repositories that had errors. It's hard to find problems when logs are lost and we don't know what happened.

jfontan, Apr 04 '18 09:04

Silly question: why do we want to distinguish new repos from updates? What's the benefit? Can we approach the problem as if there is always an update, and a new repo is just an update of everything, from scratch, for a repo that does not exist yet?

kuba--, Apr 06 '18 15:04

@kuba-- we need to treat them differently because the source is different. New repositories come from rovers (a repository from rovers is called a mention, because you might already have that repository fetched).

This producer, instead of getting data from rovers, generates new jobs from the database itself, from repositories that we already have.

In the end, basically all of the data generated by all the producers has the same format, and the borges consumer does exactly the same work for each of them.
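To make that concrete, here is a small Go sketch in which two producers with different sources emit the same job type, so a single consumer loop can handle both. `Job`, `JobSource`, `NewMentionSource`, `NewUpdateSource`, and `Drain` are hypothetical names for this example, and both sources are backed by in-memory slices as stand-ins for rovers and the database.

```go
package main

import "io"

// Job is the common unit of work, whichever producer created it.
type Job struct {
	RepositoryID string
}

// JobSource is a hypothetical interface shared by all producers.
// Next returns io.EOF when the source has no more jobs.
type JobSource interface {
	Next() (*Job, error)
}

// sliceSource is an in-memory stand-in for a real source.
type sliceSource struct {
	ids []string
}

func (s *sliceSource) Next() (*Job, error) {
	if len(s.ids) == 0 {
		return nil, io.EOF
	}
	id := s.ids[0]
	s.ids = s.ids[1:]
	return &Job{RepositoryID: id}, nil
}

// NewMentionSource stands in for a producer fed by rovers mentions.
func NewMentionSource(ids ...string) JobSource { return &sliceSource{ids: ids} }

// NewUpdateSource stands in for a producer that reads repositories
// already stored in the database.
func NewUpdateSource(ids ...string) JobSource { return &sliceSource{ids: ids} }

// Drain collects the jobs of every source, in order. The consumer side
// never needs to know which producer a job came from.
func Drain(sources ...JobSource) ([]Job, error) {
	var jobs []Job
	for _, src := range sources {
		for {
			job, err := src.Next()
			if err == io.EOF {
				break
			}
			if err != nil {
				return nil, err
			}
			jobs = append(jobs, *job)
		}
	}
	return jobs, nil
}
```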

ajnavarro, Apr 09 '18 09:04