
Github from-data: request only required data

bhack opened this issue 3 years ago • 3 comments

Right now, if you want to analyze GitHub, it will fetch all of these fields. And here is the GitHub Perceval doc:

See https://github.com/chaoss/grimoirelab/issues/428

It seems like a real waste of GitHub API rate limits and time for the standard dashboard/analysis, when you only need to populate the full project history on the first run.

bhack avatar May 31 '21 15:05 bhack

The original GrimoireLab assumption was: "we want to get all the data for a repo, store it in a database, and then access the database to get specific results". This is probably optimal for analyzing the full history of repositories, but I agree it is different from your use case, where (if I understand correctly) you only need issues and pull requests since some point in time (a few days ago, for example).

c&p from https://github.com/chaoss/grimoirelab/issues/428#issuecomment-851595177

sduenas avatar Jun 04 '21 16:06 sduenas

I'm not against this, but we need to find a scheme that works for several use cases and for all the backends we currently have, and there might be several approaches to this. For example, three possible solutions:

  • Lite items: a reduced version of the items that only contains basic information; for example, issues won't have comments, reactions, etc., and PRs or MRs won't include comments or information about revisions.
  • Filter out information: the user will be able to decide which information is filtered out or not retrieved for each item. We already have a case where this happens: the filter-classified option filters out those fields that contain personal or sensitive information.
  • Explicit: the user decides what information will be included with the items; for example, comments and reactions, only reactions, or reactions and reviews.
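To make the "filter out information" idea concrete, here is a minimal sketch (not part of GrimoireLab; the field names and the `filter_item_fields` helper are hypothetical) of stripping heavy sub-resources from a fetched item before it is stored:

```python
# Hypothetical sketch of the "filter out information" approach:
# drop heavy sub-resources (comments, reactions, reviews) from a
# fetched item before storing it. The field names are illustrative,
# not actual Perceval item fields.

HEAVY_FIELDS = {"comments_data", "reactions_data", "reviews_data"}

def filter_item_fields(item, excluded=HEAVY_FIELDS):
    """Return a copy of `item` without the excluded fields."""
    return {k: v for k, v in item.items() if k not in excluded}

issue = {
    "id": 432,
    "title": "request only required data",
    "state": "closed",
    "comments_data": [{"body": "..."}],
    "reactions_data": [{"content": "+1"}],
}

lite = filter_item_fields(issue)
# `lite` keeps only id, title, and state
```

The same helper could implement the "lite items" idea by making the excluded set a fixed, backend-defined default rather than a user choice.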

All of these solutions have pros and cons but my main concern is how the analyzers should handle data when it is not included.

sduenas avatar Jun 04 '21 16:06 sduenas

I think that the primary goal is to have a set that could be populated on a large scale github repository.

E.g., comments and reactions are going to create high traffic against the API rate limits. Which analyzer is going to consume this data? And if I don't use that analyzer, why is the crawler still going to request this data?

So we need to find a clear mapping between the analyzers and the crawlers.
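One way such a mapping could work (a sketch only; the analyzer names and registry below are hypothetical, not part of GrimoireLab): each analyzer declares the item fields it consumes, and the crawler requests only the union of the fields needed by the analyzers that are actually enabled.

```python
# Hypothetical analyzer-to-crawler mapping: each analyzer declares
# which item fields it consumes, so the crawler can skip API calls
# for data no enabled analyzer will ever read.

ANALYZER_FIELDS = {
    "issue_lifecycle": {"id", "state", "created_at", "closed_at"},
    "community_engagement": {"id", "comments_data", "reactions_data"},
}

def required_fields(enabled_analyzers):
    """Union of the fields needed by the enabled analyzers."""
    fields = set()
    for name in enabled_analyzers:
        fields |= ANALYZER_FIELDS[name]
    return fields

# With only the lifecycle analyzer enabled, comments and reactions
# would never be requested from the GitHub API.
needed = required_fields(["issue_lifecycle"])
```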

bhack avatar Jun 04 '21 16:06 bhack

Closing this due to no activity. The goal of GrimoireLab was already described in this comment: https://github.com/chaoss/grimoirelab/issues/432#issuecomment-854841252

sduenas avatar Oct 27 '23 16:10 sduenas