[Feature][GitHub] Update models in real time with GitHub events

Open oliviertassinari opened this issue 6 months ago • 4 comments

Search before asking

  • [x] I had searched in the issues and found no similar feature requirement.

Use case

Get up-to-date GitHub data. We want to keep track of issue SLA (e.g. open to first answer time). We need to sync every hour.

Description

It seems that the current way DevLake syncs with GitHub is by iterating on all the objects and copying them into a MySQL database. In our case, it takes 6 hours to sync 2 years' worth of history. So we sync every 8 hours. This is too slow.

Could DevLake also ingest the GitHub Events API (https://docs.github.com/en/rest/activity/events) to collect events in real time and update the data models accordingly? This way we could have:

  1. Real-time data
  2. A daily sync that collects all the data to fix any potential skew from 1.
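The two-tier approach in the list above could be sketched roughly as follows. This is a minimal illustration, not DevLake's actual data model: `IssueStore`, `apply_event`, and `reconcile` are hypothetical names, and the event shape (`id`, `type`, `payload.issue`) is assumed to follow the `IssuesEvent` records returned by the GitHub Events API.

```python
from dataclasses import dataclass, field


@dataclass
class IssueStore:
    """Hypothetical in-memory stand-in for an issues table (not DevLake's schema)."""
    issues: dict = field(default_factory=dict)          # issue number -> row
    seen_event_ids: set = field(default_factory=set)    # for de-duplication

    def apply_event(self, event: dict) -> bool:
        """Apply one event incrementally; return True if a row was written."""
        if event["id"] in self.seen_event_ids:
            return False  # the same event may be fetched more than once
        self.seen_event_ids.add(event["id"])
        if event["type"] != "IssuesEvent":
            return False  # this sketch only handles issue events
        issue = event["payload"]["issue"]
        self.issues[issue["number"]] = {"state": issue["state"], "title": issue["title"]}
        return True

    def reconcile(self, full_snapshot: list) -> int:
        """Overwrite local rows from a full sync; return how many rows were corrected."""
        corrected = 0
        for issue in full_snapshot:
            row = {"state": issue["state"], "title": issue["title"]}
            if self.issues.get(issue["number"]) != row:
                self.issues[issue["number"]] = row
                corrected += 1
        return corrected
```

Here `apply_event` would run on every poll of the event feed, while `reconcile` would run once a day against the full collection to repair anything the event stream missed or delivered late.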

For example, this is how those tools work:

  • https://www.gharchive.org/ is real-time
  • https://ossinsight.io/blog/why-we-choose-tidb-to-support-ossinsight is real-time
  • https://docs.airbyte.com/integrations/sources/github#notes has 4 pure incremental streams (comments, commits, issues, and review comments)

Related issues

https://github.com/apache/incubator-devlake/pull/1253

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

oliviertassinari, Aug 19 '25 17:08

I’m not sure I understand why you’d need to sync every hour. If your goal is to measure SLA (e.g., the latency between an issue being opened and its first comment), you should already be able to calculate that directly from the data in DevLake’s database.

Or, if what you’re really looking for is a way to proactively prompt people to respond to new issues as they come in, then DevLake might not be the best fit for that use case.
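For reference, the SLA metric discussed here (latency between an issue being opened and its first comment) could be computed along these lines. This is only a sketch: the dictionary keys `author` and `created_at` are assumptions for illustration and do not reflect DevLake's actual table or column names.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2025-08-19T17:08:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def first_response_latency(issue: dict, comments: list) -> Optional[timedelta]:
    """Time from issue creation to the first comment by someone other than the author."""
    opened_at = parse_ts(issue["created_at"])
    replies = [
        parse_ts(c["created_at"])
        for c in comments
        if c["author"] != issue["author"]  # the author's own follow-ups do not count
    ]
    if not replies:
        return None  # still unanswered
    return min(replies) - opened_at


def breaches_sla(issue: dict, comments: list, sla: timedelta, now: datetime) -> bool:
    """True if the first response came later than `sla`, or is still missing after `sla`."""
    latency = first_response_latency(issue, comments)
    if latency is None:
        return now - parse_ts(issue["created_at"]) > sla
    return latency > sla
```

Note that the result is only as fresh as the last sync: an unanswered issue can only be flagged once it has actually landed in the database.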

klesh, Aug 22 '25 04:08

> you should already be able to calculate that directly from the data in DevLake’s database.

If the data takes 8 hours to land in DevLake and the SLA is 4 hours (our setup), this doesn't work.

> DevLake might not be the best fit for that use case.

Do you have a better tool in mind? But still, wouldn't it be better to use GitHub Events to sync the state?

I guess we could explore an Airbyte ETL for issues, syncing them into the same database as DevLake every 10 minutes.

oliviertassinari, Aug 22 '25 08:08

I agree with @klesh here, although I understand your pain point (I am experiencing a similar one with a repo that also takes around 8 hours per run).

That said, first, according to the GitHub docs:

> This API is not built to serve real-time use cases. Depending on the time of day, event latency can be anywhere from 30s to 6h.

Second, wouldn't webhooks be a solution here?
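To illustrate the webhook route, here is a minimal sketch of a receiver's core logic. It assumes a webhook configured with a shared secret, in which case GitHub signs each delivery with HMAC-SHA256 in the `X-Hub-Signature-256` header and names the event in `X-GitHub-Event`. The function names and the returned dictionary are hypothetical, and the HTTP server around them is left out.

```python
import hashlib
import hmac
import json
from typing import Optional


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 value against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, signature_header)


def handle_delivery(secret: bytes, body: bytes, signature_header: str,
                    event_name: str) -> Optional[dict]:
    """Return a small issue update for 'issues' deliveries, or None if ignored."""
    if not verify_signature(secret, body, signature_header):
        return None  # reject unsigned or tampered deliveries
    if event_name != "issues":
        return None  # this sketch only handles issue events
    payload = json.loads(body)
    issue = payload["issue"]
    return {"number": issue["number"], "action": payload["action"], "state": issue["state"]}
```

The returned update could then be written to the same tables a full sync populates, with the periodic full sync remaining as a safety net for missed deliveries.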

petkostas, Aug 23 '25 17:08

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot], Dec 12 '25 00:12