[Feature][GitHub] Update models in real time with GitHub events
### Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
### Use case
Get up-to-date GitHub data. We want to track issue SLAs (e.g. time from open to first answer), so we need to sync every hour.
### Description
It seems that DevLake currently syncs with GitHub by iterating over all objects and copying them into a MySQL database. In our case, it takes 6 hours to sync 2 years' worth of history, so we sync every 8 hours. This is too slow.
Could DevLake also ingest the GitHub Events API (https://docs.github.com/en/rest/activity/events) to collect events in near real time and update the data models accordingly? This way we could have:
- Near-real-time data from the event stream
- A daily full sync that collects all the data, to fix any skew the event stream may introduce
For example, this is how those tools work:
- https://www.gharchive.org/ is real-time
- https://ossinsight.io/blog/why-we-choose-tidb-to-support-ossinsight is real-time
- https://docs.airbyte.com/integrations/sources/github#notes has 4 pure incremental streams (comments, commits, issues, and review comments)
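To make the idea concrete, here is a minimal sketch (in Python; the in-memory table and field handling are my own assumptions, based on the public `IssuesEvent` and `IssueCommentEvent` payload shapes) of how events collected from that API could be folded into an issue model to compute open-to-first-answer latency:

```python
# Hypothetical sketch: fold GitHub event payloads into an issue table
# so that open-to-first-answer SLA can be computed incrementally.
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # GitHub timestamps look like "2022-05-01T12:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def apply_event(issues: dict, event: dict) -> None:
    """Update an in-memory issue table from one Events API payload."""
    payload = event.get("payload", {})
    if event["type"] == "IssuesEvent" and payload.get("action") == "opened":
        num = payload["issue"]["number"]
        issues[num] = {
            "opened_at": parse_ts(payload["issue"]["created_at"]),
            "first_response_at": None,
        }
    elif event["type"] == "IssueCommentEvent" and payload.get("action") == "created":
        num = payload["issue"]["number"]
        rec = issues.get(num)
        if rec is not None and rec["first_response_at"] is None:
            rec["first_response_at"] = parse_ts(payload["comment"]["created_at"])
```

A collector would poll the events endpoint on a short interval and feed each payload through `apply_event`; the daily full sync would then reconcile anything the event stream missed.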
### Related issues
https://github.com/apache/incubator-devlake/pull/1253
### Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's Code of Conduct
I’m not sure I understand why you’d need to sync every hour. If your goal is to measure SLA (e.g., the latency between an issue being opened and its first comment), you should already be able to calculate that directly from the data in DevLake’s database.
Or, if what you’re really looking for is a way to proactively prompt people to respond to new issues as they come in, then DevLake might not be the best fit for that use case.
> you should already be able to calculate that directly from the data in DevLake’s database.
If the data takes 8 hours to land in DevLake and the SLA is 4 hours (as in our setup), this doesn't work.
> DevLake might not be the best fit for that use case.
Do you have a better tool in mind? Still, wouldn't it be better to use GitHub Events to sync the state?
I guess we could explore an Airbyte ETL for issues, syncing them into the same database as DevLake every 10 minutes.
I agree with @klesh here, although I understand your pain point (as I am experiencing a similar one with a repo that takes also around 8 hours per run).
Yet: first, according to the GitHub docs:
> This API is not built to serve real-time use cases. Depending on the time of day, event latency can be anywhere from 30s to 6h.
Secondly, wouldn't webhooks be a solution here?
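To illustrate the webhook route: GitHub signs each delivery with the secret configured on the repository, so a receiver only needs to validate the `X-Hub-Signature-256` header before applying the payload. A minimal sketch in Python (the secret value is of course a placeholder):

```python
# Hypothetical sketch: validating a GitHub webhook delivery before
# trusting its payload. The secret must match the one configured
# in the repository's webhook settings.
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header GitHub sends with each delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing attacks.
    return hmac.compare_digest(expected, signature_header)
```

A verified payload could then be applied to the models immediately, which sidesteps the Events API latency issue entirely, at the cost of requiring an endpoint reachable from GitHub.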
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.