Querying diffs is very slow on moderately large repositories
Describe the bug
Queries on diffs for even moderately large repositories are incredibly slow. Our repository at work has ~5,500 commits.
The following operation to get the diff with the most deletions took ~30 minutes:
❯ time .cargo/bin/gitql --query 'select * from diffs order by deletions desc limit 1'
╭──────────────────────────────────────────┬───────────────────┬───────────────────────┬────────────┬───────────┬───────────────┬─────────────────────────┬───────────────────────────────────╮
│ commit_id ┆ name ┆ email ┆ insertions ┆ deletions ┆ files_changed ┆ datetime ┆ repo │
╞══════════════════════════════════════════╪═══════════════════╪═══════════════════════╪════════════╪═══════════╪═══════════════╪═════════════════════════╪═══════════════════════════════════╡
│ 8b685201464c3027afe9105bb5ed9b40a1befce7 ┆ Matthew Planchard ┆ [email protected] ┆ 3284 ┆ 41552 ┆ 212 ┆ 2024-08-15 18:15:45.000 ┆ /home/matthew/s/spec/.git │
╰──────────────────────────────────────────┴───────────────────┴───────────────────────┴────────────┴───────────┴───────────────┴─────────────────────────┴───────────────────────────────────╯
________________________________________________________
Executed in 27.37 mins fish external
usr time 27.25 mins 569.00 micros 27.25 mins
sys time 0.04 mins 0.00 micros 0.04 mins
During the entire time, a single thread was pretty much pegged. I can get this same result using git and awk in a fraction (1/270th, 0.37%) of the time:
❯ time git log --pretty="@%h" --shortstat | tr "\n" " " | tr "@" "\n" | awk '{if ($7 > deletions) { deletions = $7; commit = $1 }}; END { print commit; print deletions }'
8b6852014
41720
________________________________________________________
Executed in 6.01 secs fish external
usr time 5.41 secs 0.00 millis 5.41 secs
sys time 0.63 secs 1.78 millis 0.63 secs
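The awk one-liner above reads the deletion count from a fixed field position, and git omits the insertions or deletions part of a `--shortstat` line when that count is zero, so a commit with only deletions would be skipped. As a rough sketch of the same baseline that does not depend on field position (the function names `max_deletions` and `run_git_log` are made up for illustration, and `git` is assumed to be on `PATH`):

```python
import re
import subprocess

# Matches the optional "N deletion(s)(-)" part of a --shortstat summary line.
DELETIONS_RE = re.compile(r"(\d+) deletions?\(-\)")


def max_deletions(log_text):
    """Return (short_hash, deletions) for the commit with the most deletions.

    Expects the output of: git log --pretty=@%h --shortstat
    Returns (None, 0) if no commit reports any deletions.
    """
    best = (None, 0)
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            # A new commit header; remember its abbreviated hash.
            current = line[1:]
        elif current is not None:
            match = DELETIONS_RE.search(line)
            if match and int(match.group(1)) > best[1]:
                best = (current, int(match.group(1)))
    return best


def run_git_log(repo_path="."):
    # Assumption: repo_path points at a git working tree and git is installed.
    return subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=@%h", "--shortstat"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
```

Calling `max_deletions(run_git_log())` from inside a repository should give the same kind of answer as the pipeline above, with the parsing step kept separate so it can be checked without a repository.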
Queries on commits seem to run in a more reasonable amount of time, e.g.:
❯ time .cargo/bin/gitql --query "select count(author_name) from commits where author_name like '%matthew%'"
╭──────────╮
│ column_2 │
╞══════════╡
│ 1001 │
╰──────────╯
________________________________________________________
Executed in 357.45 millis fish external
usr time 351.94 millis 0.00 micros 351.94 millis
sys time 4.62 millis 641.00 micros 3.98 millis
To Reproduce
- Check out any large repo
- Run the example command above
Expected behavior
Speed is at least within an order of magnitude of git/awk
GQL (please complete the following information):
GitQL version 0.28.0
Hello @mplanchard,
I totally agree that the diffs table should be faster. This can be addressed in several ways:
- Further optimising the diff provider code.
- Finishing the logical plan and planner.
- Supporting diff calculation across multiple threads.
For now, though, I plan to work step by step on general optimisations and broader feature coverage before moving on to optimising specific cases.
Once those features are in place, I think we will be able to run more customisable and faster queries.
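To illustrate the multi-threaded diff idea from the list above, here is a minimal sketch that fans per-commit stat collection out over a thread pool. It is not GitQL's code or design; it shells out to `git show --shortstat`, and the names `git_shortstat` and `parallel_diff_stats` and the default of 4 workers are assumptions made for the example.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Captures "N insertion(s)(+)" and "N deletion(s)(-)" from a --shortstat line.
STAT_RE = re.compile(r"(\d+) (insertion|deletion)s?\(")


def git_shortstat(commit_id, repo_path="."):
    """Return (insertions, deletions) for one commit by shelling out to git."""
    out = subprocess.run(
        ["git", "-C", repo_path, "show", "--shortstat", "--format=", commit_id],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    counts = {"insertion": 0, "deletion": 0}
    for number, kind in STAT_RE.findall(out):
        counts[kind] = int(number)
    return counts["insertion"], counts["deletion"]


def parallel_diff_stats(commit_ids, stat_fn=git_shortstat, workers=4):
    """Apply stat_fn to every commit on a thread pool.

    Returns a dict mapping commit id -> stat_fn(commit id). pool.map yields
    results in input order, so zipping with the ids is safe.
    """
    commit_ids = list(commit_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(commit_ids, pool.map(stat_fn, commit_ids)))
```

Because each commit's stats are independent of the others, the work splits cleanly per commit. Whether this helps in practice depends on where the time is actually spent, which the thread does not establish.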
Thank you, Amr
GitQL 0.34.0 is now 50% faster, with more functionality for diff content:
https://github.com/AmrDeveloper/GQL/releases/tag/0.34.0