Git-Heat-Map
add `db-gen` program to create DBs 3x to 20x as fast (powered by `gitoxide`)
The `db-gen` program uses `gitoxide` to produce diffs in parallel, and despite wasting quite a bit of CPU due to less-than-stellar object access performance for diffs, it still manages to create a Linux kernel database in ~~35~~ 21 minutes (M1 Pro).
Tasks
- [x] wait for rename-tracking support in `gitoxide`
- [x] optional 'find-copies', implemented as `--find-copies-harder`
- [x] ordering of diff stats in consumer to get renames into the right order as well
- [x] upgrade to latest `gitoxide` version
- [x] fix consistency issue
I ran the version with optimized caches against cpython for the first time and it finished in 16s.
Wow thanks for the input, I was looking into using gitpython to try and generate this. I'm not certain that this can be done in parallel, as the file renaming/creating/deleting detection was very fiddly to get right. I haven't fully read your branch yet, does it account for renaming?
> Wow thanks for the input, I was looking into using gitpython to try and generate this.
Glad I came along to prevent this - GitPython isn't good, trust me, I know ;).
> I haven't fully read your branch yet, does it account for renaming?
Probably not, as it can't yet do rename tracking. I saw that the python script is relying on an orderly invocation of files, from first commit to last, and that's not done here either.
The good thing is that the order can be re-introduced by adding sequential ids to chunks, so that's absolutely solvable without losing parallelism. What's more concerning is that rename tracking isn't implemented in `gitoxide` yet, so simple rename tracking would have to be implemented here, which could then be backported (simple, as in the renamed file wasn't changed and has the same hash).
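To illustrate the sequential-id idea, here is a minimal sketch in Python (not the code in this PR). It assumes a hypothetical `diff_chunk` worker standing in for the real diff computation; chunks finish in arbitrary order and the consumer buffers them until the next expected id is available.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def diff_chunk(chunk_id, commits):
    # Hypothetical stand-in for the expensive diff work.
    # The chunk id travels with the result so the consumer can reorder.
    return chunk_id, [f"stats({c})" for c in commits]


def process_in_order(commits, chunk_size=2, workers=4):
    chunks = [commits[i:i + chunk_size] for i in range(0, len(commits), chunk_size)]
    pending = {}   # finished chunks that arrived before their turn
    next_id = 0    # id of the chunk the consumer needs next
    ordered = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(diff_chunk, i, chunk) for i, chunk in enumerate(chunks)]
        for future in as_completed(futures):
            chunk_id, result = future.result()
            pending[chunk_id] = result
            # Flush every chunk that is now contiguous with what was emitted.
            while next_id in pending:
                ordered.extend(pending.pop(next_id))
                next_id += 1
    return ordered


commits = ["c1", "c2", "c3", "c4", "c5"]
result = process_in_order(commits)
```

The workers stay fully parallel; only the consumer serializes, and it holds at most the chunks that finished ahead of a slower predecessor.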
I took another look at the renaming tracking problem and realized, to my surprise, that the default is to do rename tracking, and to consider 50% similar files for renames. Since we already know how to do diffs, this would just be another version of it, causing many more diffs to be created between the deleted and added files (to determine their similarity).
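A rough sketch of what such a two-pass rename detection could look like, written in Python for brevity. It is not how `git` or `gitoxide` implement it: `difflib.SequenceMatcher` is swapped in as a stand-in similarity measure, and `detect_renames` and its inputs are hypothetical names. Only the 50% threshold and the "identical content first" idea come from the discussion above.

```python
from difflib import SequenceMatcher


def detect_renames(deleted, added, threshold=0.5):
    """Map old paths to new paths; `deleted`/`added` map path -> content."""
    renames = {}
    remaining = dict(added)
    # Pass 1: identical content (the same hash in git terms) needs no diff.
    for old_path, old_content in deleted.items():
        for new_path, new_content in list(remaining.items()):
            if old_content == new_content:
                renames[old_path] = new_path
                del remaining[new_path]
                break
    # Pass 2: diff each leftover deleted file against the leftover added
    # files and keep the most similar one that reaches the threshold.
    for old_path, old_content in deleted.items():
        if old_path in renames:
            continue
        best_path, best_score = None, 0.0
        for new_path, new_content in remaining.items():
            score = SequenceMatcher(None, old_content, new_content).ratio()
            if score >= threshold and (best_path is None or score > best_score):
                best_path, best_score = new_path, score
        if best_path is not None:
            renames[old_path] = best_path
            del remaining[best_path]
    return renames


deleted = {"a.txt": "hello world\n", "b.py": "def f():\n    return 1\n"}
added = {"c.txt": "hello world\n", "d.py": "def f():\n    return 2\n", "e.md": "# Notes\n"}
renames = detect_renames(deleted, added)
```

Pass 2 is where the extra cost comes from: every remaining deleted file is compared against every remaining added file.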
If you don't mind, please feel free to keep this PR open even without rename tracking, and I will implement it in `gitoxide` and be back here to finish it up.
@jmforsythe Would you mind adding a license file to the repository? I am now working on rename tracking within `gitoxide` and am considering this program here as an example for 'how to use gitoxide to generate a DB of information from a git repository'. If your license was MIT or Apache, I would be able to do that easily, given that I copied the database definition verbatim. Thank you.
I haven't decided on a license yet. What specifically do you need it for? If it is just the database schema, then go ahead.
Thanks, that's exactly what I would have needed it for. Then I will feel free to use the DB schema as is and probably link to your comment somewhere in the example code for reference and attribution.
@jmforsythe Rename tracking has been implemented in `gitoxide` and luckily, it's just as fast as it was before. Copy tracking could also be activated without noticeable cost, but the database schema doesn't support that yet. Please note that the results of the rename tracking might occasionally differ, as `gitoxide` uses the first suitable candidate whereas `git` will try up to 4 candidates and use the best. Hence, `gitoxide` is currently less precise.
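The difference between the two selection strategies can be shown with a small Python sketch. The function names, paths, and similarity scores are made up for illustration, and neither function mirrors the actual `gitoxide` or `git` code; they only contrast "first suitable" with "best of a limited set".

```python
def first_suitable(scores, threshold=0.5):
    """Return the first candidate whose similarity reaches the threshold."""
    for path, score in scores:
        if score >= threshold:
            return path
    return None


def best_of_limited(scores, threshold=0.5, limit=4):
    """Look at up to `limit` suitable candidates and return the most similar."""
    suitable = [(path, score) for path, score in scores if score >= threshold][:limit]
    if not suitable:
        return None
    return max(suitable, key=lambda item: item[1])[0]


# Hypothetical similarity scores of one deleted file against added files.
candidates = [
    ("src/util_old.rs", 0.55),
    ("src/helpers.rs", 0.92),
    ("docs/readme.md", 0.10),
]
first = first_suitable(candidates)
best = best_of_limited(candidates)
```

With these scores the first strategy settles on `src/util_old.rs` while the second picks `src/helpers.rs`, which is the kind of rare divergence to expect when comparing databases.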
In any case, please let me know what you think.
Edit: it looks like rename tracking might violate a constraint, which seems to happen when building indices at the very end of the run - probably it wasn't finished yet as the expected runtime was 21 minutes. This is probably a sign that an investigation is needed here, updates will follow.
cargo build --release && rm linux*; /usr/bin/time -lp ./target/release/db-gen /Users/byron/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux -o 800
[..]
21:08:38 traverse commit graph done 1.1M commits in 13.55s (83.9k commits/s)
Error: UNIQUE constraint failed: commitFile.hash, commitFile.fileID
Caused by:
Error code 1555: A PRIMARY KEY constraint failed
real 1151.10
user 10680.99
sys 53.93
5487312896 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
362162 page reclaims
186714 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
100927 voluntary context switches
7134379 involuntary context switches
84275809400944 instructions retired
30851432674389 cycles elapsed
3668195840 peak memory footprint
Git-Heat-Map/db-gen ( faster-db-generation) [?] took 19m23s
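The failure in the log is SQLite rejecting a second row with the same `(hash, fileID)` pair. A minimal Python sketch that reproduces this class of error, assuming a hypothetical table with only the two columns named in the message (the real schema has more columns and is not reproduced here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal table: only the two columns from the error message,
# combined into a composite primary key.
conn.execute(
    "CREATE TABLE commitFile (hash TEXT, fileID INTEGER, PRIMARY KEY (hash, fileID))"
)
conn.execute("INSERT INTO commitFile VALUES ('abc123', 1)")

try:
    # The same (hash, fileID) pair a second time violates the primary key.
    conn.execute("INSERT INTO commitFile VALUES ('abc123', 1)")
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

row_count = conn.execute("SELECT COUNT(*) FROM commitFile").fetchone()[0]
print(error, row_count)
```

Any code path that emits the same file twice for one commit would trigger this, so the error points at the producer of the rows rather than at the database itself.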
Thanks for the patience - I believe the underlying issue was addressed, so this implementation will track renames as well. You probably want to validate the tool's output against the baseline as well, which is something I have never done. I'd expect it to be indistinguishable for the most part, yet I would be very interested to see a data-diff in case there are indeed differences, and how this looks in practice.
I'll try and write some tests soon to validate your generator.
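One possible shape for such a validation is a row-level data-diff between the database from the Python script and the one from `db-gen`. The sketch below is only an assumption about how that could be done: `data_diff` and `make_db` are hypothetical helpers, and the two-column `commitFile` table is a simplified stand-in for the real schema.

```python
import sqlite3


def table_rows(conn, table):
    # The table name is interpolated directly; only pass trusted names.
    return set(conn.execute(f"SELECT * FROM {table}"))


def data_diff(baseline, candidate, table):
    """Return the rows that exist in only one of the two databases."""
    base_rows = table_rows(baseline, table)
    cand_rows = table_rows(candidate, table)
    return {
        "only_in_baseline": sorted(base_rows - cand_rows),
        "only_in_candidate": sorted(cand_rows - base_rows),
    }


def make_db(rows):
    # In-memory stand-in for a generated database file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE commitFile (hash TEXT, fileID INTEGER)")
    conn.executemany("INSERT INTO commitFile VALUES (?, ?)", rows)
    return conn


baseline = make_db([("abc123", 1), ("abc123", 2), ("def456", 1)])
candidate = make_db([("abc123", 1), ("def456", 1), ("def456", 3)])
diff = data_diff(baseline, candidate, "commitFile")
```

For real runs the two connections would point at the generated database files instead of `:memory:`. If file ids are assigned in a different order by the two generators, the comparison would need to join on file paths rather than compare raw ids.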