Empty read from gitdb.OStream.read() before EOF
I have code that relies on reading an object from a gitdb stream. To do this I used a standard `.read()` loop (as with `io.RawIOBase`):
```python
stream = db.stream(bytes.fromhex(sha))
while chunk := stream.read(4096):
    yield chunk
```
The behaviour I expected (given the duck-typing with `io.RawIOBase`) is to only see `b''` at EOF:

> If 0 bytes are returned, and size was not 0, this indicates end of file.
However `stream.read(4096)` can return empty chunks even before the end of the stream, so the loop exits early. For the file where I first saw this, the behaviour is sensitive to the `size` parameter: it apparently occurs for `0 < size <= 4096`.
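For illustration, a small probe over different read sizes makes the threshold visible; a sketch assuming `db` and `sha` are set up as above:

```python
# Probe which read sizes produce an empty chunk before the full object is read.
for size in (512, 1024, 4096, 4097, 8192):
    stream = db.stream(bytes.fromhex(sha))
    read = 0
    early_empty = False
    while read < stream.size:
        chunk = stream.read(size)
        if not chunk:
            early_empty = True  # b'' before stream.size bytes were consumed
            break
        read += len(chunk)
    print(size, "empty mid-stream read" if early_empty else "ok")
```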
Looking at the code, there is a condition to repeat a read if we got insufficient bytes:
https://github.com/gitpython-developers/gitdb/blob/f36c0cc42ea2f529291e441073f74e920988d4d2/gitdb/stream.py#L316-L317
However, the leading `if dcompdat and` means that the condition doesn't apply if zero bytes were read. Removing this part of the condition addresses the issue (though I understand from the comment that it is there to support `compressed_bytes_read()`).
If I print all chunk sizes with
```python
stream = db.stream(bytes.fromhex(sha))
sz = 0
while sz < stream.size:
    chunk = stream.read(4096)  # read the next chunk
    print(len(chunk))
    sz += len(chunk)
```
there's a spread of sizes:
```
$ python bad.py | sort -n | uniq -c
      1 0
      1 533
      1 4071
      1 4073
      1 4075
      2 4080
      1 4081
      1 4082
      1 4086
      2 4087
      2 4089
      1 4090
      1 4092
      1 4093
      1 4095
  45840 4096
```
which seems to refute the idea expressed in this comment that it will recursively `read()` until the requested size is filled:
https://github.com/gitpython-developers/gitdb/blob/f36c0cc42ea2f529291e441073f74e920988d4d2/gitdb/stream.py#L310-L312
Removing `if dcompdat and`:
```
$ python bad.py | sort -n | uniq -c
      1 347
  45856 4096
```
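In the meantime, a caller-side workaround is to key the loop on the declared object size instead of treating `b''` as EOF. A minimal sketch, assuming the spurious empty chunks can simply be retried:

```python
def iter_chunks(db, sha_hex, bufsize=4096):
    """Yield the object's content without treating b'' as EOF."""
    stream = db.stream(bytes.fromhex(sha_hex))
    remaining = stream.size
    while remaining > 0:
        chunk = stream.read(min(bufsize, remaining))
        if chunk:  # retry on the spurious empty reads described above
            remaining -= len(chunk)
            yield chunk
```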
Thanks for reporting!
I don't think the implementation can be trusted, however; it's better to use the git command wrappers provided in GitPython.
Getting a chunk of size 0 in the middle is certainly unexpected, but maybe if that's fixed it will be suitable for consumption nonetheless?
> it's better to use the git command wrappers provided in GitPython.
`GitCmdObjectDB`? I've found previously that `git cat-file --batch` is 16 times slower than gitdb, which is substantial when our monorepo is 200GB. But I do agree that the implementation can't be trusted: I tried using a gitdb instance in threads (albeit with independent reads) and saw corrupted data. I've also found the best interface to git is generally to wrap git commands; direct object DB access is the one case where that isn't fast enough, and for that I've been using gitdb cautiously.
For the current application I just need to re-hash previously unseen trees/blobs using a different hashing scheme from git's, and it has been working OK. Maybe I should re-run the git checksums as well, as a sanity check; it would probably still be faster than a git pipe, and then I could debug any issues I detect.
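For reference, git's object hash is the SHA-1 of the header `b"<type> <size>\0"` followed by the raw content, so the checksum can be recomputed from the same stream. A minimal sketch, assuming a SHA-1 repository, that `stream.type` is the object-type name as bytes, and the size-keyed loop from above to sidestep the empty reads:

```python
import hashlib

def verify_git_sha(db, sha_hex, bufsize=4096):
    """Recompute git's object hash and compare it to the requested SHA."""
    stream = db.stream(bytes.fromhex(sha_hex))
    # git hashes b"<type> <size>\0" followed by the raw object content
    h = hashlib.sha1(b"%s %d\x00" % (stream.type, stream.size))
    remaining = stream.size
    while remaining > 0:
        chunk = stream.read(min(bufsize, remaining))
        if chunk:  # tolerate the spurious empty reads described above
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest() == sha_hex
```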
I see. In this case I'd recommend using pygit2 instead if it must be Python, or going straight to Rust and gitoxide (or git2).
Ah, maybe I should try pygit2. We also had a terrible time with libgit2 in a different application and swore off it, but that may have been more about the bindings (node-git).
I am happy using Rust in CLI tools, but our internal auth stack is not available in Rust, and the two services where we use/could use gitdb would not be cost-effective to rewrite in Rust.
I was thinking more in the direction of a little CLI that performs a specific task, which the main application shells out to. Alternatively, one could do the same but generate bindings. Ultimately, if GitDB works (with some additional protections), then why not use it. But 200GB seems like a scale where one would want to go native.
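The shell-out pattern itself is simple; a sketch from the Python side, with a hypothetical native helper named `rehash-objects` that prints one digest per input SHA:

```python
import subprocess

def rehash(sha_hex):
    """Delegate the hot path to a hypothetical native CLI, 'rehash-objects'."""
    result = subprocess.run(
        ["rehash-objects", sha_hex],  # assumed helper, not a real tool
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```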