Empty read from gitdb.OStream.read() before EOF

Open lordmauve opened this issue 8 months ago • 6 comments

I have code that relies on reading an object from a gitdb stream.

To do this I used a standard .read() loop (as one would with io.RawIOBase):

stream = db.stream(bytes.fromhex(sha))
while chunk := stream.read(4096):
    yield chunk

The behaviour I expected to see (from the duck-type with RawIOBase) is to only see b'' at EOF:

If 0 bytes are returned, and size was not 0, this indicates end of file.

However, stream.read(4096) can return an empty chunk before the end of the stream, so the loop exits early.

For the file where I first saw this, the behaviour depends on the size parameter: it appears to occur for 0 < size <= 4096.

Looking at the code there is a condition to repeat a read if we got insufficient bytes:

https://github.com/gitpython-developers/gitdb/blob/f36c0cc42ea2f529291e441073f74e920988d4d2/gitdb/stream.py#L316-L317

However, the leading "if dcompdat and" means that the condition does not apply when zero bytes were read. Removing this part of the condition fixes the issue (though I understand from the comment that it is there to support compressed_bytes_read()).
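
As a stopgap on the caller's side, the loop can be driven by the known object size instead of treating b'' as end of file. This is only a minimal sketch, not a fix for gitdb itself: iter_chunks, max_empty_reads and the EmptyOnceStream test double are hypothetical names, and it assumes that retrying after an empty read makes progress.

```python
import io


def iter_chunks(stream, total_size, chunk_size=4096, max_empty_reads=8):
    """Yield non-empty chunks until total_size bytes have been read.

    Hypothetical helper: an empty read before total_size is reached is
    retried instead of being treated as end of file.
    """
    remaining = total_size
    empty_reads = 0
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # Empty read before the expected size: retry a bounded number
            # of times so a truly truncated stream cannot loop forever.
            empty_reads += 1
            if empty_reads > max_empty_reads:
                raise EOFError(f"stream ended with {remaining} bytes missing")
            continue
        empty_reads = 0
        remaining -= len(chunk)
        yield chunk


class EmptyOnceStream:
    """Test double (not part of gitdb): returns b'' once before EOF."""

    def __init__(self, data, empty_at_call=2):
        self._buf = io.BytesIO(data)
        self._calls = 0
        self._empty_at_call = empty_at_call

    def read(self, size=-1):
        self._calls += 1
        if self._calls == self._empty_at_call:
            return b""
        return self._buf.read(size)


data = bytes(range(256)) * 40  # 10240 bytes of sample content

# The naive loop stops at the first empty chunk and loses data.
flaky = EmptyOnceStream(data)
naive = b"".join(iter(lambda: flaky.read(4096), b""))

# The size-driven loop skips the empty chunk and reads everything.
robust = b"".join(iter_chunks(EmptyOnceStream(data), len(data)))
```

With gitdb this would be called as iter_chunks(stream, stream.size), using the size attribute already relied on elsewhere in this thread.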

lordmauve avatar Apr 16 '25 10:04 lordmauve

If I print all chunk sizes with

stream = db.stream(bytes.fromhex(sha))
sz = 0
while sz < stream.size:
    chunk = stream.read(4096)
    print(len(chunk))
    sz += len(chunk)

there's a spread of sizes:

$ python bad.py | sort -n | uniq -c
      1 0
      1 533
      1 4071
      1 4073
      1 4075
      2 4080
      1 4081
      1 4082
      1 4086
      2 4087
      2 4089
      1 4090
      1 4092
      1 4093
      1 4095
  45840 4096

which seems to refute the idea expressed in this comment that it will recursively read() until the requested size is filled:

https://github.com/gitpython-developers/gitdb/blob/f36c0cc42ea2f529291e441073f74e920988d4d2/gitdb/stream.py#L310-L312

Removing "if dcompdat and" gives:

$ python bad.py | sort -n | uniq -c
      1 347
  45856 4096
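
The tally above can also be collected in Python rather than through sort and uniq. The following is a rough sketch: chunk_size_histogram and max_empty_reads are hypothetical names, and io.BytesIO stands in for the gitdb stream so the example runs on its own.

```python
import io
from collections import Counter


def chunk_size_histogram(stream, total_size, chunk_size=4096, max_empty_reads=100):
    """Count how often each chunk length is returned by stream.read()."""
    counts = Counter()
    seen = 0
    while seen < total_size:
        chunk = stream.read(chunk_size)
        counts[len(chunk)] += 1
        seen += len(chunk)
        # Guard against a stream that never delivers the remaining bytes.
        if counts[0] > max_empty_reads:
            break
    return counts


# Stand-in stream: 10000 bytes read in 4096-byte requests.
hist = chunk_size_histogram(io.BytesIO(b"x" * 10000), 10000)

# Print in the same "count size" layout as `uniq -c`.
for size, n in sorted(hist.items()):
    print(f"{n:7d} {size}")
```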

lordmauve avatar Apr 16 '25 10:04 lordmauve

Thanks for reporting!

I don't think, however, that the implementation can be trusted and it's better to use the git command wrappers provided in GitPython.

Getting a chunk of size 0 in the middle is certainly unexpected, but maybe if that's fixed it will be suitable for consumption nonetheless?

Byron avatar Apr 16 '25 12:04 Byron

it's better to use the git command wrappers provided in GitPython.

GitCmdObjectDB? I've found previously that git cat-file --batch is 16 times slower than gitdb, which is substantial when our monorepo is 200GB. But I do agree that the implementation can't be trusted - I tried using a gitdb instance in threads (but with independent reads) and saw corrupted data. I've also found the best interface to git is generally to wrap git commands. Direct object DB access is the one case where that isn't fast enough and for that I've been using gitdb cautiously.

For the current application I just need to re-hash previously unseen trees/blobs using a different hashing scheme to git, and it has been working OK. Maybe I should re-run the git checksums as well as a sanity check; it would probably still be faster than a git pipe, and then I could debug any issues I detect.
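
Such a sanity check could look roughly like the sketch below. For SHA-1 repositories, git computes an object ID as the SHA-1 of a header ("<type> <size>" followed by a NUL byte) plus the object content. git_object_sha1 is a hypothetical helper name, and the chunks argument is any iterable of bytes, such as the output of a read loop.

```python
import hashlib


def git_object_sha1(obj_type, size, chunks):
    """Return the hex SHA-1 that git assigns to an object.

    obj_type is bytes such as b"blob" or b"tree"; size is the content
    length in bytes; chunks is an iterable of bytes making up the content.
    """
    h = hashlib.sha1()
    # Header: "<type> <size>" followed by a NUL byte.
    h.update(obj_type + b" " + str(size).encode("ascii") + b"\x00")
    seen = 0
    for chunk in chunks:
        h.update(chunk)
        seen += len(chunk)
    if seen != size:
        # A short or long read means the data cannot match the object ID.
        raise ValueError(f"expected {size} bytes, got {seen}")
    return h.hexdigest()


# The well-known ID of the empty blob:
empty_id = git_object_sha1(b"blob", 0, [])
# empty_id == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

With gitdb, the result would be compared against the hex sha passed to db.stream(), assuming the stream exposes its type as bytes (for example b"blob") alongside its size; that attribute is an assumption here and should be checked against the gitdb version in use.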

lordmauve avatar Apr 18 '25 09:04 lordmauve

I see. In this case I'd recommend using pygit2 instead if it must be python, or go straight to Rust and gitoxide (or git2).

Byron avatar Apr 18 '25 09:04 Byron

Ah, maybe I should try pygit2. We had a terrible time with libgit2 in a different application and swore off it, but that may have been more about the bindings (node-git).

I am happy using Rust in CLI tools but our internal auth stack is not available in Rust, and the two services where we use/could use gitdb would not be cost-effective to rewrite in Rust.

lordmauve avatar Apr 18 '25 10:04 lordmauve

I was thinking more of a little CLI that performs a specific task, which the main application shells out to. Alternatively, one could do the same but generate bindings. Ultimately, if GitDB works (with some additional protections), then why not use it. But at 200GB it seemed like one would want to go native.

Byron avatar Apr 18 '25 11:04 Byron