GitPython fails to read data_stream from a Diff object in the Python REPL

GitPython version: 2.1.5
Python version: 3.6.0

Steps to reproduce:

I have the following python script:

import git

repo = git.Repo('/tmp/gittest')

commit1 = repo.commit('master')
commit2 = repo.commit('master^')

diffs = commit1.diff(commit2)
diff = diffs[0]

diff.b_blob.data_stream
diff.b_blob.data_stream.read()

If I save it into okay.py and execute python okay.py, everything's fine.

However, if I copy the script and paste it into the Python REPL, an exception occurs:

root@jacky:source# python
Python 3.6.0 (default, Jan 16 2017, 12:12:55)
[GCC 6.3.1 20170109] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import git
>>>
>>> repo = git.Repo('/tmp/gittest')
>>>
>>> commit1 = repo.commit('master')
>>> commit2 = repo.commit('master^')
>>>
>>> diffs = commit1.diff(commit2)
>>> diff = diffs[0]
>>>
>>> diff.b_blob.data_stream
(b'\x9cY\xe2K\x83\x93\x17\x9a]q-\xe4\xf9\x90\x17\x8d\xf5sM\x99', b'blob', 6, <git.cmd.Git.CatFileContentStream object at 0x7faf444c3e48>)
>>> diff.b_blob.data_stream.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/site-packages/git/objects/base.py", line 112, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/usr/lib/python3.6/site-packages/git/db.py", line 42, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(sha))
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 957, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 929, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/usr/lib/python3.6/site-packages/git/cmd.py", line 893, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'first' could not be resolved, git returned: b'first'

Why the inconsistency?

Jul 13 '17 06:07 johnlinp

Thanks for the report, I was able to reproduce the issue.

Sep 28 '17 14:09 Byron

I have the same issue. It looks like the same persistent cat-file command is used twice, but the first time only the first line of output is read. Then, on the second call, stdout still contains the rest of the output from the first invocation, and that is read instead of the summary line it tries to parse.
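To illustrate the mechanism described above, here is a minimal simulation that does not use GitPython at all. It assumes the reply format of `git cat-file --batch` (a `<sha> <type> <size>` header line, the contents, then a newline). The helper names `batch_response` and `parse_header`, the placeholder SHA, and the blob contents `first\n` (guessed from the 6-byte size and the error message in the traceback) are all assumptions made for the sketch.

```python
import io

def batch_response(sha: str, data: bytes) -> bytes:
    """Format one reply as assumed for `git cat-file --batch`: header, body, newline."""
    return f"{sha} blob {len(data)}\n".encode() + data + b"\n"

def parse_header(stdout):
    """Read one line from the pipe and parse it as '<sha> <type> <size>'."""
    line = stdout.readline()
    tokens = line.split()
    if len(tokens) != 3:
        first = tokens[0] if tokens else b""
        raise ValueError(
            f"SHA {first!r} could not be resolved, git returned: {line.strip()!r}"
        )
    return tokens[0].decode(), tokens[1].decode(), int(tokens[2])

sha = "a" * 40  # placeholder SHA, not the one from the report

# Broken order: the same object is requested twice, but the body of the
# first reply is never consumed before the second header is parsed.
pipe = io.BytesIO(batch_response(sha, b"first\n") * 2)
parse_header(pipe)  # first header parses fine
try:
    parse_header(pipe)  # reads the leftover body line instead of a header
except ValueError as exc:
    error = str(exc)

# Correct order: drain the body (size bytes plus the trailing newline)
# before asking for the next header.
pipe = io.BytesIO(batch_response(sha, b"first\n") * 2)
_, _, size = parse_header(pipe)
body = pipe.read(size + 1)[:-1]
second = parse_header(pipe)
```

In the broken case the leftover body line is parsed as if it were a header, which produces the same kind of message as in the traceback from the report.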

Apr 16 '20 10:04 SpoonMeiser

The data_stream property contains the note:

:note: returned streams must be read in order

Which makes for a really awkward interface because it temporally couples the calling code, but at least it gives a hint at how this can be worked around. Maybe the easiest initial fix would be to just make this clear in the documentation somewhere.
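One way to follow the "read in order" note is to never leave a stream half-read: fetch the stream and drain it in the same step, and keep only the resulting bytes. The sketch below shows that pattern with a hypothetical helper, `read_blob_bytes`, and a stand-in `FakeBlob` class so that it runs without GitPython; with a real repository the argument would be something like `diff.b_blob`. As a possible explanation for the script/REPL difference (an inference, not confirmed in this thread): the interactive interpreter stores the result of a bare expression in `_`, so the unread stream from `diff.b_blob.data_stream` may still be alive when the next line requests a new one.

```python
import io

def read_blob_bytes(blob) -> bytes:
    """Return the full contents of a blob-like object in one step.

    Assumes `blob.data_stream` returns a fresh file-like object on each
    access, as the traceback in the report suggests for GitPython blobs.
    """
    stream = blob.data_stream  # opens a new stream for this object
    return stream.read()       # drain it before anything else is requested

class FakeBlob:
    """Stand-in for a GitPython blob, used only to make the sketch runnable."""

    def __init__(self, data: bytes):
        self._data = data

    @property
    def data_stream(self):
        return io.BytesIO(self._data)

content = read_blob_bytes(FakeBlob(b"first\n"))
```

Keeping only `content` around, rather than the stream object itself, avoids having two streams from the same repository open at once.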

We've only run into this after upgrading GitPython, so there is an old (maybe really old) version that didn't have this issue.

Apr 16 '20 11:04 SpoonMeiser

Yes, I absolutely agree. It's an implementation detail of the underlying git object database which leaks into the API, and it's a trap that will leave everyone puzzled as to why it happens.

Even though I am responsible for this awkwardness and thus should know, it wasn't obvious to me either.

Another workaround might be to use the GitDB type when instantiating the git repository, as it is a pure-python implementation that accesses data directly. It's slower, and definitely not suited for server processes due to file handles not being released automatically.

Repo('.', odbt=git.db.GitDB)

Apr 23 '20 01:04 Byron

Hi, I am using the latest GitPython 3.1.18 and git version 2.30.0 to mine merge scenarios from a repo, and I still hit this error.

Here is the code: https://github.com/Symbolk/MergeScenarioMiner/blob/4af16a6bf893301be27a352c24409b3c5612bae0/main.py#L167

I tried the workaround, but it did not solve the problem. Instead, it reported:

c1c45b46a36e9725f9741cce25732c69536be075
Ready to process repo: realm-java at branch: master
Traceback (most recent call last):
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 290, in <module>
    git_service.collect_from_commits(['00c9dd117b4b3279c4f48238948005994c90a491'])
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 216, in collect_from_commits
    conflict_file_paths, num_conflicts_per_file = self.collect_merge_scenrios(merge_commit, unmerged_blobs,
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/main.py", line 166, in collect_merge_scenrios
    base_content = blob.data_stream.read()
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/git/objects/base.py", line 131, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/gitdb/db/base.py", line 208, in stream
    return self._db_query(sha).stream(sha)
  File "/Users/symbolk/coding/dev/MergeScenarioMiner/venv/lib/python3.9/site-packages/gitdb/db/base.py", line 192, in _db_query
    raise BadObject(sha)
gitdb.exc.BadObject: BadObject: b'8517ee7f4378fe0f54945b3e4973766ff65e455d'

Do you think it is a problem with Git or with GitPython?

Jul 19 '21 10:07 Symbolk

It's probably a GitPython issue, since the pure-Python object database implementation is by now unlikely to be complete. It might therefore not see objects that are there, and independently of that, it definitely won't see objects that have since been created.

The only correct implementation is the default one as it uses git itself, but it will require the caller to be careful about object references. Depending on what should be accomplished, maybe using libgit2 for python will be a better choice.

Jul 19 '21 15:07 Byron