pdbparse Fix DBIStream: true number of NameRef is in the sum of cRefCnt

Open psrok1 opened this issue 1 year ago • 1 comments

Hi and thanks for the great library!

I found that when I try to parse the PDB for combase.dll with GUID 6c146f310d333559974d1d5d3fa2e4da1, pdbparse fails to decode some strings contained in the DBI stream structures.

  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 554, in parse
    return PDB7(f, fast_load)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 521, in __init__
    self.read_root(self.root_stream)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 460, in read_root
    pdb_cls(
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 154, in __init__
    self.load()
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 276, in load
    debug = dbi.parse_stream(self.stream_file)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/dbi.py", line 160, in parse_stream
    Name = ("Name" / CString(encoding = "utf8")).parse(Names[NameRef[j]:])
  ...
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/construct/core.py", line 1490, in _decode
    return obj.decode(self.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 0: invalid start byte

The reason is that the 16-bit count field holds an incorrect number of names when the true number exceeds 64K (the field is only 16 bits wide); the true number of NameRef entries is the sum of cRefCnt. This behavior is documented here: https://llvm.org/docs/PDB/DbiStream.html#file-info-substream

NumSourceFiles: In theory this is supposed to contain the number of source files for which this substream contains information. But that would present a problem in that the width of this field being 16-bits would prevent one from having more than 64K source files in a program. In early versions of the file format, this seems to have been the case. In order to support more than this, this field of the is simply ignored, and computed dynamically by summing up the values of the ModFileCounts array (discussed below). In short, this value should be ignored.

FileNameOffsets - An array of NumSourceFiles integers (where NumSourceFiles here refers to the 32-bit value obtained from summing ModFileCountArray), where each integer is an offset into NamesBuffer pointing to a null terminated string.

After the fix, combase.pdb is parsed correctly.
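
For illustration, here is a minimal standalone sketch of the counting rule from the quoted documentation. It is not the actual pdbparse patch: `parse_file_info` and the synthetic `sample` buffer are hypothetical, the field layout is assumed from the LLVM description of the File Info substream, and module boundaries are derived from a running sum of the per-module counts instead of the 16-bit index array (an assumption of this sketch).

```python
import struct


def parse_file_info(data: bytes):
    """Return one list of source file names per module (hypothetical helper)."""
    # Header: NumModules (uint16) and NumSourceFiles (uint16).
    # The second value is read but deliberately ignored.
    num_modules, _num_source_files = struct.unpack_from("<HH", data, 0)
    pos = 4

    # ModIndices: one uint16 per module, skipped in this sketch.
    pos += 2 * num_modules

    # ModFileCounts: one uint16 per module.
    mod_file_counts = struct.unpack_from(f"<{num_modules}H", data, pos)
    pos += 2 * num_modules

    # The true number of file name offsets is the sum of the per-module counts.
    total_files = sum(mod_file_counts)
    offsets = struct.unpack_from(f"<{total_files}I", data, pos)
    pos += 4 * total_files

    # Everything that remains is the buffer of null-terminated names.
    names = data[pos:]

    modules = []
    start = 0
    for count in mod_file_counts:
        files = []
        for off in offsets[start:start + count]:
            end = names.index(b"\0", off)
            files.append(names[off:end].decode("utf-8", errors="replace"))
        modules.append(files)
        start += count
    return modules


# Synthetic example: two modules with 2 and 1 files. The header count is
# deliberately wrong (1) to show that it is not consulted.
sample = (
    struct.pack("<HH", 2, 1)          # NumModules, NumSourceFiles (bogus)
    + struct.pack("<2H", 0, 2)        # ModIndices (unused here)
    + struct.pack("<2H", 2, 1)        # ModFileCounts
    + struct.pack("<3I", 0, 4, 8)     # FileNameOffsets
    + b"a.c\0b.h\0c.c\0"              # NamesBuffer
)
parsed = parse_file_info(sample)
print(parsed)  # [['a.c', 'b.h'], ['c.c']]
```

If the offsets array were sized from the bogus header count instead, the names buffer would begin too early and the offsets would point at unrelated bytes, which is consistent with the decode error in the traceback.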

Jul 30 '24 16:07 psrok1

By the way, I temporarily merged your library code into https://github.com/CERT-Polska/drakpdb, as you haven't made any releases for a long time and I can't pin to a Git commit if I want to publish a dependent package on PyPI.

I have to say that I really like the simplicity of your library and the fact that it doesn't give up when it reaches a new, unknown structure or leaf type. I have tested a few libraries on current Windows PDBs, and pdbparse is the only one so far that is able to deliver basic information about exports and simple structures. I have tried other solutions, such as:

llvm-pdbutil that segfaults on llvm-pdbutil pdb2yaml --all combase_6c146f310d333559974d1d5d3fa2e4da1.pdb and that's not the only problem with it as we can see in issues: https://github.com/llvm/llvm-project/issues?q=is%3Aissue+is%3Aopen+pdbutil+
https://github.com/MolecularMatters/raw_pdb example parsers that also segfault on more complicated PDBs
volatility3 pdbconv.py that gives up on unknown leaf types: https://github.com/volatilityfoundation/volatility3/issues/182

So I hope you're still interested in maintaining this library and I think I will be coming back with patches from time to time. Cheers!

Jul 30 '24 17:07 psrok1
