pdbparse
Fix DBIStream: the true number of NameRef entries is the sum of cRefCnt
Hi and thanks for the great library!
I found that when I try to parse the PDB for combase.dll with GUID 6c146f310d333559974d1d5d3fa2e4da1, it fails to decode some strings contained in the DBI stream structures:
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 554, in parse
    return PDB7(f, fast_load)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 521, in __init__
    self.read_root(self.root_stream)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 460, in read_root
    pdb_cls(
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 154, in __init__
    self.load()
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/__init__.py", line 276, in load
    debug = dbi.parse_stream(self.stream_file)
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/pdbparse/dbi.py", line 160, in parse_stream
    Name = ("Name" / CString(encoding = "utf8")).parse(Names[NameRef[j]:])
  ...
  File "/opt/venvs/drakrun/lib/python3.8/site-packages/construct/core.py", line 1490, in _decode
    return obj.decode(self.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 0: invalid start byte
The reason is that cRefCnt is not the correct number of names when the true number exceeds 64K (the field is only 16 bits wide). This behavior is documented here: https://llvm.org/docs/PDB/DbiStream.html#file-info-substream
NumSourceFiles: In theory this is supposed to contain the number of source files for which this substream contains information. But that would present a problem in that the width of this field being 16-bits would prevent one from having more than 64K source files in a program. In early versions of the file format, this seems to have been the case. In order to support more than this, this field of the header is simply ignored, and computed dynamically by summing up the values of the ModFileCounts array (discussed below). In short, this value should be ignored.
FileNameOffsets - An array of NumSourceFiles integers (where NumSourceFiles here refers to the 32-bit value obtained from summing ModFileCountArray), where each integer is an offset into NamesBuffer pointing to a null terminated string.
After the fix, combase.pdb is parsed correctly.
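To illustrate the idea (this is a minimal sketch, not the actual patch), here is how the File Info substream can be read when the 16-bit count in the header is ignored. It uses only the standard `struct` module and the field names from the LLVM documentation quoted above; `parse_file_info` is a hypothetical helper that does not exist in pdbparse, and the layout (two 16-bit header fields, two 16-bit arrays, one 32-bit offset array, then the names buffer) is my reading of that documentation.

```python
import struct


def parse_file_info(data: bytes):
    """Return the source file names grouped per module.

    Hypothetical helper: the header's 16-bit NumSourceFiles is read but
    never used; the real count is the sum of ModFileCounts.
    """
    num_modules, _num_source_files = struct.unpack_from("<HH", data, 0)
    pos = 4

    # ModIndices: one 16-bit entry per module (not needed for the names).
    pos += 2 * num_modules

    # ModFileCounts: number of source files contributed by each module.
    mod_file_counts = struct.unpack_from(f"<{num_modules}H", data, pos)
    pos += 2 * num_modules

    # The true number of file name offsets is the sum of the per-module counts.
    total_files = sum(mod_file_counts)
    offsets = struct.unpack_from(f"<{total_files}I", data, pos)
    pos += 4 * total_files

    names_buffer = data[pos:]

    def name_at(offset: int) -> str:
        # Each offset points at a null-terminated string in NamesBuffer.
        end = names_buffer.index(b"\x00", offset)
        return names_buffer[offset:end].decode("utf-8")

    files_per_module = []
    next_offset = iter(offsets)
    for count in mod_file_counts:
        files_per_module.append([name_at(next(next_offset)) for _ in range(count)])
    return files_per_module


# Tiny synthetic substream: two modules with 2 and 1 files. The header
# count is deliberately wrong (1) to show that it is not relied upon.
blob = struct.pack("<HH", 2, 1)        # NumModules, NumSourceFiles (ignored)
blob += struct.pack("<2H", 0, 2)       # ModIndices
blob += struct.pack("<2H", 2, 1)       # ModFileCounts
blob += struct.pack("<3I", 0, 4, 8)    # FileNameOffsets
blob += b"a.c\x00b.h\x00c.c\x00"       # NamesBuffer
result = parse_file_info(blob)
print(result)
```

Because the offsets array is sized from the summed counts, the names buffer starts at the right position even when the header value is stale or too small to hold the real count.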
By the way, I temporarily merged your library code into https://github.com/CERT-Polska/drakpdb, as you haven't made any releases for a long time and I can't pin to a Git commit if I want to publish a dependent package on PyPI.
I have to say that I really like the simplicity of your library and the fact that it doesn't give up when a new, unknown structure or leaf type is reached. I have tested a few libraries on current Windows PDBs, and pdbparse is the only one so far that is able to deliver basic information about exports and simple structures. I have tried other solutions like:
- llvm-pdbutil that segfaults on llvm-pdbutil pdb2yaml --all combase_6c146f310d333559974d1d5d3fa2e4da1.pdb and that's not the only problem with it, as we can see in the issues: https://github.com/llvm/llvm-project/issues?q=is%3Aissue+is%3Aopen+pdbutil+
- https://github.com/MolecularMatters/raw_pdb example parsers that also segfault on more complicated PDBs
- volatility3 pdbconv.py that gives up on unknown leaf types: https://github.com/volatilityfoundation/volatility3/issues/182
So I hope you're still interested in maintaining this library and I think I will be coming back with patches from time to time. Cheers!