Implement LDIF Output Mode
This pull request adds a new LDIF output mode which converts the AD snapshot into (mostly) equivalent ldapsearch output for further processing with the latest BOFHound version.
The intention is to use LDIF as a common input source for BOFHound where all BloodHound related parsing improvements can be shared in a central place, independent of the tool used to gather the information (AD Explorer, ldapsearch BOF, pyldapsearch, ADWS, ...).
This should implement #23 and indirectly "fix" issues like #21, #52 and others.
For testing, I ran an (objectGUID=*) query over all relevant naming contexts:
LDAP_BASE="DC=ludus,DC=domain"
for dn in "$LDAP_BASE" "CN=Configuration,$LDAP_BASE" "CN=Schema,CN=Configuration,$LDAP_BASE"; do
pyldapsearch ludus.domain/domainadmin:password "(objectGUID=*)" -base-dn "$dn" -output "pyldapsearch.$dn.log" -ldaps
done
Combined all results into one file:
$ cat pyldapsearch*.log > combined.log
Sorted the objects in the output by their respective distinguishedName using the following script:
import sys
import re

def read_objects(filename):
    # Split the log file into individual objects, using the dashed separator
    # lines placed between records.
    with open(filename, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    separator = None
    objects = []
    current_object = []
    for line in lines:
        if re.match(r'^\s*[-]{2,}\s*$', line):
            if separator is None:
                separator = line.rstrip('\n')  # capture the first separator
            if current_object:
                objects.append("".join(current_object).strip())
                current_object = []
        else:
            current_object.append(line)
    if current_object:
        objects.append("".join(current_object).strip())
    return separator, objects

def get_distinguished_name(object):
    # Sort key: the value of the distinguishedName attribute, if present.
    match = re.search(r'^distinguishedName:\s*(.+)', object, re.MULTILINE)
    return match.group(1) if match else ""

def main():
    if len(sys.argv) != 2:
        print("Usage: python normalize.py <filename>")
        sys.exit(1)
    filename = sys.argv[1]
    separator, objects = read_objects(filename)
    sorted_objects = sorted(objects, key=get_distinguished_name)
    for object in sorted_objects:
        print(separator)
        print(object)

if __name__ == "__main__":
    main()
Like this:
$ python3 normalize.py combined.log > pyldapsearch.log
Converted the AD snapshot to LDIF:
$ ADExplorerSnapshot.py -m LDIF ludus.dat
[*] Server: DC01.ludus.domain
[*] Time of snapshot: 2025-04-24T09:11:48
[*] Mapping offset: 0x2a127a
[*] Object count: 3742
[+] Parsing properties: 1499
[+] Parsing classes: 269
[+] Parsing object offsets: 3742
[+] Collecting data: dumped 3742 objects
[+] Output written to DC01.ludus.domain_1745478708_objects.ldif
Sorted this output too:
$ python3 normalize.py DC01.ludus.domain_1745478708_objects.ldif > adexplorer.log
Compared the results:
$ diff -u adexplorer.log pyldapsearch.log
Filtered for changes present in the LDAP query output, but not in the snapshot:
$ diff -u adexplorer.log pyldapsearch.log | grep '^\+' -B2 -A2 | less
Observed that the differences are mostly due to changes happening between taking the snapshot and performing the LDAP queries.
Extracted all field names which differ in the LDAP results:
$ diff -u adexplorer.log pyldapsearch.log | grep -a '^\+' | cut -d: -f1 | sort -u
+accountExpires
+creationTime
+dnsRecord
+dSCorePropagationData
+lastLogon
+lastLogonTimestamp
+lastSetTime
+logonCount
+msDS-HasInstantiatedNCs
+otherWellKnownObjects
+priorSetTime
+pwdLastSet
+rIDAllocationPool
+rIDAvailablePool
+rIDPreviousAllocationPool
+uSNChanged
+wellKnownObjects
+whenChanged
Concluded from manual analysis that this is mostly expected and should be good enough for further processing with BOFHound.
Parsed the AD Explorer-originated LDIF output file with BOFHound:
$ bofhound -i adexplorer.log -o adexplorer
_____________________________ __ __ ______ __ __ __ __ _______
| _ / / __ / | ____/| | | | / __ \ | | | | | \ | | | \
| |_) | | | | | | |__ | |__| | | | | | | | | | | \| | | .--. |
| _ < | | | | | __| | __ | | | | | | | | | | . ` | | | | |
| |_) | | `--' | | | | | | | | `--' | | `--' | | |\ | | '--' |
|______/ \______/ |__| |__| |___\_\________\_\________\|__| \___\|_________\
<< @coffeegist | @Tw1sm >>
[10:48:53] INFO Parsed 3741 LDAP objects from 1 log files
[10:48:53] INFO Parsed 0 local group/session objects from 1 log files
[10:48:53] INFO Sorting parsed objects by type...
[10:48:53] INFO Parsed 17 Users
[10:48:53] INFO Parsed 56 Groups
[10:48:53] INFO Parsed 3 Computers
[10:48:53] INFO Parsed 1 Domains
[10:48:53] INFO Parsed 0 Trust Accounts
[10:48:53] INFO Parsed 2 OUs
[10:48:53] INFO Parsed 211 Containers
[10:48:53] INFO Parsed 6 GPOs
[10:48:53] INFO Parsed 1 Enterprise CAs
[10:48:53] INFO Parsed 1 AIA CAs
[10:48:53] INFO Parsed 1 Root CAs
[10:48:53] INFO Parsed 1 NTAuth Stores
[10:48:53] INFO Parsed 5 Issuance Policies
[10:48:53] INFO Parsed 41 Cert Templates
[10:48:53] INFO Parsed 1768 Schemas
[10:48:53] INFO Parsed 1 Referrals
[10:48:53] INFO Parsed 1503 Unknown Objects
[10:48:53] INFO Parsed 0 Sessions
[10:48:53] INFO Parsed 0 Privileged Sessions
[10:48:53] INFO Parsed 0 Registry Sessions
[10:48:53] INFO Parsed 0 Local Group Memberships
[10:48:53] INFO Parsed 2340 ACL relationships
[10:48:53] INFO Created default users
[10:48:53] INFO Created default groups
[10:48:53] INFO Resolved group memberships
[10:48:53] INFO Resolved delegation relationships
[10:48:53] INFO Resolved OU memberships
[10:48:53] INFO Linked GPOs to OUs
[10:48:53] INFO Built CA certificate chains
[10:48:53] INFO Resolved enabled templates per CA
[10:48:53] INFO JSON files written to adexplorer
Compared to the built-in BloodHound output mode, you get valid ADCS, GPO, OU and container objects for free.
Thanks for implementing this! Is this a fully-compatible LDIF format or does bofhound have its own format? (E.g. I'm not sure the normal LDIF format has dashes in between lines: https://github.com/c3c/ADExplorerSnapshot.py/pull/69/files#diff-921044e0048a35f62ca97595f22e7cbdd6c1ce05b481b618f024c4bdac1cf32fR197)
Thanks for taking a look.
You are correct, it is not standard compliant according to RFC 2849. Therefore, calling it LDIF is not entirely correct. My goal was basically to get improved BloodHound output with minimal effort. As outlined above, I mostly compared it to pyldapsearch output.
The main differences compared to the standard seem to be (a full example record follows the list):
- Records must be separated by a blank line (not dashes)
- Records must start with a dn: attribute
- Base64 encoded attribute values must be separated from the attribute name by two colons, i.e. name:: dmFsdWU=
- Each value of a multi-value attribute must be printed on its own line; currently they are comma separated, i.e.
  objectClass: top
  objectClass: domain
  instead of:
  objectClass: top, domain
- String attributes containing new lines are not handled correctly
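Putting these points together, a compliant record would look roughly like this (illustrative values only; the base64 value reuses the dmFsdWU= example from above):

dn: CN=Users,DC=ludus,DC=domain
objectClass: top
objectClass: container
name:: dmFsdWU=

with a blank line separating it from the next record.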
The simplest way to get standard-compliant output would probably be to base64 encode all attribute values. This is allowed by the standard, but it is obviously not that human readable and would require some BOFHound changes.
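That said, RFC 2849 only requires base64 for values that contain NUL, CR or LF bytes or non-ASCII data, start with a space, a colon or '<', or end with a space, so encoding selectively would keep most attributes readable. A rough sketch of that decision (the ldif_attr_line helper is hypothetical, not code from this PR or from BOFHound, and the optional 76-column line folding is left out):

import base64

def ldif_attr_line(name: str, value: bytes) -> str:
    # Hypothetical helper: emit one "name: value" line, switching to
    # "name:: <base64>" when RFC 2849 does not allow the raw value.
    needs_b64 = (
        any(b in (0x00, 0x0A, 0x0D) or b > 0x7F for b in value)  # NUL, LF, CR or non-ASCII anywhere
        or value[:1] in (b" ", b":", b"<")                       # unsafe first character
        or value.endswith(b" ")                                  # trailing space SHOULD be encoded
    )
    if needs_b64:
        return f"{name}:: {base64.b64encode(value).decode('ascii')}"
    return f"{name}: {value.decode('ascii')}"

print(ldif_attr_line("objectClass", b"top"))            # objectClass: top
print(ldif_attr_line("name", "Grüße".encode("utf-8")))  # name:: R3LDvMOfZQ==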
Thanks for explaining! I think it would be better, then, to modify the PR so that it says output mode "BOFHound"?
That is an option, yes. Another (preferred?) option is to make it actually RFC compliant; I gave it a try in this branch. I haven't yet looked at the changes required on the BOFHound side though.
Sure, could you PR it? Alternatively, I'm also happy to merge in the two different modes.
I've been testing the LDIF (non-RFC) branch against some setups and the results are looking good. I think it makes sense to make this the new default and keep the "direct BloodHound output" as a legacy format, while updating the docs to point to BOFHound instead for the second conversion.
Side note: I hit some minor conversion issues while running bofhound to load cert template data and domain data; I'm not sure whether this is an issue with how we're outputting the data or with how bofhound is loading it.