GoReSym
GoReSym copied to clipboard
profile runtime and optimize performance
GoReSym is pretty slow and this makes it difficult to deploy at a large scale. Despite being written in Go and compiled to native code, it may take seconds or longer to process input files. For example, to run the entire test suite, it takes many minutes.
Given that Go programs can execute in a split second and their runtime uses all the metadata that GoReSym parses, it should be possible for GoReSym to run in a split second too, at least in the common case. (I understand that in obfuscated binaries that some scans may take a bit longer).
We should profile GoReSym to identify the slow parts of the algorithm and optimize their implementation. Our goal should be for GoReSym to finish within 0.10s for the common case.
to profile using the tests:
go get github.com/pkg/profile
and then add the following to TestAllVersions
:
defer profile.Start(profile.ProfilePath(".")).Stop()
via: https://hackernoon.com/go-the-complete-guide-to-profiling-your-code-h51r3waz
here's cpu.pprof from running tests on #30 cpu.zip
@williballenthin ➜ /workspaces/GoReSym (fix/issue-29) $ pprof -cum -top cpu.pprof
File: GoReSym.test
Type: cpu
Time: Aug 3, 2023 at 9:29pm (UTC)
Duration: 352.86s, Total samples = 354.07s (100.34%)
Showing nodes accounting for 331.39s, 93.59% of 354.07s total
Dropped 581 nodes (cum <= 1.77s)
flat flat% sum% cum cum%
0 0% 0% 335.93s 94.88% testing.tRunner
0 0% 0% 335.91s 94.87% github.com/mandiant/GoReSym.TestAllVersions.func1
0 0% 0% 335.91s 94.87% github.com/mandiant/GoReSym.main_impl
0 0% 0% 295.12s 83.35% github.com/mandiant/GoReSym/objfile.(*Entry).PCLineTable
0 0% 0% 295.12s 83.35% github.com/mandiant/GoReSym/objfile.(*File).PCLineTable (inline)
0 0% 0% 289.02s 81.63% github.com/mandiant/GoReSym/objfile.findModuleInitPCHeader
0 0% 0% 288.37s 81.44% rsc.io/binaryregexp.(*Regexp).FindAllIndex
0 0% 0% 288.37s 81.44% rsc.io/binaryregexp.(*Regexp).allMatches
0 0% 0% 288.37s 81.44% rsc.io/binaryregexp.(*Regexp).doExecute
33.40s 9.43% 9.43% 288.28s 81.42% rsc.io/binaryregexp.(*machine).match
79.41s 22.43% 31.86% 183.96s 51.96% rsc.io/binaryregexp.(*machine).step
0 0% 31.86% 169.58s 47.89% github.com/mandiant/GoReSym/objfile.(*elfFile).pcln
0 0% 31.86% 169.57s 47.89% github.com/mandiant/GoReSym/objfile.(*elfFile).pcln_scan
131.32s 37.09% 68.95% 150.59s 42.53% rsc.io/binaryregexp.(*machine).add
0 0% 68.95% 95.43s 26.95% github.com/mandiant/GoReSym/objfile.(*machoFile).pcln
0 0% 68.95% 95.43s 26.95% github.com/mandiant/GoReSym/objfile.(*machoFile).pcln_scan
0 0% 68.95% 29.15s 8.23% github.com/mandiant/GoReSym/objfile.(*peFile).pcln
0 0% 68.95% 29.14s 8.23% github.com/mandiant/GoReSym/objfile.(*peFile).pcln_scan
0 0% 68.95% 28.55s 8.06% github.com/mandiant/GoReSym/objfile.(*Entry).ModuleDataTable
0 0% 68.95% 28.55s 8.06% github.com/mandiant/GoReSym/objfile.(*File).ModuleDataTable
27.90s 7.88% 76.83% 27.90s 7.88% runtime.memmove
0 0% 76.83% 19.57s 5.53% github.com/mandiant/GoReSym/debug/elf.(*File).DataAfterSection
0.04s 0.011% 76.84% 19.01s 5.37% runtime.systemstack
0 0% 76.84% 18.39s 5.19% github.com/mandiant/GoReSym/objfile.(*elfFile).moduledata_scan
0.06s 0.017% 76.86% 17.52s 4.95% runtime.growslice
1.96s 0.55% 77.41% 14.10s 3.98% rsc.io/binaryregexp/syntax.(*Inst).MatchRune (inline)
0.75s 0.21% 77.62% 13.27s 3.75% runtime.mallocgc
12.13s 3.43% 81.05% 12.14s 3.43% rsc.io/binaryregexp/syntax.(*Inst).MatchRunePos
0.01s 0.0028% 81.05% 11.58s 3.27% github.com/mandiant/GoReSym/objfile.(*Entry).ParseTypeLinks
0 0% 81.05% 11.58s 3.27% github.com/mandiant/GoReSym/objfile.(*File).ParseTypeLinks (inline)
0.02s 0.0056% 81.06% 11.33s 3.20% github.com/mandiant/GoReSym/objfile.(*Entry).ParseType
0.01s 0.0028% 81.06% 11.21s 3.17% runtime.gcBgMarkWorker
0.14s 0.04% 81.10% 11.18s 3.16% github.com/mandiant/GoReSym/objfile.(*Entry).ParseType_impl
0.02s 0.0056% 81.11% 10.63s 3.00% os.(*File).ReadAt
0.03s 0.0085% 81.11% 10.61s 3.00% os.(*File).pread (inline)
0.04s 0.011% 81.13% 10.58s 2.99% internal/poll.(*FD).Pread
0.07s 0.02% 81.14% 10.58s 2.99% io.(*SectionReader).ReadAt
0.04s 0.011% 81.16% 10.35s 2.92% syscall.Syscall6
0 0% 81.16% 10.34s 2.92% syscall.Pread (inline)
0 0% 81.16% 10.34s 2.92% syscall.pread
9.87s 2.79% 83.94% 9.87s 2.79% runtime/internal/syscall.Syscall6
0.01s 0.0028% 83.95% 9.81s 2.77% syscall.RawSyscall6
0.01s 0.0028% 83.95% 9.69s 2.74% runtime.gcBgMarkWorker.func2
0.01s 0.0028% 83.95% 9.68s 2.73% github.com/mandiant/GoReSym/debug/macho.(*File).DataAfterSection
0.05s 0.014% 83.97% 9.68s 2.73% runtime.gcDrain
0.01s 0.0028% 83.97% 9.44s 2.67% github.com/mandiant/GoReSym/objfile.(*machoFile).moduledata_scan
3.57s 1.01% 84.98% 8.42s 2.38% runtime.scanobject
0.04s 0.011% 84.99% 7.46s 2.11% runtime.makeslice
7.06s 1.99% 86.98% 7.07s 2.00% rsc.io/binaryregexp.(*inputBytes).step
6.81s 1.92% 88.91% 6.81s 1.92% runtime.memclrNoHeapPointers
6.67s 1.88% 90.79% 6.72s 1.90% rsc.io/binaryregexp.(*machine).alloc (inline)
0.02s 0.0056% 90.80% 5.74s 1.62% github.com/mandiant/GoReSym/debug/elf.(*Section).Data
0.08s 0.023% 90.82% 4.72s 1.33% github.com/mandiant/GoReSym/objfile.(*Entry).readRTypeName
0 0% 90.82% 4.70s 1.33% runtime.markroot
0.06s 0.017% 90.84% 4.21s 1.19% runtime.deductAssistCredit
0 0% 90.84% 4.15s 1.17% runtime.gcAssistAlloc
0 0% 90.84% 3.79s 1.07% runtime.gcAssistAlloc.func1
0 0% 90.84% 3.79s 1.07% runtime.gcAssistAlloc1
0.08s 0.023% 90.86% 3.75s 1.06% runtime.gcDrainN
0 0% 90.86% 3.60s 1.02% runtime.memclrNoHeapPointersChunked
0.09s 0.025% 90.88% 3.31s 0.93% github.com/mandiant/GoReSym/objfile.(*peFile).read_memory
0.01s 0.0028% 90.89% 3.19s 0.9% github.com/mandiant/GoReSym/debug/macho.(*Section).Data
0.12s 0.034% 90.92% 3.01s 0.85% runtime.bgsweep
0 0% 90.92% 3s 0.85% runtime.bgscavenge
0.03s 0.0085% 90.93% 2.99s 0.84% runtime.(*scavengerState).run
0 0% 90.93% 2.96s 0.84% runtime.(*scavengerState).init.func2
2.95s 0.83% 91.76% 2.95s 0.83% rsc.io/binaryregexp.newLazyFlag (inline)
1.84s 0.52% 92.28% 2.95s 0.83% runtime.findObject
0 0% 92.28% 2.94s 0.83% runtime.(*pageAlloc).scavenge
0.01s 0.0028% 92.28% 2.93s 0.83% runtime.(*pageAlloc).scavenge.func1
0 0% 92.28% 2.92s 0.82% runtime.(*pageAlloc).scavengeOne
0 0% 92.28% 2.86s 0.81% runtime.markroot.func1
0.01s 0.0028% 92.29% 2.80s 0.79% runtime.sysUnused
0.02s 0.0056% 92.29% 2.79s 0.79% runtime.sysUnusedOS
2.77s 0.78% 93.07% 2.77s 0.78% runtime.madvise
0.04s 0.011% 93.09% 2.75s 0.78% github.com/mandiant/GoReSym/objfile.(*elfFile).read_memory
0.18s 0.051% 93.14% 2.33s 0.66% runtime.suspendG
0.04s 0.011% 93.15% 2.26s 0.64% io.ReadAtLeast
0 0% 93.15% 2.26s 0.64% io.ReadFull (inline)
0 0% 93.15% 2.18s 0.62% io.(*SectionReader).Read
1.33s 0.38% 93.52% 1.92s 0.54% runtime.scanblock
0.24s 0.068% 93.59% 1.88s 0.53% bytes.Index
0.01s 0.0028% 93.59% 1.85s 0.52% runtime.gcMarkDone
0 0% 93.59% 1.79s 0.51% runtime.markrootBlock
Leaving this open for future optimization work. The kubectl binary is a great test case, it's massive and thoroughly exercises most code paths when run with -t -p -d
flags.
Easier way to profile use https://www.speedscope.app/ and generate profiles by just importing _ "net/http/pprof" at the top of a package and run with:
go test -run TestBig -cpuprofile cpu.txt
Runtime is still dominated by scanning for moduledata but also the time it takes to parse types via ParseType_impl. Can probably minimize the moduledata scans a bit with some algorithm adjustments to the recovery. We're probably close to scanning through memory as fast as possible. I was able to see it's using bytealg.Index which is an AVX2 hand rolled assembly routine - we hit the fast path as long as our needle is < 64 bytes which we always will be.
Unreasonable memory consumption leading to OOM in sample https://github.com/EdgeGuardP/EdgeGuard-Stealer/releases/tag/EdgeStealer4.0
the regex convert from yara is too slow, maybe we can convert it to rangel patterns
https://www.colm.net/open-source/ragel/
@virusdefender is it specifically the conversion routine that is slow or more generally the memory scanning? Can you show some benchmarks that demonstrate the performance issue perhaps?
@virusdefender is it specifically the conversion routine that is slow or more generally the memory scanning? Can you show some benchmarks that demonstrate the performance issue perhaps?
using the pprof result in https://github.com/mandiant/GoReSym/issues/31#issuecomment-1664681057
it shows the regex match is slow
@virusdefender that pprof result is out of date, if you can profile the application on the current master you may find it's faster. If not, please provide some data suggesting where else we may be missing optimizations :)