wip/rfc: prototype for IO for flat storage
This is obviously not mergeable.
But it allows us to explore what IO for flat storage could look like, and what its performance characteristics are. Apart from ugliness (e.g. the use of Java serialization for small amounts of metadata), it is missing support for edge properties.
Without further ado, the benchmarks. The test graph is a CPG for the Linux kernel via fuzzyc2cpg, with enhancements already applied: create with fuzzyc2cpg and load (which triggers conversion to odb and the enhancement passes), then store the resulting enhanced odb file. That file is the basis for all tests, so no enhancements run during the benchmarked `importCpg`. The graph has ~50 million nodes.

Tests were run on 64-bit Linux with OpenJDK 11, via `./ocular.sh -J-Xmx55G -XX:ObjectAlignmentInBytes=16` (object alignment setting verified via attached VisualVM), on a box with 64 GB of physical memory, 8 cores / 16 threads, swapping enabled, and `/sys/kernel/mm/transparent_hugepage/enabled` set to `madvise` (`always` is faster on my machine, but I left it on the default for the benchmarks). Tests were run on commit https://github.com/ShiftLeftSecurity/overflowdb/pull/222/commits/3402f1eea420503c4f21968a9fd0f0ae2fb7dd45 (if later versions are so WIP that they don't work at all, or the API changed and the benchmark code no longer runs, check out that commit).
File size for odb storage is 6.2 GB; for flatgraph it is 1.5 GB. Compared to odb, the improvements are:
- File size: 4x
- Heap memory for a lazy-loaded graph: 2x
- Time for lazy loading: 40x-50x; note that this comparison needs a big asterisk because of slow policy loading (we don't have a lot of C policies, but still)
- Heap memory for a completely loaded graph: 2x
- Time until everything is loaded with eager loading (i.e. multi-threaded): 40x-60x
- Time until everything is loaded sequentially: 10x-20x
- Storing a graph is 3x-4x slower than loading it. (Michael rightly asked for the analogous ratio for odb. I don't know; it doesn't seem trivial to test.)
@mpollmeier do you understand why `cpg.close` does not allow us to reclaim the memory?
Conversion from an odb graph is slow and unoptimized. We will also need to do something about memory spikes when storing graphs -- that data needs to move off the JVM heap and live in direct buffers only, which can be managed by native code that can use `realloc` (i.e. memory-mapping games when using glibc) to move the buffers around.
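To make the direct-buffer idea concrete, here is a minimal sketch using plain NIO; it is not the prototype's actual code, just an illustration of keeping bulk data outside the garbage-collected heap:

```scala
import java.nio.ByteBuffer

// allocateDirect places the backing memory outside the JVM heap, so large
// blocks of property data neither add GC pressure nor get copied around
// during collections.
val buf: ByteBuffer = ByteBuffer.allocateDirect(64 * 1024 * 1024)
buf.putLong(0, 42L)          // absolute put: write a long at byte offset 0
val x: Long = buf.getLong(0) // and read it back

// Note: plain NIO cannot grow a direct buffer in place -- resizing means
// allocate-and-copy. Getting realloc-style remapping is exactly why the
// buffer management would have to live in native code, as described above.
```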
echo "always" > /sys/kernel/mm/transparent_hugepage/enabled
gives a significant speedup, but I don't know see our users doing that.
```scala
// odb graph loading and conversion
workspace.reset; val tic = System.nanoTime; importCpg("/home/bruhns/repos/ocular/fooCopy/cpg.bin"); val toc = System.nanoTime; (toc-tic)*1e-6
// res0_4: Double = 126778.32410299999, after gc: 7.2 gb, peak: 14 gb
val tic = System.nanoTime; overflowdb.FlatGraph.touchGraph(cpg.graph); val toc = System.nanoTime; (toc-tic)*1e-6
// res1_3: Double = 257321.8846, after gc: 24 gb, peak: 27 gb
val tic = System.nanoTime; val ng = overflowdb.FlatGraph.makeFlatGraph(cpg.graph); val toc = System.nanoTime; (toc-tic)*1e-6
// res2_3: Double = 261221.50329599998, after gc: 35 gb, peak: 46 gb
val tic = System.nanoTime; cpg.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res3_3: Double = 31065.666878, after gc: 35 gb, peak: 37.5 gb
val tic = System.nanoTime; val writer = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg.fg", "rw"); ng.store(writer.getChannel, false); writer.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res4_5: Double = 164814.800828, after gc: 35 gb, peak: 59 gb / jvm has long gc pauses
// above store time is due to memory limitations -- jvm heap full, OS swapping

// eager load and store:
val tic = System.nanoTime; val reader = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg.fg", "r"); val ng = new overflowdb.FlatGraph.FlatGraphWIP(reader.getChannel, true); reader.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res0_5: Double = 6798.798844, after gc: 11.5 gb, peak: 12 gb
val tic = System.nanoTime; val writer = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg2.fg", "rw"); ng.store(writer.getChannel, false); writer.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res1_5: Double = 22740.137909999998, after gc: 11.5 gb, peak: 25 gb

// lazy load and touch
val tic = System.nanoTime; val reader = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg.fg", "r"); val ng = new overflowdb.FlatGraph.FlatGraphWIP(reader.getChannel, false); reader.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res0_5: Double = 2665.2101789999997, after gc: 3.2 gb, peak: 3.5 gb
val tic = System.nanoTime; overflowdb.FlatGraph.touchGraph(ng); val toc = System.nanoTime; (toc-tic)*1e-6
// res1_3: Double = 18385.785085, after gc: 11.5 gb, peak: 11.5 gb

// lazy load and store without touch
val tic = System.nanoTime; val reader = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg.fg", "r"); val ng = new overflowdb.FlatGraph.FlatGraphWIP(reader.getChannel, false); reader.close; val toc = System.nanoTime; (toc-tic)*1e-6
val tic = System.nanoTime; val writer = new java.io.RandomAccessFile("/home/bruhns/repos/ocular/fooCopy/cpg2.fg", "rw"); ng.store(writer.getChannel, false); writer.close; val toc = System.nanoTime; (toc-tic)*1e-6
// res1_5: Double = 28070.064976999998, after gc: 11.5 gb, peak: 30 gb
```
First of all: this is a really big improvement to the status quo, well done and thank you! I repeated your benchmarks by directly comparing only the graph operations, leaving out anything related to cpg (e.g. passes, policies etc.) and got very similar results:
```
./joern -J-Xmx55G -J-XX:ObjectAlignmentInBytes=16
```
```scala
workspace.reset

def timed[A](fun: => A): A = {
  val start = System.nanoTime
  val ret = fun
  val duration = scala.concurrent.duration.Duration.fromNanos(System.nanoTime - start)
  println(s"===> duration: ${duration.toMillis}ms")
  ret
}

val odbFile = "/home/mp/tmp/cpgtesting/linux-drivers-net.odb" // 600m file size

// lazy graph init
val graph = timed(io.shiftleft.codepropertygraph.generated.Cpg.withStorage(odbFile).graph)
// 16s, 1g heap: 10m nodes (refs only)

// fully load graph
timed(graph.edgeCount)
// 55s, 3.7g heap: 10m nodes, 24m edges (fully loaded)

// convert to flatgraph - graph must be fully loaded
overflowdb.FlatGraph.touchGraph(graph)
val ng = timed(overflowdb.FlatGraph.makeFlatGraph(graph))
// 43s, 5.3g heap (combined with old graph)
graph.close

timed {
  val writer = new java.io.RandomAccessFile("cpg.fg", "rw")
  ng.store(writer.getChannel, false)
  writer.close
}
// 4s, file size 134m

// lazy load
// restart joern
def timed[A](fun: => A): A = {
  val start = System.nanoTime
  val ret = fun
  val duration = scala.concurrent.duration.Duration.fromNanos(System.nanoTime - start)
  println(s"===> duration: ${duration.toMillis}ms")
  ret
}

val ng = timed {
  val reader = new java.io.RandomAccessFile("cpg.fg", "r")
  val ng = new overflowdb.FlatGraph.FlatGraphWIP(reader.getChannel, false)
  reader.close
  ng
}
// 900ms, 560m heap

// load entire graph
timed(overflowdb.FlatGraph.touchGraph(ng))
// 20s, 1.8g heap
```
For my understanding: we would still use the public API of our generated nodes/edges, but unlike `NodeRef`/`NodeDb` they would not be held in memory all the time (ignoring the overflow for this thought experiment), and instead get created when needed, correct?

So the heap sizes measured above are for the internal representation of the graph, and if the user does something like `val methods: Seq[Method] = cpg.method.l`, then these come on top, right? (Obviously, one should avoid doing so and rather iterate over the nodes without holding a copy - I'm just trying to understand the proposed model.)

If my understanding of the above is correct: does that mean that simply traversing the nodes will instantiate a lot of very short-lived objects? Something like `cpg.method.name.p` - in the old/current model that's very fast and doesn't cause much instantiation and/or GC activity. Would that be different with the proposed model, or how is that going to work?
No, the generated `NodeDb` classes will go away, and only `NodeHandle` will exist. We will have generated classes that subclass `NodeHandle`.

There will be (generated) accessors for properties and edges on the subclasses of `NodeHandle`. These can either be done as instance methods (generated per concrete node type), or as implicits (generated on the `HasFoo` trait).

Which one is better is then purely a performance question (we can have a static / implicit accessor on `HasFoo` that uses the `kindId` from the instance, or we can have a generated accessor with a hardcoded `kindId`). I strongly suspect that implicit accessors are faster, because we avoid the megamorphic callsites on `CfgNode.cfgOut` and the like. We can even generate both, netting us the best of all worlds (no dynamic dispatch when accessing properties of abstract types, and a compile-time constant in cases where the static type information is unique).
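To make the tradeoff concrete, here is a minimal sketch of both flavours; all names (`FlatGraph.propertyForKind`, `kindId`, the kind constant) are made up for illustration and are not the actual codegen output:

```scala
object AccessorSketch {
  class FlatGraph {
    // hypothetical lookup: property values stored per (node kind, property name)
    def propertyForKind(kindId: Int, property: String, seq: Int): String = ???
  }

  trait NodeHandle {
    def graph: FlatGraph
    def kindId: Int // the concrete node kind
    def seq: Int    // index of this node within its kind
  }

  trait HasName extends NodeHandle

  // Variant 1: one implicit accessor generated on the HasFoo trait. It reads
  // kindId from the instance, so every node kind shares a single call site.
  implicit class HasNameOps(val node: HasName) extends AnyVal {
    def nameViaTrait: String =
      node.graph.propertyForKind(node.kindId, "NAME", node.seq)
  }

  // Variant 2: a generated instance method per concrete node type, where the
  // kindId is a compile-time constant (7 is an arbitrary stand-in for METHOD).
  class Method(val graph: FlatGraph, val seq: Int) extends HasName {
    val kindId: Int = 7
    def name: String = graph.propertyForKind(7, "NAME", seq)
  }
}
```

Generating both would mean the implicit serves call sites that only know the abstract type, while code with precise static types gets the hardcoded-constant method.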
👍🏻 that all sounds great!
I envisaged that `NodeDb` will go away, and yes, it makes sense to just make the generated nodes subclass `NodeHandle`, which is essentially a `NodeRef` replacement then.
The lazy loading mechanism works on a per-node-type and per-property basis, right? I.e. if I look up the `name` property of a particular `method` node, it'll load all `name` properties of all `method` nodes, nothing more and nothing less. Correct?

I guess in theory we could also add a swapping mechanism that allows us to unload certain node/property combinations, e.g. on low heap, similar to what we do now with the overflow..?
> The lazy loading mechanism works on a per-node-type and per-property basis, right? I.e. if I look up the `name` property of a particular `method` node, it'll load all `name` properties of all `method` nodes, nothing more and nothing less. Correct?
>
> I guess in theory we could also add a swapping mechanism that allows us to unload certain node/property combinations, e.g. on low heap, similar to what we do now with the overflow..?
It is that way for all non-string properties and edges.

For strings, the story is different: we really want a stringpool, and to store indices into the pool, because e.g. `DISPATCH_TYPE` on `CALL` nodes is very repetitive.

The current WIP implementation eagerly loads the entire stringpool (one giant array of all strings), and when you access `NAME` of `METHOD` for the first time, it loads the corresponding int-array from disk and builds a medium-sized array of strings out of it.

The "lazy-loading" objects (the `ReadJob`s) keep alive a reference to the giant array-of-strings; once all `ReadJob`s are done, the giant array can get garbage collected.
From this general architecture, we can then decide tradeoffs: do we want one single stringpool for the entire graph (as it is now)? Or one per property name, or per property + node-label combination? Fewer, larger stringpools mean smaller on-disk files, less heap consumption for the fully loaded graph, and faster eager loading. More, smaller stringpools give faster lazy loading and less heap consumption for graphs that are mostly unloaded.

A second consideration is whether to maintain the stringpool across modifications. This WIP currently does not maintain it, and instead builds a new one when storing the graph. If we maintain the stringpool, then we get into the unhappy situation of having to do memory management on it (for load-graph, modify-graph, store-graph cycles, we don't want unbounded growth of the pool).
I'm on board - let's move forward with this! @fabsx00 @ml86 any thoughts?
@bbrehm what are the next steps? Do you want to continue this (e.g. add edge properties, property indices) and clean it up first? Some thoughts that come to mind after that:

- Most frontends create odb binaries these days. Your current conversion function `makeFlatGraph` would require the old storage and the generated nodes on the classpath, but we could probably rebuild the import on top of just `NodeDeserializer`; that would likely simplify classpath and build issues.
- I quite like the project structure (core, traversal, codegen, integrationtest, sbt-plugin), so maybe we can keep that. I guess this will essentially be a rewrite of `core`, right? Since the name doesn't quite fit any more: wanna choose a new one?
Thumbs up from my side as well. Great work!
I'm currently working on getting some kind of schema into the prototype.
The idea is the following: when making a FlatGraph, we pass a schema template (node factories, what kind of properties are expected, ...); then, on loading, we unify the schema from the file format with the schema template, extending it with whatever we found in the file. (Goal: at first, we can test with an empty schema template.)
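As a sketch of the unification step (assuming, purely for illustration, that a schema is a map from node label to known property names; the real template also carries node factories etc.):

```scala
// Sketch only - not the actual prototype types.
case class SchemaSketch(propertiesByLabel: Map[String, Set[String]])

def unify(template: SchemaSketch, fromFile: SchemaSketch): SchemaSketch = {
  // Start from the template and extend it with whatever the file contains.
  val labels = template.propertiesByLabel.keySet ++ fromFile.propertiesByLabel.keySet
  SchemaSketch(labels.map { label =>
    label -> (template.propertiesByLabel.getOrElse(label, Set.empty) ++
              fromFile.propertiesByLabel.getOrElse(label, Set.empty))
  }.toMap)
}

// With an empty template, the unified schema is just whatever the file
// declares, which is what makes "test with an empty schema template" possible.
```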
The second part that comes with it is fast property accessors. These will at first be schema-free, i.e. the user asks the graph "give me the accessor for property NAME".
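Something along these lines, where `propertyAccessor` and the surrounding types are hypothetical names, not the prototype's API:

```scala
// Sketch only: resolve the accessor once, then use it in hot loops without
// re-resolving the property name per node.
trait NodeHandleSketch { def kindId: Int; def seq: Int }

trait PropertyAccessor[T] {
  def apply(node: NodeHandleSketch): T
}

class FlatGraphSketch {
  // hypothetical schema-free lookup by property name
  def propertyAccessor[T](name: String): PropertyAccessor[T] = ???
}

// usage (assuming some way to iterate nodes of a kind):
//   val name = graph.propertyAccessor[String]("NAME")
//   methodNodes.foreach(m => println(name(m)))
```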
Then, once we have that, we can benchmark real graph operations (e.g. run a depth-first search over the CFG for each METHOD).
At some point, this ugly prototype will collapse under its own weight -- once that happens, we can start cleaning it up.
I fear that I'm not entirely ready to clean it up now -- large changes / experiments are faster in a big goop of spaghetti code (which is what this code is atm!), and the structure has not yet crystallized enough to extract "proper" classes.
@mpollmeier what do you think about using a Java/Scala mix for odb-core? The prototype needs to import the Scala stdlib only for conversion right now, but I think we will want Scala interfaces as well. The point is to enable zero-overhead access from Scala: Java users want a Java iterator, Scala users want a Scala iterator, so let's provide one `getJIterator()` and one `getSIterator()`. Wrapping a Java iterator via `.asScala` has significant overhead.
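A minimal sketch of what that dual interface could look like (the class name is made up):

```scala
// Sketch only: both iterators walk the same backing array directly, so neither
// Java nor Scala callers pay for an extra wrapper layer per element.
class NodeSeqSketch[T <: AnyRef](nodes: Array[T]) {
  def getJIterator(): java.util.Iterator[T] = new java.util.Iterator[T] {
    private var i = 0
    override def hasNext: Boolean = i < nodes.length
    override def next(): T = { val n = nodes(i); i += 1; n }
  }
  def getSIterator(): Iterator[T] = new Iterator[T] {
    private var i = 0
    override def hasNext: Boolean = i < nodes.length
    override def next(): T = { val n = nodes(i); i += 1; n }
  }
}
// Compared to handing out only the Java iterator and wrapping it with
// `.asScala`, the native Scala iterator saves one indirection per element.
```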
> @mpollmeier what do you think about using a Java/Scala mix for odb-core?
Yup, I also think it's time to bring Scala into core - Denis and I contemplated that a few times in the past, since it would have made things easier and simpler, but no individual step was quite sufficient to actually go ahead and bring it in.
Given that this is already a big experiment, we should probably stick to Scala 2 for now (and upgrade later), but if you feel super confident you can use Scala 3 straight away. They can be mixed on the same classpath and use the same standard library.