camflow-dev
How best to represent namespaces?
There are two options that come to my mind:
- As node attributes (the currently implemented approach);
- As a separate node in the graph (in the same way the "machine" is represented).
Other ideas? This also leads to a more general question: how do we know what is the best way to represent information?
SPADE compatibility is likely to be the simplest way to decide, I'd think? In general though, it would largely depend on a speed / space trade-off, wouldn't it, until filtering becomes involved? I can't think off the top of my head of an instance where inheritance of the namespace wouldn't work well (but I haven't thought hard at all: I'm mostly thinking of PIDs scoped under namespaces). Thus you might save some memory by using graph nodes for namespaces, and queries are likely to be mostly scoped within a given namespace. Queries that cross namespaces, or cases where the graph might need to get pruned... maybe these are places where node attributes would be better?
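To make the query side of that trade-off concrete, here is a minimal sketch of what "inheritance of the namespace" could look like at query time if namespaces were separate nodes attached only when they change. This is not CamFlow or SPADE code: the node names, the `previous_version` and `attached_ns` maps, and both helper functions are hypothetical, and the graph is reduced to plain dictionaries.

```python
# Hypothetical toy data: four versions of one task. A namespace edge exists
# only on the versions where the namespace was first seen or changed.
previous_version = {
    "task1v1": "task1v0",
    "task1v2": "task1v1",
    "task1v3": "task1v2",
}
attached_ns = {"task1v0": "ns:A", "task1v2": "ns:B"}


def namespace_of(node, attached_ns, previous_version):
    """Walk back through earlier versions until one carries a namespace edge."""
    while node is not None:
        if node in attached_ns:
            return attached_ns[node]
        node = previous_version.get(node)
    return None  # no version of this object was ever attached to a namespace


def nodes_in_namespace(nodes, ns, attached_ns, previous_version):
    """Namespace-scoped query: keep the nodes whose inherited namespace is ns."""
    return [n for n in nodes
            if namespace_of(n, attached_ns, previous_version) == ns]


all_nodes = ["task1v0", "task1v1", "task1v2", "task1v3"]
in_a = nodes_in_namespace(all_nodes, "ns:A", attached_ns, previous_version)
```

With namespaces stored as attributes, the same query would be a single field comparison per node. With the node representation, each lookup may need to walk the version chain, which is one place where the speed / space trade-off shows up.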
Given the earlier experiment run by @michael-hahn while working on UNICORN, I would suspect there are more fundamental problems where compatibility is concerned. We always meant to explore this further but never did so.
We may also consider in this discussion how groups and users are represented. At the moment they appear as attributes as well.
All in all, it seems we are circling back to the never-solved problem of how system executions should be represented as a provenance graph.
Regarding representing system executions, I'm not sure that there is a single best solution though, is there? I don't remember if ProvMark touches on this, but otherwise it strikes me as a great experiment: compare the query performance of a set of workloads under different representations, ideally within the same engine. (Not that there's a current shortage of interesting projects...)
I could be totally wrong here, but if we have a separate node for the namespace, wouldn't we basically double the number of nodes in the graph, since it is an attribute that every type of node should technically have? I mean, we can probably do smarter things, like attaching only one namespace node until, for example, a new version of the same node has a new namespace. But then, as @dme26 said, querying can be a bit tricky. At the end of the day, what do we consider a node should semantically represent? There are no right or wrong answers, because you can define it however you want. For example, I remember seeing a system (I cannot for the life of me recall its name) that basically constructs every attribute as a node, and in that case it has different "meta-types" of nodes: some nodes are entity nodes (like a file node) and some are attribute nodes. How does CamFlow want to define its graph semantics then?
I don't think it would double the number of nodes. As mentioned, I was thinking of employing the same approach as for the "machine" node (i.e. attaching it once and then only reattaching it if it has been modified), which seems to correspond to the "smarter things" option.
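As a rough illustration of why the node count need not double, the sketch below records the same toy event stream under both representations. It is only a model of the idea discussed here, not how CamFlow records provenance: the `Graph` class, both `record_*` functions, the event tuples, and the namespace identifiers are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: list = field(default_factory=list)  # (node_id, attributes)
    edges: list = field(default_factory=list)  # (source_id, target_id)


def record_as_attributes(events):
    """Option 1: every node version carries its namespace as an attribute."""
    g = Graph()
    for obj_id, version, ns in events:
        g.nodes.append((f"{obj_id}v{version}", {"ns": ns}))
    return g


def record_as_nodes(events):
    """Option 2: one node per namespace, attached to an object once and
    reattached only when that object's namespace changes."""
    g = Graph()
    known_ns = set()   # namespaces that already have a node
    last_ns = {}       # obj_id -> namespace most recently attached
    for obj_id, version, ns in events:
        node_id = f"{obj_id}v{version}"
        g.nodes.append((node_id, {}))
        if ns not in known_ns:
            known_ns.add(ns)
            g.nodes.append((f"ns:{ns}", {}))
        if last_ns.get(obj_id) != ns:
            last_ns[obj_id] = ns
            g.edges.append((node_id, f"ns:{ns}"))
    return g


# Hypothetical events: (object, version, namespace identifier).
events = [
    ("task1", 0, 4026531836),
    ("task1", 1, 4026531836),
    ("task1", 2, 4026532200),  # task1 moves to a different namespace
    ("file7", 0, 4026531836),
]
by_attr = record_as_attributes(events)
by_node = record_as_nodes(events)
```

In this toy run the attribute version has 4 nodes and no extra edges, while the node version has 6 nodes (4 object versions plus 2 namespaces) and 3 attachment edges. The extra nodes grow with the number of distinct namespaces rather than with the number of object versions, though the attachment edges grow with the number of objects and namespace changes.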
CamFlow tries to represent the state of kernel objects as node + packet and some such transient structure. I think a good thing to do may be to try to write down formally what we are trying to achieve and see if our currently implemented model matches our initial intent. Does that sound reasonable? Now that I think about it, there are a number of cases where node vs. attribute could be debated (e.g. superblock).