cti-python-stix2

Better support references between Cyber Observables

Open gtback opened this issue 7 years ago • 7 comments

Breaking out the conversation from #175, which is really just about documentation.

From @chisholm:

I had a similar problem with my workbench/workflow experiments. I wanted to create network-traffic observables from apache httpd logs. Just the observables, not the SDOs. I wanted SDO construction to be a "downstream" operation: they would be built up from a stream of observables. Problem is, those observables have references, and I needed a way to generate them without having any IDs from an SDO.

What I did was allow the observables to directly refer to other observables, which meant there was no need for an ID. IDs are an indirection mechanism (obs1 -> ID -> obs2); I just cut out the middleman. So what I wound up with was essentially object graphs composed of observables. When creating an observed-data SDO, you can slice up the graph along the ref/refs properties to recover the individual observables. IDs are generated on the fly as needed. Users never have to deal with them, and they can never be wrong.

Obviously, I was not working with python-stix2 for the graphs and their slicing and dicing. It was plain dicts. I had another workflow component that would transform a stream of dicts to a stream of objects by calling stix2.parse() on them all.

An ideal solution would be one where users don't need to make up their own ~ideas~ IDs, and don't even need to be aware of them. In python-stix2, because objects are immutable, it's possible (I think) to get into situations where there are circular references, such that there's no "first object" you can create without references to other objects that haven't been created yet. Did you run into this with your dict approach, @chisholm? How did you deal with it?

EDIT: obviously I was only semi-conscious writing this.

gtback • May 10 '18 12:05

An ideal solution would be one where users don't need to make up their own ideas, and don't even need to be aware of them.

Are you suggesting we need additional APIs? Of course I expect that functions to slice apart these graphs (and anything else that seems helpful) would be provided. Hmmm... not sure if this was intended to be supportive of the idea or not. :)

In python-stix2, because objects are immutable, it's possible (I think) to get into situations where there are circular references

There are circular references because of immutability? Mutable objects wouldn't have circular references? Can you explain?

Did you run into this with your dict approach? How did you deal with it?

I did have a check for cycles, but I didn't think legitimate observable graphs would have them. So I just raised an exception. I'm sure they can be handled, but it would add some more complexity within the transformation process (i.e. from a graph which is direct to indirect-via-IDs).
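
A cycle check of this kind can be sketched on plain nested dicts. The following is a minimal illustration, not the experimental code described above. The function name `check_no_cycles` is hypothetical, and the sketch assumes that only top-level properties ending in `_ref` or `_refs` hold nested observables.

```python
def check_no_cycles(node, _path=None):
    """Raise ValueError if a nested observable dict refers back to an ancestor.

    Hypothetical helper: assumes nested observables appear only under
    top-level properties whose names end in "_ref" or "_refs".
    """
    path = set() if _path is None else _path
    if id(node) in path:
        raise ValueError("cycle detected in observable graph")

    path.add(id(node))
    for prop, value in node.items():
        if prop.endswith("_ref") and isinstance(value, dict):
            children = [value]
        elif prop.endswith("_refs") and isinstance(value, list):
            children = [v for v in value if isinstance(v, dict)]
        else:
            children = []
        for child in children:
            check_no_cycles(child, path)
    # Leaving this node: a second visit via another branch is sharing, not a cycle.
    path.remove(id(node))
```

Tracking only the nodes on the current path means an observable shared by two parents is accepted, while a true loop is rejected.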

chisholm • May 14 '18 18:05

No, the circular references aren't because of immutability. I just meant that, currently, you can do something like this:

a = Foo(bar_ref='0', _valid_refs={'0': 'bar'})
b = Bar(foo_ref='1', _valid_refs={'1': 'foo'})
obs = ObservedData(objects={'0': b, '1': a})

Trying to do something like this while automating references would be impossible, because Python raises a NameError when b is referenced before it has been assigned:

a = Foo(bar=b)
b = Bar(foo=a)
obs = ObservedData(objects=[a,b])

And because of immutability, you wouldn't be able to do a = Foo() and later a.bar = b.

I'm not sure this situation would come up in any actual Cyber Observable objects.

gtback • May 15 '18 21:05

Right, immutability makes it impossible to construct an ObservedData with cyclic observables in that way. You could build up an objects dict manually, and initialize the SDO from that, something like:

objects = {
    '0': Foo(bar_ref='1'),
    '1': Bar(foo_ref='0')
}
obs = ObservedData(objects=objects)

The _valid_refs stuff might seem a little silly in this case, because it's implied by the dict keys. Anyway, if the goal is to let people work with observables outside of the context of an observed-data SDO, this doesn't help.

Your idea from issue 175 was:

c = NetworkTraffic(src_ref=a, dst_ref=b, protocols='tcp')

where a and b are other observables. This is essentially creating a graph. Problem is, if that NetworkTraffic observable must be immutable after construction, you can't break it apart to create the SDO. At least, not how immutability is conceptualized now (which is basically an immutable dict). But... I think the graph idea is a good one :)

Creating complex immutable objects isn't a new idea. A design I've seen to make it easier is to have a "builder" API. I guess the idea is that these immutable objects aren't so easy to bring into existence all at once, in one fell swoop. Especially those with a lot of inner structure. You probably need to be able to build them up incrementally. That means having some kind of mutable intermediary form. Once you've got everything the way you like it, you invoke a "build" operation, which freezes it all into the final immutable object.

Lucky for us, I don't think we need much of that: Python's dict, list, and other types already provide mutable intermediate forms, the literals make them easy to express in code, and we already support creating SDOs et al from them. But it might make sense to provide some "builder"-support functionality, to make it easier for people to build these structures up before calling stix2.parse() on them. The point of all this is that seen this way, these kinds of functions which operate on plain dicts (or other simple mutable types) and produce plain dicts, might make a lot more sense. They are the "builder" functions, helping users construct new complex objects from several pieces. Maybe most SDOs are simple enough that helpers/builders don't make sense, but observed-data SDOs may be one case where there is enough complexity that it does make sense.

So, here's an idea: we design some functions to operate on dict versions of these observables/SDOs. E.g. we have a function which takes an observable graph (as dicts) and adds the observables to an observed-data dict, or maybe just its "objects" mapping, and handles all the IDs automatically. Dicts in, dicts out. This could be repeated several times for several observable graphs, building up the observed-data objects mapping. When the user is done, they stix2.parse() it, and voila, a complete immutable ObservedData instance.
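
A rough sketch of what such a "dicts in, dicts out" helper might look like follows. It is illustrative only: the name `add_graph_to_objects` is hypothetical, keys are assumed to be consecutive string integers ("0", "1", ...), and only top-level properties ending in `_ref` or `_refs` are treated as links (references nested inside extensions are ignored).

```python
def add_graph_to_objects(graph, objects, _seen=None):
    """Flatten a nested observable graph into an observed-data "objects" mapping.

    Hypothetical helper. It is destructive: nested dicts under *_ref / *_refs
    properties are replaced in place by generated keys. Returns the key
    assigned to the root observable.
    """
    seen = {} if _seen is None else _seen
    if id(graph) in seen:
        # Already placed in the mapping (shared observable): reuse its key.
        return seen[id(graph)]

    # Assumes existing keys are "0".."n-1", so the next free key is str(n).
    key = str(len(objects))
    objects[key] = graph
    seen[id(graph)] = key

    for prop, value in list(graph.items()):
        if prop.endswith("_ref") and isinstance(value, dict):
            graph[prop] = add_graph_to_objects(value, objects, seen)
        elif prop.endswith("_refs") and isinstance(value, list):
            graph[prop] = [
                add_graph_to_objects(v, objects, seen) if isinstance(v, dict) else v
                for v in value
            ]
    return key


objects = {}
add_graph_to_objects(
    {
        "type": "network-traffic",
        "src_ref": {"type": "ipv4-addr", "value": "198.51.100.7"},
        "protocols": ["http"],
    },
    objects,
)
```

Calling the helper repeatedly with the same `objects` mapping accumulates several graphs; the finished mapping could then be placed under the "objects" property of an observed-data dict and handed to stix2.parse().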

Regarding your idea of observable reuse: with immutable objects, it's not possible. Each observable could slot into an observed-data SDO in a different way, with different IDs. If you can never change the observable, you can never adapt it to different environments. With plain old mutable dicts, it could be simple.

chisholm • May 17 '18 16:05

Some further thoughts: although once constructed these objects can't be changed, split apart, etc, you could imagine creating changed or split-apart copies of them. I suppose that way you could design an API to work directly with immutable objects instead of mutable dicts. If you made a copy every time you wanted to change something though, that could become rather inefficient.

The experimental code I wrote was actually destructive to the observable graph parameter: the caller's value was dismantled. I still think it makes sense to do that: it gives callers the opportunity to be efficient. They have a choice; if they want to retain their original value, they can call copy.deepcopy() themselves. It's a one-liner after all, so not a big burden.
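
To illustrate the trade-off, here is a small sketch in which the caller keeps the original value by copying it before handing it to a destructive function. The helper `detach_src` is a hypothetical stand-in for any destructive operation, not part of python-stix2.

```python
import copy


def detach_src(graph):
    """Hypothetical destructive helper: remove and return the nested src_ref dict."""
    return graph.pop("src_ref")


original = {
    "type": "network-traffic",
    "src_ref": {"type": "ipv4-addr", "value": "198.51.100.7"},
    "protocols": ["http"],
}

# The caller pays for a copy only when the original must survive.
working = copy.deepcopy(original)
src = detach_src(working)
```

After this runs, `working` has been dismantled while `original` still holds its nested `src_ref` observable.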

Also, I feel like that _valid_refs validation was kind of misplaced. Maybe observable objects should be more naive and not try to know what their surrounding environment is/will be. If we have a more reliable way of creating ObservedData objects and auto-generating references, maybe it's less important to validate them at the observable level. How about letting ObservedData be responsible for validating all its observable references at the time of creation?
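
As a sketch of what container-level validation could look like on plain dicts, the check below scans a finished "objects" mapping for references that point at missing keys. The function name `find_dangling_refs` is hypothetical, and the sketch assumes references are strings stored under top-level properties ending in `_ref` or `_refs`.

```python
def find_dangling_refs(objects):
    """Return (key, property, target) triples for references with no matching key.

    Hypothetical helper operating on a flat observed-data "objects" mapping
    of plain dicts; it does not use or replace python-stix2's own validation.
    """
    dangling = []
    for key, observable in objects.items():
        for prop, value in observable.items():
            if prop.endswith("_ref"):
                targets = [value]
            elif prop.endswith("_refs"):
                targets = list(value)
            else:
                continue
            for target in targets:
                # Anything that is not a string key of the mapping is reported.
                if not isinstance(target, str) or target not in objects:
                    dangling.append((key, prop, target))
    return dangling


problems = find_dangling_refs({
    "0": {"type": "network-traffic", "src_ref": "1", "dst_ref": "2"},
    "1": {"type": "ipv4-addr", "value": "198.51.100.7"},
})
```

Here `problems` reports the `dst_ref` on object "0", since no object with key "2" exists in the mapping.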

chisholm • May 17 '18 22:05

I like the idea of having a separate "builder" API for ObservedData objects. My suggestion was essentially the same thing, but relying on the fact that the SDOs and Cyber Observables aren't really immutable since they use a dict as an underlying data store. They present an immutable interface, but I was thinking we could encapsulate well-controlled changes to that inner state. But the more I think about that (and type out the previous sentence), the more it seems like a recipe for disaster.

As of now, Cyber Observables aren't meant to be used outside of Observed Data SDOs (right, @ikiril01?), but that doesn't mean people won't try. What makes it tricky is that Observables use references to refer to other Observables, but the referenced IDs don't exist on the Observables themselves, only within the context of an Observed Data SDO. Yet another point in favor of allowing users to build up graphs of Observables (as dicts) and only validating when passing into the Observed Data structure.

I agree 100% that _valid_refs is poorly placed; this is one reason it's semi-"private"; we didn't like the API but it's the best we could come up with at the time. I like your idea @chisholm, and think it may lead to a better solution.

@clenk, what do you think?

gtback • May 23 '18 15:05

It seems like this issue has split into several threads:

  • an easy way to create "simple" observed_data objects
  • making a graph of observables (which is difficult because all links are "local", SROs between observed_data are "discouraged" by the spec and I don't think anyone envisioned observable refs to be very complex)
  • dealing with objects that take many steps to create, but are difficult to instantiate because of immutability

But maybe the first point is the only useful one to explore. It seems to me that observed_data should represent simple "points" of data - not relationships between them. I think the idea of having observable refs was that some properties of one cyber_observable can be expressed as a reference to another. No sense having different ways to express the same concept. But the referenced objects do not exist on their own - in other words the properties are "owned" by the observed_data SDO - and are not visible outside.

Of course, as you said @gtback, it's hard to stop people from trying to make all kinds of crazy complex observables.

Patterns should be used to discover relationships between observed data.

@chisholm, I'd be interested in knowing what your graphs looked like when you were trying to create the network_traffic objects from the httpd logs.

rpiazza • May 23 '18 19:05

... Cyber Observables aren't really immutable since they use a dict as an underlying data store. They present an immutable interface ...

The general idea is not crazy (distinguishing between "logical" and "physical" constness); some languages have support specifically for that idea (thinking of C++). Making all observable references transparent would certainly be a big design shift. Big enough that I guess I never considered it to be in-scope. And I think you (Greg) have expressed to me the intention that these "foundational" libraries be a more direct translation from the spec. One could imagine designing SDO APIs that way too. After all, SROs are just edges in a big graph. It's graphs within graphs :)

The dict-based "builder" API idea is a simpler idea which is external to the current core API, so it can help without requiring big changes to the core.

As of now, Cyber Observables aren't meant to be used outside of Observed Data SDOs (right, ikiril01?), but that doesn't mean people won't try.

I don't think this is about supporting crazy unintended usages, it's pragmatic: observed-data SDOs can have some complex substructure, and so they're more complicated to build. Maybe the spec designers didn't think it would usually need to be very complex, but they also didn't prevent it. I don't know if there's enough real-world experience with the idea to know how complicated these SDOs will be in practice, but the potential is there.

I'd be interested in knowing what your graphs looked like when you were trying to create the network_traffic objects from the httpd logs.

What I built for each log line was like a two-atom "molecule": it was a network-traffic observable connected to an ipv4-addr observable via the src_ref property. In fact, I'll just paste a code snip here:

network_traffic = {
    "type": "network-traffic",
    "start": timestamp,
    "src_ref": {
        "type": "ipv4-addr",
        "value": src_ip
    },
    "protocols": ["http"],
    "extensions": {
        "http-request-ext": {
            "request_method": method,
            "request_value": path,
            "request_version": version.lower()
        }
    }
}

Being able to combine both observables into the same dict was very handy :) The logs didn't have a dest IP, but in general, there could also be a third "atom" involved.

It seems like this issue has split into several threads:

I think your second and third bullets are related. One reason an ObservedData SDO might take several steps to create and be difficult to bring into being all at once (bullet 3) is because you have to deal with this observable graph (bullet 2). I think Greg has mentioned that observed-data was intended to contain only "related" observables, not be a dumping ground for anything you might want to throw in. But on the other hand, who's to say what a human analyst might consider "related"? It seems pretty subjective.

chisholm • May 24 '18 16:05