vector
vector copied to clipboard
Event API RFC
We need an RFC for the design of the event module. It has gone through a lot of organic, incremental changes that have left it in a state that only makes sense by taking into account the previous states. Since this is such a core part of Vector, we need to take the time to do some real design and ensure that the interfaces encourage efficient bug-free use.
In particular, we should look to address the following issues:
- Contributor confusion around flattened vs nested APIs (e.g. #2410)
- Performance of those commonly used APIs (e.g. #2334, #2104)
- Compatibility with WASM-based FFI (e.g. #1891)
- Contributor ergonomics around the global log schema (e.g. would separate metadata simplify implementation?)
- User ergonomics for specifying nested keys (e.g. #2340, #2339, #2069, #2066, #2338, #2400)
- User confusion around "metadata" that we add internally (e.g. https://github.com/timberio/vector/issues/2398#issuecomment-617559933, #1448)
Related to https://github.com/timberio/vector/pull/2692
@Hoverbear this was closed by #2692, right?
Dumping thoughts about the Event, LogEvent and Value APIs that would've helped me out whilst implementing remap. We currently have three degrees of separation between an Event and the underlying data:
Event(enum) -> LogEvent(struct) -> Value
A LogEvent type simply wraps an IndexMap, and is pretty much the same as a Value with a forced map root. However, the API for a LogEvent is separate to Value, so if we add useful functionality to LogEvent we will miss out on those APIs on Value types, and the same applies visa-versa.
From what I can tell the rationality for having a LogEvent type is simply to enforce IndexMap at the root of a log, which is fair enough but I don't see why the query APIs for LogEvent and Value should be separate. It would instead make sense to me for LogEvent to be stripped down and add useful get_path and insert_path APIs to Value exclusively, which are exposed globally.
Otherwise any function that recursively walks a Value needs to be implemented separately to whatever works on the root LogEvent, even if the functionality is the exact same, yuck!
Also, the RFC currently lists plans for Path that seem sensible to me, just wanted to mention that for my purposes it's important to be able to form a path from pre-parsed path segments, as my parser already extracts escaped segments and array indexes so I'd be looking for something that looks like this: val.get_path(Path::key(foo)).unwrap().get_path(Path::index(bar)).
It'd also be nice to be able to construct a log event with multiple fields in one call, something like:
Event::from(vec![
("foo".into(), Value::from("first value")),
("bar.baz".into(), Value::Boolean(true)),
])
It's annoying for testing purposes to always have a timestamp field injected but I can live with that
@Hoverbear bump. Can you close this if https://github.com/timberio/vector/pull/2692 covers it? If not, can you note what’s left and which parts of @Jeffail’s feedback is also not covered?
I believe #2692 covers the usability side, we might want to invest more time into a performance work enabled by these new usability enhancements after.
Points #2692 does not cover which this speaks to:
- Contributor ergonomics around the global log schema (e.g. would separate metadata simplify implementation?)
- User confusion around "metadata" that we add internally (e.g. #2398 (comment), #1448)
I spoke with @lukesteensen about this yesterday. I don't really think this warrants a full follow up RFC after #4066.
I do think that two RFCs could to follow:
- @bruceg and I spoke prior about adding metadata to events. This will need an RFC to fully define how it should work.
- The log schema is it's own problem entirely, and needs it's own rethink.
Let me know if you want to chat or flesh out a full RFC about either or both!
Metadata
Observing (post #4066):
https://github.com/timberio/vector/blob/44bc2c9844abac900bebe1d6eee84e8eab51c8ca/lib/shared/src/event/log_event/mod.rs#L75-L82
https://github.com/timberio/vector/blob/8e676aa5eb12c7a43ca790da56c16dc7629c97ad/lib/shared/src/event/metric/mod.rs#L17-L28
It should be possible to add a metadata field to both, then have metadata_get() calls or something. The Event type can proxy these calls into it's contained metric/log/trace.
Metadata likely could benefit from some special considerations:
- Metadata may be desirable to have "write once" characteristics, similar to the tracing Record and it's visitors: https://docs.rs/tracing/0.1.22/tracing/span/struct.Record.html
- Metadata should have
#[serde(skip)](https://serde.rs/field-attrs.html#skip) or otherwise skip serialization. This should further cement that metadata is for internal use, as it's not user facing. - I do not think user-script like remap or lua should be able to manipulate metadata, only read it.
- Perhaps it would be beneficial to use the borrow checker to limit components to specific subdomains (eg
metadata.transforms.$TRANSFORM_NAME) so that devs/users can have reliable semantics/ux. - We should restrict schema data to be fully structured and only use refs (like &str), primitives, or lazy statics. Both to keep the size per-event down, and to keep this data on the stack whenever possible.
- It may be beneficial to have each component have an initialized default
metadatasubdomain (that will likely not change) and for events to holdCowvalues of them, allowing events to often hold just pointers to CoW refs.
There are numerous uses for this data. As @bruceg and I discussed previously this is useful for E2E latency, as well as E2E acknowledgement, diagnostics/tracing, and debugging.
I also note that if we allow configs to have a schema in the metadata we could unlock the ability to support global schema changes at without SIGHUP. The topology holding the config could swap out the schema that's being written by the components.
Schemas
We hold the global log schema here:
https://github.com/timberio/vector/blob/227e494d408fa65f07de517323dc516a4dfd4533/src/config/log_schema.rs#L7-L17
It's a crutch, or a crux if you prefer! This should totally be held by the topology and passed around safely throughout Vector so that we could safely alter it without SIGHUPs.
We should fix this. I think the method enabled by Metadata (as described above) is probably the best, but otherwise, we should be passing these schemas along during topology building, and consider interfaces to allow dynamic updating if desired.
Additionally, the current LogSchema is pretty much a map of aliases right now:
https://github.com/timberio/vector/blob/227e494d408fa65f07de517323dc516a4dfd4533/src/config/log_schema.rs#L19-L30
Typically I think of schemas to include data types. It might be worthwhile to use this structure as something like a coercer or a data shaper before being serialized, or during deserialization (to enable enrichment, not reject bad values).
These schemas would likely be closely involved and related with any future idea of a "Codec".
Around Event API Performance:
- We should investigate https://docs.rs/smallvec/1.5.1/smallvec/ or https://docs.rs/dynstack/0.4.0/dynstack/ to fit in lookups and some of our Event API. Our (opt-in since #4066) is unfortunately heavy on heap allocation.
- We should consider if it's possible to use CoW events during Fanout, I suspect many fanout applications do not need to do much modification. I'm kind of iffy on this idea.
- Furthering that, we might consider CoW values.
- There is a compatibility shim between
remap_lang::Pathandshared::Lookup(Buf). It's slow as it has to do string operations. The two types have slightly different semantics and based on my discussions with @FungusHumungus and @JeanMertz I understand thatremap_lang::Pathwill be deprecated post #4066. - I do not think most of our protobuf code is ideal, and this code is often on the hotpath due to buffering. Observe: https://github.com/timberio/vector/blob/76030efcb0ab14a590776be7b7f940a50af262b6/src/event/mod.rs#L64-L145 https://github.com/timberio/vector/blob/76030efcb0ab14a590776be7b7f940a50af262b6/proto/event.proto#L7-L62 There is a lot of extra copying and duplicating here. We should be directly serializing and deserializing to the protobuf type, not using a proxy. We can benefit from using something like https://github.com/danburkert/prost#serializing-existing-types perhaps, or optimizing these implementations in some way.
Closing since much of this work seems to have been completed. We'll let anything else bubble back up.