
RFE: Schema support

Open jhalliday opened this issue on Jul 28, 2016 • 11 comments

As a data platform engineer, I wish to:

  • have the logger generate a JSON schema for its output, so that e.g. a typesafe API can be generated for working with the log data in other tools.
  • provide a JSON schema file and have the logger validate that output conforms to the schema and throw a configuration error if it does not.

To facilitate this, it would help if the various JsonProvider classes were self-describing, e.g. by adding a method to the JsonProvider interface that returns the field names and corresponding types. For the majority of providers this is trivial, though for those where the schema depends on the runtime logging event content, e.g. ArgumentsJsonProvider, it's a bit more involved.
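For illustration, a minimal sketch of what such a self-describing contract could look like; SchemaContributor and SchemaField are hypothetical names, not part of the current JsonProvider API:

```java
import java.util.List;

// Hypothetical companion contract a provider could implement alongside
// JsonProvider; none of these types exist in the library today.
public interface SchemaContributor {

    // One entry per JSON field the provider writes, e.g. ("@timestamp", STRING).
    List<SchemaField> getSchemaFields();

    enum JsonType { STRING, NUMBER, BOOLEAN, OBJECT, ARRAY }

    // Pairs a field name with its JSON type, enough to assemble a JSON schema
    // describing the encoder's overall output.
    final class SchemaField {
        private final String name;
        private final JsonType type;

        public SchemaField(String name, JsonType type) {
            this.name = name;
            this.type = type;
        }

        public String getName() { return name; }
        public JsonType getType() { return type; }
    }
}
```

A provider such as loggerName could then report a single string field up front, whereas ArgumentsJsonProvider could not answer until it sees an event, which is the dynamic case discussed below.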

I've prototyped this by subclassing the Providers and Encoder, an approach which I can continue to use if necessary. However, it would be cleaner to push the additions upstream.

Does the community think this feature would be generally useful?

jhalliday avatar Jul 28 '16 10:07 jhalliday

Interesting idea.

Looking at the existing providers for logging events, the following could contribute to a static schema, or validate a given schema at startup:

  • timestamp
  • version
  • message
  • rawMessage
  • loggerName
  • threadName
  • logLevel
  • logLevelValue
  • callerData
  • stackTrace
  • contextName
  • nestedField

And these providers could not, since their values are dynamic:

  • context (may be able to, if it only reads the context once at startup)
  • mdc
  • tags
  • logstashMarkers
  • pattern (since it can include mdc)
  • arguments

How do you propose handling the dynamic providers?

philsttr avatar Jul 29 '16 04:07 philsttr

The current iteration of the prototype uses two different interfaces. The static version is simply SchemaData getSchema(); whilst the dynamic version is

SchemaData getSchema(ILoggingEvent evt);
boolean checkConformance(ILoggingEvent evt, SchemaData schema);

so in 'please generate a schema' mode with dynamic elements, the schema generation is deferred until the first logged event rather than done at init time. In validating mode each event is tested against the schema as it passes through the Encoder's doEncode method. Using two separate interfaces rather than adding methods to JsonProvider also keeps it backwards compatible: pre-existing Providers need to be retrofitted only if they are used with the schema features, and will continue to work just fine in environments that don't enable that feature. Useful for not breaking 3rd party extensions.
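Spelled out in full, the straw-man pair of interfaces looks roughly like this (all names hypothetical; SchemaData is just an opaque placeholder for whatever form the generated schema takes):

```java
import ch.qos.logback.classic.spi.ILoggingEvent;

// Placeholder for whatever representation the generated schema takes.
final class SchemaData {
}

// Providers whose output shape is fixed at configuration time (timestamp,
// loggerName, ...): the schema can be produced at init time.
interface StaticSchemaAware {
    SchemaData getSchema();
}

// Providers whose output shape depends on the logged event (mdc, arguments,
// logstashMarkers, ...): schema generation is deferred to the first event,
// and each subsequent event can be checked cheaply against that schema.
interface DynamicSchemaAware {
    SchemaData getSchema(ILoggingEvent event);
    boolean checkConformance(ILoggingEvent event, SchemaData schema);
}
```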

jhalliday avatar Jul 29 '16 08:07 jhalliday

in 'please generate a schema' mode with dynamic elements, the schema generation is deferred until the first logged event rather than done at init time

Hmm. This won't really work consistently, since each event could have its own dynamic content, i.e. not all events would be guaranteed to follow the schema generated by the first logged event.

I'm hesitant to include this mode if any dynamic providers are configured, since the schema is not deterministic, and is not guaranteed to be followed by all events. It could potentially be supported for the static providers only. Maybe throw an initialization error if any dynamic providers are used. But since most of the power of this library is in the dynamic providers, apps would lose a lot of the functionality provided by this library.

After thinking about this a little more, I'm also a little hesitant to include the schema validation mode as well. IMHO, schema validation would be better handled downstream (i.e. in logstash), since I would think you would want to perform this validation downstream in the centralized logging pipeline (i.e. not at the start of the stream). For example, if you have 1000 apps/jvms configured to send logs, I'd want to implement schema validation in one place (somewhere downstream in the logging processing pipeline), not in 1000 places. And since schema validation introduces processing overhead, having it done in the pipeline would allow you to horizontally scale the pipeline according to your log volume, without impacting performance of the upstream applications.

philsttr avatar Jul 31 '16 23:07 philsttr

since each event could have their own dynamic content. i.e. Not all events would be guaranteed to follow the schema generated by the first logged event.

yes, that's kind of the point. I want to enforce schema compliance, stopping records that are not correctly formed. Naturally the schema has to be general enough to cover the allowed polymorphism of records. There are two approaches I've been experimenting with here: a larger number of loggers, each with its own more specific schema, or an Encoder which can select between multiple schemas based on some characteristic of the logging event, such as the value of a specific field (e.g. the message string).
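As a rough sketch of the second approach, keyed on the message string (the selector type is hypothetical and reuses the SchemaData placeholder from the straw man above):

```java
import java.util.Map;
import ch.qos.logback.classic.spi.ILoggingEvent;

// Hypothetical: picks a schema per event based on the message string used as
// an event-type key (e.g. "itemSoldEvent"). Not part of the library.
final class MessageKeyedSchemaSelector {

    private final Map<String, SchemaData> schemasByMessage;
    private final SchemaData fallback;

    MessageKeyedSchemaSelector(Map<String, SchemaData> schemasByMessage,
                               SchemaData fallback) {
        this.schemasByMessage = schemasByMessage;
        this.fallback = fallback;
    }

    SchemaData select(ILoggingEvent event) {
        // getMessage() is the raw message pattern, which in the later examples
        // doubles as the event-type key.
        return schemasByMessage.getOrDefault(event.getMessage(), fallback);
    }
}
```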

I'm hesitant to include this mode if any dynamic providers are configured

sure, the Encoder could be configured with various modes (NO_SCHEMA, STATIC, DYNAMIC, ...), or you could choose a different Encoder, e.g. StaticSchemaEnforcingEncoder, depending on the desired behavior. Not everyone will want schema support, and some will want static only, not dynamic. But as long as the schema features don't get in the way when not used, why not provide dynamic schema information for those that do want it?
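As a sketch of the mode idea, purely hypothetical rather than an existing configuration option:

```java
// Hypothetical mode switch for a schema-aware encoder.
enum SchemaMode {
    NO_SCHEMA,  // current behaviour, no schema overhead
    STATIC,     // schemas from static providers only; fail startup if dynamic providers are configured
    DYNAMIC     // dynamic providers participate too, deferring schema work to the first event
}
```

A StaticSchemaEnforcingEncoder would then just be the STATIC mode packaged as its own Encoder class.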

I'm also a little hesitant to include the schema validation mode as well.

In a data routing fabric of N sources and M sinks, it's sometimes beneficial to enforce the schema at the source rather than force all sinks to validate it. By catching the problem early, you avoid the overhead of serializing, transporting and storing useless information. You're also in a better position to provide feedback on the problems at the source; the eventual consumers don't necessarily know how to route that feedback back to the generating application. In some architectures there is no centralised place to handle schemas: that's a simplified use case which ignores the use of this JSON output without logstash. The library design couples the encoding format to one specific consumer, which limits its utility. I'm using it more as a generic logback-json-encoder that may or may not be feeding logstash.

jhalliday avatar Aug 02 '16 10:08 jhalliday

Regarding schema validation mode, you've sold me. I'm willing to accept a PR for this if you want to submit one.

However, schema generation mode is still a little unclear to me. I still don't quite understand how this works with dynamic providers.

Naturally the schema has to be general enough to cover the allowed polymorphism of records.

The dynamic providers can insert any information into the output JSON event. So, the generated schema would need to allow any field in the root JSON event. If the schema allows any field in the root JSON event, then everything will pass schema validation. And there's really no point to having a schema that allows anything.

I want to enforce schema compliance, stopping records that are not correctly formed.

I'm still not sold on generating a schema based on the first event logged. Since the first event logged can be non-deterministic, it just seems wrong to base a schema on it. For example, if an application starts up one time, and the first message logged has one schema but the second message logged has a different schema, who is to say which schema is the correct one? What happens if, the next time the application starts, the messages are logged in the opposite order? Is the schema for the first event logged still correct? This can result in different schemas every time an application starts. The non-deterministic behavior of this is what I don't like.

philsttr avatar Aug 13 '16 06:08 philsttr

The polymorphism permitted by the application requirements is usually a very narrow subset of that permitted by the logging API. For many use cases it's further constrained by utilizing a 'logger per event type' model rather than a 'logger per class/package' model. When using the system as a source into an event routing fabric, we have different loggers for different event types, since they wire to different queues/topics and from there to different groups of sinks. Hence

salesLogger.log("itemSoldEvent", append("item", myItemBean));

is valid, whilst

salesLogger.log("itemSoldEvent", append("item", someCustomerBean)); customerAccountLogger.log("itemSoldEvent", append("item", myItemBean));

are bugs in the code that the loosely typed API won't catch, although

customerAccountLogger.log("userRegisteredEvent", append("customer", someCustomerBean));

is fine. For loggers that permit only one event type, a highly constrained schema is possible, taking the place of a strongly typed API.

A more complicated case is queues that permit multiple distinct event types, such as

vacationBookingsLogger.log("itemAddedEvent", append("flight", airlineFlightBean)); vacationBookingsLogger.log("itemRemovedEvent", append("accomodation", hotelBookingCancellationBean));

these cases are less common - usually we decompose to separate queues. However, it's sometimes necessary to preserve strict ordering, which rules out that option. In such cases it's desirable to apply a 'schema per key' pattern, where a specified value from the call, usually the first parameter, is used to select the schema to be applied. The config thus becomes:

or suchlike. I'm not wild about such cases - they tend towards making generation of a strongly typed Java API more desirable, since the compiler will then barf helpfully if we try e.g.

customerAccountLogger.logUserRegisteredEvent(itemBean);

But such cases tend to be in the minority. Also, a generated API won't catch issues with fields that are added inside the logging library layer, such as a timestamp. For those the logging config file needs to be checked against the schema to catch errors where the config file rather than the Java code is incorrect.
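For concreteness, hand-writing the kind of facade such generation would produce for one of the loggers above (CustomerBean is a placeholder bean; the structured-arguments call is the library's real append API):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static net.logstash.logback.argument.StructuredArguments.append;

// Sketch of a generated, strongly typed wrapper around a plain SLF4J logger.
final class CustomerAccountLogger {

    private static final Logger LOG = LoggerFactory.getLogger("customerAccountLogger");

    // Passing an item bean here is now a compile error rather than a
    // malformed record discovered downstream.
    void logUserRegisteredEvent(CustomerBean customer) {
        LOG.info("userRegisteredEvent", append("customer", customer));
    }
}

// Minimal placeholder so the sketch compiles.
class CustomerBean {
}
```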

In schema generation mode, it's not so meaningful to say a message is wrong, only that multiple messages aren't consistent with each other. Schema generation mode allows us to avoid generating inconsistent messages. To avoid generating ones that are outright broken, we need schema validation mode. It may be that schema generation is more useful at build time, to save us maintaining a schema by hand that can easily get out of sync with the app code.

There are two approaches to the problem of 'I need to consume this queue of events': look at a sample of them and derive the schema from it, which is time consuming duplicate effort for N consumers and risks missing e.g. fields that may occur but which don't appear in the sample messages, or have range bounds. Or, have the schema made available by the source application. From the consumer's point of view it doesn't matter how that schema is made available: the app's authors can hand tool it if they like, but that's tedious and error prone compared to having the tooling spit it out. Right now it's hard to look at a logging config file and figure out the schema that messages will use, even for the 100% static case. If the JsonProvider classes are at least somewhat self-describing, it becomes easier to automate.

This problem is mirrored inside the app itself: schema validation can take the 'here is a sample message and a schema, does the message comply with the schema' approach, in which case JsonProvider impls don't need to be schema aware but we pay a per-message runtime cost, or it can take a 'will this config always generate messages consistent with this schema' approach, which is a fixed one-off cost at startup but works only for static providers. For configs which include dynamic providers we take a hybrid approach, using the cheap method for static providers and the more expensive one for the dynamics.
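Sketched with the straw-man interfaces from earlier in the thread (still hypothetical), the hybrid looks something like:

```java
import java.util.List;
import ch.qos.logback.classic.spi.ILoggingEvent;

// Hypothetical: static providers are checked once against the schema at
// startup; only the dynamic providers are re-checked per event.
final class HybridSchemaChecker {

    private final List<DynamicSchemaAware> dynamicProviders;
    private final SchemaData schema;

    HybridSchemaChecker(List<StaticSchemaAware> staticProviders,
                        List<DynamicSchemaAware> dynamicProviders,
                        SchemaData schema) {
        this.dynamicProviders = dynamicProviders;
        this.schema = schema;
        // Fixed one-off cost: a real implementation would compare each static
        // provider's declared fields against the supplied schema here and
        // fail startup on any mismatch.
        staticProviders.forEach(StaticSchemaAware::getSchema);
    }

    // Per-message cost, paid only for the dynamic providers.
    boolean conforms(ILoggingEvent event) {
        return dynamicProviders.stream().allMatch(p -> p.checkConformance(event, schema));
    }
}
```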

Because of this, most of the code needed for schema generation is needed for schema validation too. If providers are self-describing and there is a schema-checking Encoder, then schema generation for the static case is really just a variation on that Encoder which spits out the schema made by interrogating the configured JsonProviders, rather than comparing it to a provided schema. I think that's worth doing at least. Extending generation to deal with the dynamic case is optional, but let's return to the straw-man API:

SchemaData getSchema(ILoggingEvent evt);
boolean checkConformance(ILoggingEvent evt, SchemaData schema);

The getSchema call is a one-time cost, payable on first call rather than at startup. It's going to allocate objects and potentially parse bean classes to generate schema for them, as with the append("key", myBean) examples above. In short, it's expensive. That's why the SchemaData is cached and allows for a fastpath non-allocating comparison call on subsequent log events. Given that SchemaData is there anyhow for schema validation mode, the changes needed to support schema generation mode with dynamic providers don't actually touch the providers, only the Encoder. Since that's the case it is relatively easy to maintain as a separate entity, probably a subclass of Encoder. So I'm not too concerned if you don't want that upstream. It's the JsonProvider changes to implement that API which are painful to maintain independently, since there are a whole bunch of classes with varying internals.
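The caching could be as little as the following, again hypothetical and reusing DynamicSchemaAware and SchemaData from the straw man:

```java
import java.util.concurrent.atomic.AtomicReference;
import ch.qos.logback.classic.spi.ILoggingEvent;

// Hypothetical wrapper: the expensive getSchema(event) runs once, for the
// first event seen; later events take the non-allocating checkConformance path.
final class CachedDynamicSchema {

    private final DynamicSchemaAware provider;
    private final AtomicReference<SchemaData> cached = new AtomicReference<>();

    CachedDynamicSchema(DynamicSchemaAware provider) {
        this.provider = provider;
    }

    boolean conforms(ILoggingEvent event) {
        SchemaData schema = cached.get();
        if (schema == null) {
            // First event: derive and publish the schema; if two threads race,
            // the first published value wins and the other derivation is discarded.
            schema = provider.getSchema(event);
            if (!cached.compareAndSet(null, schema)) {
                schema = cached.get();
            }
        }
        return provider.checkConformance(event, schema);
    }
}
```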

jhalliday avatar Aug 15 '16 10:08 jhalliday

Feel free to submit a PR. I'd like to take a look at the implementation so that I can get a better understanding of how schema generation mode looks. But again, I'm definitely on board with schema validation mode.

You have precisely described my concern about generating a schema for dynamic providers here:

look at a sample of them and derive the schema from it, which is time consuming duplicate effort for N consumers and risks missing e.g. fields that may occur but which don't appear in the sample messages, or have range bounds

My concern is that generating a schema using the first event logged will "risk missing e.g. fields that may occur but which don't appear in the sample messages, or have range bounds"

So, I'm still hesitant about schema generation mode, but I'd like to at least take a look.

If schema generation mode is not accepted, perhaps we can come up with some modifications that will allow you to more easily implement schema generation mode fully yourself. e.g. implement callbacks, hooks, etc

philsttr avatar Aug 28 '16 23:08 philsttr

ok, a preliminary version for the static-only case is at https://github.com/jhalliday/logstash-logback-encoder/tree/schemas. It's easy to see that the changes for flipping from schema validation to schema generation would be confined to the SchemaPlugin implementation, which should allow easy 3rd party extension even if you don't want that functionality in the upstream. I'll layer on the dynamic providers bit once we're done with code review for this part.
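To make the plugin split concrete: the SchemaPlugin name comes from that branch, but the shape below is only a guess at the idea (and reuses the SchemaData placeholder), not the branch's actual API.

```java
// Illustrative guess at the plugin seam, not the real interface in the branch.
interface SchemaPluginSketch {
    // Called once with the schema assembled from the configured static providers.
    void onSchemaAssembled(SchemaData schema);
}

// Validation flavour: fail startup if the assembled schema doesn't match a
// schema supplied in the configuration.
final class ValidatingPlugin implements SchemaPluginSketch {
    private final SchemaData expected;

    ValidatingPlugin(SchemaData expected) {
        this.expected = expected;
    }

    @Override
    public void onSchemaAssembled(SchemaData schema) {
        if (!schema.equals(expected)) {
            throw new IllegalStateException("encoder output does not conform to the configured schema");
        }
    }
}

// Generation flavour: emit the assembled schema for other tools to consume.
final class GeneratingPlugin implements SchemaPluginSketch {
    @Override
    public void onSchemaAssembled(SchemaData schema) {
        System.out.println(schema); // a real version would serialize it to a schema file
    }
}
```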

jhalliday avatar Sep 06 '16 14:09 jhalliday

(Sorry for the delay. I'll try to be quicker in the future)

Looks pretty good so far. I added some minor comments inline on the commit in your repo.

I'm looking forward to seeing how the dynamic providers are supported.

philsttr avatar Oct 09 '16 23:10 philsttr

Did this enhancement get stalled?

Is there some other suggested alternative now?

littleapp avatar Jun 13 '18 10:06 littleapp

@littleapp Yes, stalled.

The only approach right now is to perform schema validation on the destination side.

philsttr avatar Jun 19 '18 03:06 philsttr