vector
Support validating a message with JSON schema
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
We're using Vector as a usage event pipeline. We currently define a JSON schema for each of our event types, which we use to validate events at write time. However, we've been considering using Vector as the entrypoint for some new event types (creating events from an S3 bucket which uses a different format). For this case, it would be nice to transform the data into the correct shape, and then have the pipeline validate it against a schema to ensure it's correct.
Attempted Solutions
So far we haven't attempted anything, but if first-class support is not added, I think I will attempt to write a tool that generates a VRL script to validate an event against a JSON schema.
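As a rough illustration of what such a generator could look like, here is a minimal Python sketch. It handles only a small subset of JSON Schema (`type`, `properties`, `required`) and assumes property names are plain identifiers that can be used directly in a VRL path. The function name `schema_to_vrl` and the error messages are made up for this example; the emitted script relies on VRL's `assert!`, `exists`, and `is_*` type-check functions.

```python
# Maps JSON Schema type names to VRL type-check expressions ({p} is the VRL path).
VRL_TYPE_CHECKS = {
    "string": "is_string({p})",
    "integer": "is_integer({p})",
    "number": "(is_integer({p}) || is_float({p}))",
    "boolean": "is_boolean({p})",
    "object": "is_object({p})",
    "array": "is_array({p})",
    "null": "is_null({p})",
}


def schema_to_vrl(schema, path="."):
    """Return VRL source lines that check `path` against a tiny JSON Schema subset."""
    lines = []
    json_type = schema.get("type")
    if json_type in VRL_TYPE_CHECKS:
        check = VRL_TYPE_CHECKS[json_type].format(p=path)
        lines.append(f'assert!({check}, message: "{path} must be of type {json_type}")')

    required = set(schema.get("required", []))
    for name, sub_schema in schema.get("properties", {}).items():
        child = f".{name}" if path == "." else f"{path}.{name}"
        inner = schema_to_vrl(sub_schema, child)
        if name in required:
            lines.append(f'assert!(exists({child}), message: "{child} is required")')
            lines.extend(inner)
        elif inner:
            # Optional properties are only validated when they are present.
            lines.append(f"if exists({child}) {{")
            lines.extend("  " + line for line in inner)
            lines.append("}")
    return lines


# Example: a schema with one optional string property.
EXAMPLE_SCHEMA = {
    "type": "object",
    "properties": {"alpha": {"type": "string"}},
}
print("\n".join(schema_to_vrl(EXAMPLE_SCHEMA)))
```

The generated script aborts on the first failed assertion, so a remap transform using it would need to be configured to drop or reroute events on abort. Keywords such as `enum`, `pattern`, `items`, or `$ref` are not covered here and would each need their own translation.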
Proposal
- A new global configuration which defines where to load JSON schemas from (similar to enrichment_tables)
- A new VRL function to validate a value against a JSON schema by name
For example, the global configuration may look like:
schemas:
  my_schema:
    type: json_schema
    file_path: /vector-config/schemas/my_schema.json
And the VRL function usage might look like:
is_valid, err = validate_schema(., "my_schema")
References
No response
Version
No response
This would be extremely helpful. I have the same use case.
Instead of a VRL function, we could easily add a new json_schema condition type. That would work for the common use-cases:
- Conditions can be used in tests. (Most of my test conditions could be replaced with JSON schemas.)
- Conditions can be used in transforms. In particular, the route transform allows you to handle unmatched events and has out-of-the-box metrics for matched/unmatched events, too.
I agree that would be a good solution. I'm picturing the syntax as something like:
my_condition:
  type: json_schema
  # Property path to validate. Optional, defaults to `.`
  path: .nested.property
  # The JSON schema
  schema:
    type: object
    properties:
      alpha:
        type: string
That said, one advantage of supporting this in VRL is that it would significantly reduce the number of transforms needed for a pipeline like mine. With the VRL approach, I could have a single remap transform which checks the event type from the object, then validates it against the appropriate schema. However, with the condition approach, I would need an initial route transform which checks the event type and fans out to individual validation route transforms for each event type.
Using a VRL JSON schema check:
flowchart TD
source --> verify_all_event_types -->|fail| verify_all_event_types._unmatched
verify_all_event_types._unmatched --> dlq_sink
verify_all_event_types --->|pass| valid_sink_1 & valid_sink_2
Using a condition JSON schema check:
flowchart TD
source --> route_by_event_type
route_by_event_type -->|event_type is 1| verify_event_type_1 -->|fail| verify_event_type_1._unmatched
route_by_event_type -->|event_type is 2| verify_event_type_2 -->|fail| verify_event_type_2._unmatched
route_by_event_type -->|event_type is 3| verify_event_type_3 -->|fail| verify_event_type_3._unmatched
verify_event_type_1._unmatched --> dlq_sink
verify_event_type_2._unmatched --> dlq_sink
verify_event_type_3._unmatched --> dlq_sink
verify_event_type_1 --->|pass| valid_sink_1 & valid_sink_2
verify_event_type_2 --->|pass| valid_sink_1 & valid_sink_2
verify_event_type_3 --->|pass| valid_sink_1 & valid_sink_2
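To make the fan-out of the condition-based layout concrete, the following Python sketch builds the corresponding `transforms` mapping for a set of event types. It is only an illustration under several assumptions: the `json_schema` condition type is the one proposed in this thread and does not exist today, `.event_type` is an assumed field name, and the helper `condition_layout` and the route names `type_N` and `valid` are invented for this example.

```python
import json


def condition_layout(schemas, source="source"):
    """Return a Vector `transforms` mapping for the condition-based layout.

    `schemas` maps each event type to its JSON schema (a dict).
    """
    router = "route_by_event_type"
    transforms = {
        router: {
            "type": "route",
            "inputs": [source],
            # One VRL condition per event type; `.event_type` is an assumed field name.
            "route": {
                f"type_{t}": {"type": "vrl", "source": f".event_type == {json.dumps(t)}"}
                for t in schemas
            },
        }
    }
    for t, schema in schemas.items():
        transforms[f"verify_event_type_{t}"] = {
            "type": "route",
            "inputs": [f"{router}.type_{t}"],
            # `json_schema` is the condition type proposed in this thread (hypothetical).
            "route": {"valid": {"type": "json_schema", "schema": schema}},
        }
    return transforms


# Three event types, each with a placeholder schema.
schemas = {n: {"type": "object"} for n in (1, 2, 3)}
layout = condition_layout(schemas)
print(json.dumps(layout, indent=2))
```

For N event types this yields N + 1 route transforms, and the dead-letter sink has to list every `verify_event_type_N._unmatched` output as an input, whereas the VRL-based layout needs a single transform regardless of N.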
It does sound like it would be easier to implement the condition though. Maybe we could start with that and consider implementing the VRL check later?