protovalidate
protovalidate copied to clipboard
[Feature Request] unique repeated messages by field
Feature description: There's already a built in rule for scalar uniqueness in repeated fields. Most repeated messages have an id field that defines uniqueness, that could be used. https://github.com/bufbuild/protovalidate/blob/main/proto/protovalidate/buf/validate/validate.proto#L3199-L3213
Proposed implementation or solution:
This isn't valid CEL, just a rough description of what I'm expecting
optional bool unique = 3 [(priv.field).cel = {
id: "repeated.unique_by_field"
message: "repeated value must contain unique items, identified by an id field"
expression: "this.unique(this.unique_id)"
}];
Contribution: Minimal time to contribute currently
@derekperkins what you're proposing is meant to support this right? (copied from protoc-gen-validate
issue)
Suppose I have an array of messages with field id
.
message Student {
string id = 1;
}
message Class {
// Could your suggestion be used to validate that students[n].id is not repeated within this array?
repeated Student students = 1;
}
Have you found a way to get this behavior with the current API?
Yes, this is exactly the use case I'm suggesting. No, I haven't solved this today with protovalidate, but I'm interested to hear back if you find a good workaround
You may be able to get away with this using what's available in CEL today on Class.students
:
this.map(student, student.id).unique()
This uses the e.map
CEL standard macro and our unique
custom function.
You may be able to get away with this using what's available in CEL today on
Class.students
:
this.map(student, student.id).unique()
This uses the
e.map
CEL standard macro and ourunique
custom function.
I was able to get this to work by specifying the CEL expression on a message, rather than the repeated field. Here is a code sample:
message Environment {
option (buf.validate.message).cel = {
id: "repeated.unique_by_field"
message: "repeated value must contain unique items, identified by the name field"
expression: "this.variables.map(v, v.name).unique()"
};
// Variables to use within the environment. Each variable must have a unique `name` field.
repeated EnvVar variables = 1;
// [(buf.validate.field).cel = {
// id: "repeated.unique_by_field"
// message: "repeated value must contain unique items, identified by the name field"
// expression: "this.map(v, v.name).unique()"
// }];
}
message EnvVar {
// Name of the environment variable. Must be unique.
string name = 1;
// Value of the environment variable.
string value = 2;
}
The buf.validate.message
CEL expression works like a dream, returning the following violation.
"violations": [
{
"fieldPath": "environment",
"constraintId": "repeated.unique_by_field",
"message": "repeated value must contain unique items, identified by the name field"
}
]
However, as you might have noticed. The fieldPath
is not what is wanted in this scenario. While the expression is being ran on environment
, that expression only ever refers to the variables
field, making it more difficult to properly match validation errors to the exact field they are related to.
If we use the buf.validate.field
CEL rule instead of the one on the message, we hit an issue with the way CEL expressions are processed on repeated fields. It seems that when you manually specify a CEL expression on a repeated field, it is processed on each item within the associated array.
compilation error: failed to compile embedded type Environment for UpdateRequest.environment: compilation error: failed to compile expression repeated.unique_by_field: ERROR: \u003cinput\u003e:1:1: expression of type 'EnvVar' cannot be range of a comprehension (must be list, map, or dynamic)
| this.map(v, v.name).unique() | ^
Assuming that the issue preventing the expression from being evaluated on the repeated field itself is resolved, there is just one other issue. Which is handling what field needs to be unique. The constraint ID can be used for localization, but right now it is just a generic repeated.unique_by_field
which doesn't indicate any information regarding what field was used for the unique validation. There are two issues here, first is the fieldPath
not being specific enough on what indexes failed the unique check (and the associated field potentially), and the constraintId not having a proper way of communicating what field the unique check is being ran on. Right now the best solution is to either have your message contain the field name, or if you need proper localization, to add the field name to the constraint key manually. But ideally there should be a proper solution for handling things like this in the future.
After looking at different ways of doing this, could we perhaps have an annotation to mark unique fields:
message Student {
string first_name = 1;
string last_name = 2;
option (buf.validate.message).unique.keys = ["first_name","last_name"];
}
message StudentList {
repeated Student students = 1 [(buf.validate.field).repeated.unique = true];
}
We could set a restriction on the types of the fields in unique.keys. Say it would be simplest if we only allow:
- allow scalar(optional too), Message(provided it has
(buf.validate.message).unique.keys
option) Note: it's very important to allow optional here as it's crucial for many validations where you need to differentiate between missing and default value - do not allow: oneof, repeated, Message without
(buf.validate.message).unique.keys
option
Do you think this could work?
compilation error: failed to compile embedded type Environment for UpdateRequest.environment: compilation error: failed to compile expression repeated.unique_by_field: ERROR: \u003cinput\u003e:1:1: expression of type 'EnvVar' cannot be range of a comprehension (must be list, map, or dynamic)
@matthewpi, I think this is a bug on protovalidate-go's part (likely in the type checking). This should have worked. I'll take a look and see if I can repro and add some tests for it. [Edit: protovalidate-go#80]
There are two issues here, first is the fieldPath not being specific enough on what indexes failed the unique check (and the associated field potentially), and the constraintId not having a proper way of communicating what field the unique check is being ran on.
@matthewpi, there's #17 and #125 which aim to make it easier to get at the underlying info of a violation error. That said, with a custom CEL expression, you have control over the id and error message and can even construct it within the expression to use string formatting against the variables provided:
repeated EnvVar variables = 1 [(buf.validate.field).cel = {
id: "env.vars.unique", // the ID can be very specific to the field in question
expression: "this.map(v, v.name).unique() ? '' : 'variables are expected to be unique, got %s'.format([this.map(v, v.name)])"
}
We intentionally do not include any data in the standard constraint error messages or as part of the violation error as there are security and PII concerns with echoing potentially sensitive data. You are free to do so in custom expressions however, and we're planning to address the linked issues above in the near term.
We could set a restriction on the types of the fields in unique.keys.
@mivanovcpd, a multi-field uniqueness comparison can be problematic for languages that don't have equivalents of Java's equals
and hashCode
overrides. For protobufs in Go, for instance, you cannot use msg1 == msg2
as Go only performs address comparison for pointer types and further the messages contain internal state that may be different between messages even if their fields are identical (so the protobuf library includes a proto.Equal
which is used for comparison). Likewise, Go does not allow overriding the hash function of a type. This means we cannot use a map to make the uniqueness check O(n)
as we'll need to explicitly proto.Equal
each member of the repeated field to the others (which would run in O(n^2)
). Alternatively, especially with a subset of fields, we'd need to write a custom hashing utility explicitly for this purpose if we wanted to maintain an O(n)
runtime.
That said, if we can determine a way to make this universally efficient and not a performance landmine, either in the case of uniqueness across the entire message or a subset of fields (using a FieldMask WKT for this would offer the most flexibility), I'm open to adding such a standard constraint.