protovalidate icon indicating copy to clipboard operation
protovalidate copied to clipboard

[Feature Request] unique repeated messages by field

Open derekperkins opened this issue 1 year ago • 6 comments

Feature description: There's already a built in rule for scalar uniqueness in repeated fields. Most repeated messages have an id field that defines uniqueness, that could be used. https://github.com/bufbuild/protovalidate/blob/main/proto/protovalidate/buf/validate/validate.proto#L3199-L3213

Proposed implementation or solution:

This isn't valid CEL, just a rough description of what I'm expecting

  optional bool unique = 3 [(priv.field).cel = {
    id: "repeated.unique_by_field"
    message: "repeated value must contain unique items, identified by an id field"
    expression: "this.unique(this.unique_id)"
  }];

Contribution: Minimal time to contribute currently

derekperkins avatar Sep 18 '23 20:09 derekperkins

@derekperkins what you're proposing is meant to support this right? (copied from protoc-gen-validate issue)

Suppose I have an array of messages with field id.

message Student {
  string id = 1;
}

message Class {
 // Could your suggestion be used to validate that students[n].id is not repeated within this array?
  repeated Student students = 1; 
}

Have you found a way to get this behavior with the current API?

RoxKilly avatar Oct 18 '23 14:10 RoxKilly

Yes, this is exactly the use case I'm suggesting. No, I haven't solved this today with protovalidate, but I'm interested to hear back if you find a good workaround

derekperkins avatar Oct 18 '23 16:10 derekperkins

You may be able to get away with this using what's available in CEL today on Class.students:

this.map(student, student.id).unique()

This uses the e.map CEL standard macro and our unique custom function.

rodaine avatar Oct 30 '23 19:10 rodaine

You may be able to get away with this using what's available in CEL today on Class.students:

this.map(student, student.id).unique()

This uses the e.map CEL standard macro and our unique custom function.

I was able to get this to work by specifying the CEL expression on a message, rather than the repeated field. Here is a code sample:

message Environment {
	option (buf.validate.message).cel = {
		id: "repeated.unique_by_field"
		message: "repeated value must contain unique items, identified by the name field"
		expression: "this.variables.map(v, v.name).unique()"
	};

	// Variables to use within the environment. Each variable must have a unique `name` field.
	repeated EnvVar variables = 1;
//	[(buf.validate.field).cel = {
//		id: "repeated.unique_by_field"
//		message: "repeated value must contain unique items, identified by the name field"
//		expression: "this.map(v, v.name).unique()"
//	}];
}

message EnvVar {
	// Name of the environment variable. Must be unique.
	string name = 1;
	// Value of the environment variable.
	string value = 2;
}

The buf.validate.message CEL expression works like a dream, returning the following violation.

"violations": [
	{
		"fieldPath": "environment",
		"constraintId": "repeated.unique_by_field",
		"message": "repeated value must contain unique items, identified by the name field"
	}
]

However, as you might have noticed. The fieldPath is not what is wanted in this scenario. While the expression is being ran on environment, that expression only ever refers to the variables field, making it more difficult to properly match validation errors to the exact field they are related to.

If we use the buf.validate.field CEL rule instead of the one on the message, we hit an issue with the way CEL expressions are processed on repeated fields. It seems that when you manually specify a CEL expression on a repeated field, it is processed on each item within the associated array.

compilation error: failed to compile embedded type Environment for UpdateRequest.environment: compilation error: failed to compile expression repeated.unique_by_field: ERROR: \u003cinput\u003e:1:1: expression of type 'EnvVar' cannot be range of a comprehension (must be list, map, or dynamic)

| this.map(v, v.name).unique() | ^

Assuming that the issue preventing the expression from being evaluated on the repeated field itself is resolved, there is just one other issue. Which is handling what field needs to be unique. The constraint ID can be used for localization, but right now it is just a generic repeated.unique_by_field which doesn't indicate any information regarding what field was used for the unique validation. There are two issues here, first is the fieldPath not being specific enough on what indexes failed the unique check (and the associated field potentially), and the constraintId not having a proper way of communicating what field the unique check is being ran on. Right now the best solution is to either have your message contain the field name, or if you need proper localization, to add the field name to the constraint key manually. But ideally there should be a proper solution for handling things like this in the future.

matthewpi avatar Nov 15 '23 20:11 matthewpi

After looking at different ways of doing this, could we perhaps have an annotation to mark unique fields:

message Student {
  string first_name = 1;
  string last_name = 2;
  option (buf.validate.message).unique.keys = ["first_name","last_name"];
}
message StudentList {
  repeated Student students = 1 [(buf.validate.field).repeated.unique = true];
}

We could set a restriction on the types of the fields in unique.keys. Say it would be simplest if we only allow:

  • allow scalar(optional too), Message(provided it has (buf.validate.message).unique.keys option) Note: it's very important to allow optional here as it's crucial for many validations where you need to differentiate between missing and default value
  • do not allow: oneof, repeated, Message without (buf.validate.message).unique.keys option

Do you think this could work?

mivanovcpd avatar Nov 16 '23 00:11 mivanovcpd

compilation error: failed to compile embedded type Environment for UpdateRequest.environment: compilation error: failed to compile expression repeated.unique_by_field: ERROR: \u003cinput\u003e:1:1: expression of type 'EnvVar' cannot be range of a comprehension (must be list, map, or dynamic)

@matthewpi, I think this is a bug on protovalidate-go's part (likely in the type checking). This should have worked. I'll take a look and see if I can repro and add some tests for it. [Edit: protovalidate-go#80]


There are two issues here, first is the fieldPath not being specific enough on what indexes failed the unique check (and the associated field potentially), and the constraintId not having a proper way of communicating what field the unique check is being ran on.

@matthewpi, there's #17 and #125 which aim to make it easier to get at the underlying info of a violation error. That said, with a custom CEL expression, you have control over the id and error message and can even construct it within the expression to use string formatting against the variables provided:


repeated EnvVar variables = 1 [(buf.validate.field).cel = {
  id: "env.vars.unique", // the ID can be very specific to the field in question
  expression: "this.map(v, v.name).unique() ? '' : 'variables are expected to be unique, got %s'.format([this.map(v, v.name)])"
}

We intentionally do not include any data in the standard constraint error messages or as part of the violation error as there are security and PII concerns with echoing potentially sensitive data. You are free to do so in custom expressions however, and we're planning to address the linked issues above in the near term.


We could set a restriction on the types of the fields in unique.keys.

@mivanovcpd, a multi-field uniqueness comparison can be problematic for languages that don't have equivalents of Java's equals and hashCode overrides. For protobufs in Go, for instance, you cannot use msg1 == msg2 as Go only performs address comparison for pointer types and further the messages contain internal state that may be different between messages even if their fields are identical (so the protobuf library includes a proto.Equal which is used for comparison). Likewise, Go does not allow overriding the hash function of a type. This means we cannot use a map to make the uniqueness check O(n) as we'll need to explicitly proto.Equal each member of the repeated field to the others (which would run in O(n^2)). Alternatively, especially with a subset of fields, we'd need to write a custom hashing utility explicitly for this purpose if we wanted to maintain an O(n) runtime.

That said, if we can determine a way to make this universally efficient and not a performance landmine, either in the case of uniqueness across the entire message or a subset of fields (using a FieldMask WKT for this would offer the most flexibility), I'm open to adding such a standard constraint.

rodaine avatar Nov 16 '23 17:11 rodaine