protovalidate icon indicating copy to clipboard operation
protovalidate copied to clipboard

[BUG] `string.well_known_regex` cannot handle binary HTTP header values

Open ash2k opened this issue 1 year ago • 2 comments

Description

well_known_regex is on a string type. string must be a valid UTF-8 encoded Unicode code point sequence. But HTTP header values don't have to be. I.e. valid HTTP header values cannot be represented as a proto string. Yet the validation rule is on the string type.

Steps to Reproduce

message HeaderValues {
  string val = 1 [(buf.validate.field).string.well_known_regex = KNOWN_REGEX_HTTP_HEADER_VALUE];
}
func TestAbc(t *testing.T) {
	val := []byte("\xff")
	valid := utf8.Valid(val)
	assert.True(t, valid, "Invalid UTF-8")

	msg := &HeaderValues{
		Val: "\xff",
	}
	_, err := proto.Marshal(msg)
	assert.NoError(t, err, "proto.Marshal() failed")

	v, err := protovalidate.New()
	require.NoError(t, err)
	err = v.Validate(msg)
	assert.NoError(t, err)
}

Prints:

=== RUN   TestAbc
    prototool_test.go:16: 
        	Error Trace:	file_test.go:16
        	Error:      	Should be true
        	Test:       	TestAbc
        	Messages:   	Invalid UTF-8
    prototool_test.go:27: 
        	Error Trace:	file_test.go:27
        	Error:      	Received unexpected error:
        	            	string field contains invalid UTF-8
        	Test:       	TestAbc
        	Messages:   	proto.Marshal() failed
--- FAIL: TestAbc (0.02s)

FAIL

Process finished with the exit code 1

Expected Behavior

Validation rule should be on bytes. At least there too, not just on string.

Actual Behavior

Validation rule on string that is problematic.

Possible Solution

Additional Context

HTTP header value spec: https://datatracker.ietf.org/doc/html/rfc9110#name-field-values

ash2k avatar Oct 14 '24 21:10 ash2k

See original provenance of these regex here: https://github.com/bufbuild/protoc-gen-validate/pull/297

I'd actually rather see these removed from protovalidate, and users can provide predefined constraints that match the semantics they want (according to the thread above, the RFC recommends only using ASCII).

rodaine avatar Oct 16 '24 17:10 rodaine

As I explained in the description, I think the regexes the way they are right now cannot be used. So I agree something needs to change. To be clear, this is not a problem for me, I just wanted to report the issue I noticed.

ash2k avatar Oct 17 '24 07:10 ash2k