malli icon indicating copy to clipboard operation
malli copied to clipboard

re schema doesn't properly validate values with line breaks

Open lasiltan opened this issue 2 years ago • 6 comments

(malli.core/validate [:re #"^thing$"] "thing\n")
=> true

Compare to

(re-matches #"^thing$" "thing\n")
=> nil

lasiltan avatar Mar 06 '23 07:03 lasiltan

This ultimately boils down to the difference in java.util.regex.Matcher methods find and matches (clojure.core/re-findvs. clojure.core/re-matches).

(let [pattern (Pattern/compile "^thing$")
      matcher (.matcher pattern "thing\n")]
  (.matches matcher))
=> false

(let [pattern (Pattern/compile "^thing$")
      matcher (.matcher pattern "thing\n")]
  (.find matcher))
=> true

The javadoc on java.util.regex.Matcher/find states:

     * @return  {@code true} if, and only if, a subsequence of the input
     *          sequence matches this matcher's pattern

whereas the javadoc for java.util.regex.Matcher/matches states:

Attempts to match the entire region against the pattern.

To me, at least, the functionality of java.util.regex.Matcher/matches, i.e clojure.core/re-matches, is what I'm expecting.

lasiltan avatar Mar 07 '23 15:03 lasiltan

Interesting. Tested with few alternatives:

Plumatic

(require '[schema.core :as schema])
(schema/validate #"thing" "thing\n")
; => "thing\n" (match)

JSON Schema Spec

{"type": "string", "pattern": "^thing$"}
// => null (match)

Javascript

RegExp('^thing$').exec('thing\n')
// => null (no match)

So, Malli works currently like Plumatic and JSON Schema do, but not like default JavaScript.

ikitommi avatar Mar 09 '23 07:03 ikitommi

Do you have more external references (on languages / libraries etc) on what is the correct way to handle this? Options:

  1. keep the current way as others (JSON Schema + Plumatic) are doing it like this
  2. change the default as it's obviously less correct, need to handle JSON Schema mapping
  3. new option to m/-re-schema to override this, e.g. users can configure this

ikitommi avatar Mar 09 '23 08:03 ikitommi

I think the behavior for re-find is definitely a little sneaky with it allowing a newline at the end, but I think using re-find behavior is more intuitive than re-matches behavior.

For example if I have the schema

[:re #"\dcans"]

I would expect that to match any string that contained <num>cans anywhere in the string, e.g. one can, 2cans, 3cans or something like that. If I wanted my string to start with <num>cans, then I'd use ^. re-matches behavior is basically forcing ^ and $ semantics when sometimes that's not what you want. You can always opt in to that behavior by using ^ and $.

camsaul avatar Apr 14 '23 01:04 camsaul

I should also mention that you can always use \z instead of $ if you actually want the end of the input, $ is technically The end of a line and \z is the end of the input.

(re-find "^string\z" "string\n") => nil

See https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html

camsaul avatar Apr 14 '23 01:04 camsaul

I didn't even know about \z. I think that's good enough for me.

lasiltan avatar Apr 12 '24 06:04 lasiltan