graal icon indicating copy to clipboard operation
graal copied to clipboard

[GR-49770] Restrict the Expressivity of Patterns in `resource-config.json`

Open vjovanov opened this issue 1 year ago • 10 comments

TL;DR

Native Image currently accepts Java regular expressions as file patterns in resource-config.json. This is problematic for two reasons:

  1. It is error-prone as users often miss simple patterns in resources: The errors are common due to wrongly written (or missing) \\Q and \\E characters.
  2. It is computationally demanding to check if a query that does not hit a resource should throw a MissingResourceRegistrationError or return a null. With regular expressions, it is necessary to check all regular expressions in the build against the current resource path.

In this proposal, we restrict the resource-matching patterns to a subset of the glob patterns to simplify pattern matching and reduce computational complexity. The patterns that would be supported are:

  1. The * pattern that matches any number of characters including none within the current directory.
  2. The ** as a name component to recursively match any number of layers of directories.

Proposed patterns allow users to write queries to bulk-include resources, for example, assets/**/* or images/**/*.png. However, they can be canonicalized into a tree-like structure that allows fast checking of negative queries at run time.

The new format will allow only the inclusion patterns:

{
  "resources": [
      {
        "glob": "assets/**"
      },
      {
        "module": "myJavaModule",
        "glob": "images/**/*.png"
      }
  ]
}

Goals

  1. Simplify resource patterns to make writing of resource patterns easier.
  2. Reduce the cost of run-time checks for negative resource queries.

Non-Goals

  1. Make resource-pattern writing harder.
  2. Disallow matching multiple resources with one entry in resource-config.json.

Backward Compatibility

The new format will be output by the agent and described in all documentation as the default. The reachability metadata will be migrated to the new format immediately. The previous versions of GraalVM must be modified to accept the new format as well.

The current format for resources will be supported until the majority of the ecosystem adopts the new format.

vjovanov avatar Sep 24 '23 19:09 vjovanov

I support the idea of using globbing instead of allowing arbitrary regex for resource inclusion.

However, we should ensure that this new format also allows per-module resource inclusion like the current format. See: https://github.com/oracle/graal/blob/5cfb880702ed1b4bfc1d25f9d6a570d137f3fd05/docs/reference-manual/native-image/Resources.md?plain=1#L102

This new format give us the opportunity to have a dedicated field for per-module resource inclusion instead of encoding an optional module-name prefix into the pattern name like we did before.

olpaw avatar Nov 06 '23 15:11 olpaw

Also we should not break compatibility with the existing regex based solution but instead deprecate it. I.e. whenever we see a resource-config.json with some json substructure in resources/include or resources/exclude we emit a warning message that we will remove support for pattern-based resource inclusion in the future.

That requires that we use a different json substructure for the new resource inclusion approach. E.g. something like

{
  "include": [
      {
        "pattern": "assets/**"
      },
      {
        "module": "myJavaModule",
        "pattern": "images/**/*.png"
      }
  ]
}

that would not conflict with the existing data because there we do not allow "include" on top-level right now.

olpaw avatar Nov 06 '23 16:11 olpaw

I fully support the idea of the module field here. I would just not use the "include" at all. The resources with an array should signify that the file uses a new pattern. I will update the description.

vjovanov avatar Nov 07 '23 15:11 vjovanov

I fully support the idea of the module field here. I would just not use the "include" at all. The resources with an array should signify that the file uses a new pattern. I will update the description.

The reason I used "include" instead of "resources" for the json-field that holds the array, is so that we do not conflict with the existing definition where we also have the "resources" json-field but it is a json-object instead of a json-array. If we overload the same field name for the new format its harder for the users to gradually switch over to the new format because they cannot have old-style resource definitions and new-style resource definitions in the same resource-config.json file.

But if we do not think that is a problem in practice we can indeed reuse the field name.

olpaw avatar Nov 08 '23 10:11 olpaw

Few notes regarding my discussions with both @vjovanov and @olpaw :

  • Because we already have "old" and "new" format, this format will be the third one. Therefore here are the things we should do:
    • print deprecation message for the first format (where resources are list of patterns and patterns are regexes)
    • if there is an existing resource-config file that user has (that is written in previous format), we should try to merge old and new files automatically. Since old files are using regex and new one will use glob, we should try to transform regexes into globs so we can print all in new format. In some cases we won't be able to do that and in that case we should throw an exception and ask user to manually resolve regex -> glob transformations for list of all patterns we could't resolve (same as resolving git merge conflicts)
    • we should use glob name instead of pattern because first format was the same as this one but used regex instead of glob
  • Currently we are iterating through all patterns for every single resource. We can do it in more efficient way using trie data structure. Here is some basic implementation idea:
    • Since we are using modules, first level of children should be modules, and patterns that don't have module (classpath entries) should be under ALL_UNNAMED child.
    • Trie should be implemented to have more compact structure where nodes should have as much characters as they can (before split is required).
    • Wildcards like * and ** are regular nodes in trie
    • During matching of some resource against trie, we should go as deep as we can. Once we can't go further deep in the trie (we are in * or ** node) we should run FileSystem#getPathMatcher for all children after current node (which should be far less work than matching against all possible patterns we have)
    • This structure should be ImageSingleton so that it can be used in runtime as well for purpose of throwing Missing metadata exceptions
    • In the end we can also think about having one big string which will be combination of all globs. This implementation would:
      • use less memory than trie
      • easier for matching because we can use FileSystem#getPathMatcher
      • more complex for building such a string than creating a trie
      • less readable and harder to debug probably
  • Since we will use FileSystem#getPathMatcher, we should stick to the rules for * and ** defined there and write in our docs what we assume the meaning of those wildcards is
  • We should think about all use cases, but at the moment we think that * and ** is powerful enough to describe all we need currently

dnestoro avatar Nov 09 '23 11:11 dnestoro

Also since we have conditions, probably leaf nodes should store information about conditions for pattern that ends in that node? @vjovanov @olpaw

dnestoro avatar Nov 09 '23 11:11 dnestoro

Once we can't go further deep in the trie (we are in * or ** node) we should run FileSystem#getPathMatcher for all children after current node

Looking int the implementations of FileSystem#getPathMatcher I can see that they all rely on sun.nio.fs.Globs#toUnixRegexPattern. In other words for each time we are processing of a * or ** node we are creating a regex that is then used to match. This will defeat the purpose of being an efficient implementation.

Also not sure if we are willing to have the regex bits of JDK as a core part of every image.

However, if we are willing to have the regex bits as a core image dependency then we would likely be better off with a single optimized regex that we create from all the glob patterns that we need to take into account. If this is done in an intelligent way I expect this to be the most space efficient representation of a resource-matcher we can have in the image. e.g. for glob patterns:

"foo/bar"
"foo/bar/*"
"foo/war/**"

we can create one regex like "foo/((bar/[^/]\+)|(war/.\+))".

N.b Conceptually the effort to create such a single optimized regex out of all glob patterns is equivalent to the building of the trie from the glob patterns.

TL;DR

  • If we are creating our own trie from the glob-entries we should not rely on FileSystem#getPathMatcher
  • If we are fine with jdk regex as a core image component, an optimized single regex from all glob patters is probably more efficient.

Pick your poison ;-)

olpaw avatar Nov 09 '23 12:11 olpaw

I'd suggest using the Trie without FileSystem#getPathMatcher, since java.util.Pattern is far from ideal in situations where just a simple call to String#indexOf('/') would most likely already suffice.

djoooooe avatar Nov 09 '23 13:11 djoooooe

Those are great points! I would implement the basic glob patterns from this article. Easy to link and explain, not system-specific, easy to index, and fast at run time.

vjovanov avatar Nov 14 '23 10:11 vjovanov

Also since we have conditions, probably leaf nodes should store information about conditions for pattern that ends in that node? @vjovanov @olpaw

Yes, we will have to check these conditions at run time in the future so they need to be stored in the trie. If all descendants of the same ancestor would have the same condition we could hoist it up the tree, but I am not sure this optimization is worth the complexity. We will see this when we build the trees in practice.

vjovanov avatar Nov 14 '23 10:11 vjovanov

Fixed with https://github.com/oracle/graal/pull/8715. Thank you @dnestoro!

vjovanov avatar Jun 11 '24 15:06 vjovanov