aas-specs
aas-specs copied to clipboard
No UTF32 Characters in the Regex for Strings
Describe the bug
The regex pattern in the JSON Schema has only UTF-16 characters, while constraint AASd-130 demands the following:
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
Where JSON Schema, e.g., https://github.com/admin-shell-io/aas-specs/blob/2ab08f92bdd1d44edc1cfee52552fe5429d2178e/schemas/json/aas.json#L44C22-L44C36
"pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
Additional context Needs to be adopted in the SwaggerHub Domains for Part 1 and Part 2.
@sebbader-sap we transpiled the patterns into UTF-16 since most JSON schema engines we tested operated on UTF-16 and could not handle UTF-32.
It is a trade-off between correctness and practicality -- if we put UTF-32 in JSON schema (e.g., the pattern you mentioned: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
), many JSON schema validators will fail. I suppose any schema validator written as a library for C# or Java will be in that group.
You can test it online. The first answer on Google for "JSON schema validator" for me is https://www.jsonschemavalidator.net/. This validator does support UTF-32:
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "AssetAdministrationShellEnvironment",
"type": "string",
"pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}
The following value passes:
"test"
This is not a big change in aas-core-codegen, so whatever the decision, it shouldn't be hard to fix.
Please consider also the SHACL -- I think the same issue appears there as well.
The original pattern from the constraint has the same problems with OpenAPI-based validators, as they usually translate the YAML into JSON Schema --> then using the same JSON Schema Validation libraries with the same UTF-32 problems.
I am uncertain how to proceed now. Requirement 1: I want to transform aas.json into the Part 1 Domain. Requirement 2: Attributes from Part 2 with the same meaning as Part 1 attributes shall have exactly the same regex pattern. Requirement 3: Data conforming to Constraint AASd-130 should survive a validation. Requirement 4: Widely used libraries should accept the schema / OpenAPI files.
#3
and #4
are conflicting with each other. However, for Java-based servers, I need to adjust the pattern from Part 2 anyway, so I then rather want to see it 100% correct in the OpenAPI files and do the implementation-specific adjustments on top of it.
Which then means that my actually used OpenAPI file is not string-equals to the IDTA published one anymore...
@BirgitBoss I think we need a formal decision for all parts. Either way, the Part 2 Domain must go the same way as the Part 1 Domain & the schemas.
Because there was some clarification needed in the taskforce:
^: Asserts the start of the string.
[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.
\x09: ASCII horizontal tab. \x0A: ASCII linefeed (newline). \x0D: ASCII carriage return. \x20: ASCII space. -: Represents a range. \uD7FF: The upper limit of the Basic Multilingual Plane (BMP) in UTF-16. \uE000-\uFFFD: Represents the range of characters from the start of the supplementary planes up to the last valid Unicode character (excluding surrogate pairs). \u00010000-\u0010FFFF: Represents the range of valid surrogate pairs used for characters beyond the BMP. *: Allows for zero or more occurrences of the characters within the character class.
$: Asserts the end of the string.
Maybe we should change the Constraint AASd-130 in the following way:
An attribute with data type "string" shall consist of these characters only: The string contains only valid Unicode characters encoded in UTF-16 format The string can include common characters like tabs, newlines, carriage returns, and spaces. It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding. It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.
I think that the most important bit is missing: this constraint is required, so that the text can be represented in XML.
For example, you can not represent \x00
in XML text, not even with �
.
https://stackoverflow.com/questions/496321/utf-8-utf-16-and-utf-32
Here is a short example as illustration.
The smiley "😀" is represented as character code 128512. This code is encoded in UTF-32 as "\U0001F600". In UTF-16, this is encoded as "\ud83d\ude00".
Hence, if you want to match this smiley with an regex engine that uses UTF-16, you have to write the pattern "\ud83d\ude00" even though it is a single character. If your regex engine operates on UTF-32, you can simply write "\U0001F600".
For example, this schema works on an on-line JSON schema tester:
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "AssetAdministrationShellEnvironment",
"type": "string",
"pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
}
This matches "😀" on https://www.jsonschemavalidator.net/.
The smiley does not test with:
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "AssetAdministrationShellEnvironment",
"type": "string",
"pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}
Maybe we should change the Constraint AASd-130 in the following way:
Update from the latest state of Part 1 V3.1.0: Description for AASd-130 is already extended: https://github.com/admin-shell-io/aas-specs/blob/c9d6c3beb85ee13680ac543f602fc1e96fc57f9c/documentation/IDTA-01001/modules/ROOT/pages/Spec/IDTA-01001_Metamodel_Constraints.adoc?plain=1#L73
Proposal from a meeting of us (@mristin, @g1zzm0, and myself):
- AASd-130 shall be described using
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
--> no further change needed. - This pattern is mapped to
^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$
for the schemas. In particular, this is the case for the JSON Schema (which is already the case), OpenAPI Part 1 (also already the case, as Part 1 is a translation of the JSON Schema), and OpenAPI Part 2. It's not needed to introduce the pattern to the XML Schema, as XML is the one reason why AASd-130 is needed at first.
Therefore, the following activities are needed:
- [ ] Replace the current pattern in OpenAPI Part 2 V3.0.2 to the one in 2. See https://github.com/admin-shell-io/aas-specs-api/pull/256
- [ ] Add a design decision to the serialisation formats in Part 1 stating that the AASd-130 pattern is expressed by another one in the schemas, and that this is due to the case that the currently available tooling is generally operating on UTF-16 characters, therefore usability is significantly higher. Later, we may change the pattern to the original AASd-130 one.
- [ ] Get formal approval of the AAS Workstream
Side-effect: Depending on the implementation technology, developers must replace the pattern with the technology-matching regex variant of this pattern.
Example:
^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$
is the pattern in the official JSON schema.
- If the server is implemented in Python, which regex engine is already UTF-32 capable, the needed pattern is
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
(the original one from AASd-130) - If the server is coded with Java, the JSON Schema works out of the box.
- If it's in C#, even another pattern is needed:
"^[\\u{9}\\u{a}\\u{d}\\u{20}-\\u{d7ff}\\u{e000}-\\u{fffd}\\u{10000}-\\u{10ffff}]*$"
- TypeScript:
"^[\t\n\r -\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]*$"
Decision Proposal TF Metamodel AAS 2024-03-27
Change formulation of Contraint AASd-130 from
Constraint AASd-130: an attribute with data type "string" shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.
to
Constraint AASd-130: an attribute with data type "string" shall be restricted to the characters as defined in XML Schema 1.0, i.e. the string shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.
Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. See https://www.w3.org/TR/xml/#charsets for more information on XML Schema 1.0 string handling.
- updated and extended formulation regEx etc. to be added as well: https://github.com/admin-shell-io/aas-specs/pull/381
@g1zzm0 : please check
Proposal from a meeting of us (@mristin, @g1zzm0, and myself): [...]
2. This pattern is mapped to `^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$`
[...]
This representation makes problems in swagger representation
How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning):
"^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
@mristin may you please have a look whether the regex we are using is really ok? Thank you!
@BirgitBoss wrote:
How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning): "^([\t\n\r-\ud7ff\ue000-\ufffd]|\ud800[\udc00-\udfff]|[\ud801-\udbfe][\udc00-\udfff]|\udbff[\udc00-\udfff])*$"
There is no single standard syntax for regular expressions. It all depends on the engine that you plan to use and support.
Best you fix the engine & test against it, and then also document somewhere why you picked that engine and not another one.
Whatever engine you pick, the particular syntax will be incompatible with some other engine.
But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange.
But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\ud800") in the same pattern seems pretty strange.
Ah, I haven't even noticed that -- yes, it should be consistent.
Thank you @sebbader-sap and @mristin for reviewing, so we change to
"^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
to have a consistent way and it would also be supported by swagger.
Workstream AAS Specs accepted
solved in IDTA-01001-3-0-1: https://industrialdigitaltwin.org/wp-content/uploads/2024/06/IDTA-01001-3-0-1_SpecificationAssetAdministrationShell_Part1_Metamodel.pdf