aas-specs icon indicating copy to clipboard operation
aas-specs copied to clipboard

No UTF32 Characters in the Regex for Strings

Open sebbader-sap opened this issue 1 year ago • 19 comments

Describe the bug

The regex pattern in the JSON Schema has only UTF-16 characters, while constraint AASd-130 demands the following: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$

Where JSON Schema, e.g., https://github.com/admin-shell-io/aas-specs/blob/2ab08f92bdd1d44edc1cfee52552fe5429d2178e/schemas/json/aas.json#L44C22-L44C36

      "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

Additional context Needs to be adopted in the SwaggerHub Domains for Part 1 and Part 2.

sebbader-sap avatar Feb 28 '24 13:02 sebbader-sap

@sebbader-sap we transpiled the patterns into UTF-16 since most JSON schema engines we tested operated on UTF-16 and could not handle UTF-32.

It is a trade-off between correctness and practicality -- if we put UTF-32 in JSON schema (e.g., the pattern you mentioned: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$), many JSON schema validators will fail. I suppose any schema validator written as a library for C# or Java will be in that group.

You can test it online. The first answer on Google for "JSON schema validator" for me is https://www.jsonschemavalidator.net/. This validator does support UTF-32:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

The following value passes:

"test"

This is not a big change in aas-core-codegen, so whatever the decision, it shouldn't be hard to fix.

mristin avatar Feb 28 '24 14:02 mristin

Please consider also the SHACL -- I think the same issue appears there as well.

mristin avatar Feb 28 '24 14:02 mristin

The original pattern from the constraint has the same problems with OpenAPI-based validators, as they usually translate the YAML into JSON Schema --> then using the same JSON Schema Validation libraries with the same UTF-32 problems.

I am uncertain how to proceed now. Requirement 1: I want to transform aas.json into the Part 1 Domain. Requirement 2: Attributes from Part 2 with the same meaning as Part 1 attributes shall have exactly the same regex pattern. Requirement 3: Data conforming to Constraint AASd-130 should survive a validation. Requirement 4: Widely used libraries should accept the schema / OpenAPI files.

#3 and #4 are conflicting with each other. However, for Java-based servers, I need to adjust the pattern from Part 2 anyway, so I then rather want to see it 100% correct in the OpenAPI files and do the implementation-specific adjustments on top of it.

Which then means that my actually used OpenAPI file is not string-equals to the IDTA published one anymore...

sebbader-sap avatar Feb 28 '24 14:02 sebbader-sap

@BirgitBoss I think we need a formal decision for all parts. Either way, the Part 2 Domain must go the same way as the Part 1 Domain & the schemas.

sebbader-sap avatar Feb 28 '24 14:02 sebbader-sap

Because there was some clarification needed in the taskforce:

^: Asserts the start of the string.

[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.

\x09: ASCII horizontal tab. \x0A: ASCII linefeed (newline). \x0D: ASCII carriage return. \x20: ASCII space. -: Represents a range. \uD7FF: The upper limit of the Basic Multilingual Plane (BMP) in UTF-16. \uE000-\uFFFD: Represents the range of characters from the start of the supplementary planes up to the last valid Unicode character (excluding surrogate pairs). \u00010000-\u0010FFFF: Represents the range of valid surrogate pairs used for characters beyond the BMP. *: Allows for zero or more occurrences of the characters within the character class.

$: Asserts the end of the string.

g1zzm0 avatar Mar 06 '24 18:03 g1zzm0

Maybe we should change the Constraint AASd-130 in the following way:

An attribute with data type "string" shall consist of these characters only: The string contains only valid Unicode characters encoded in UTF-16 format The string can include common characters like tabs, newlines, carriage returns, and spaces. It allows a broad range of Unicode characters, including those beyond the Basic Multilingual Plane (BMP) which are represented using surrogate pairs in UTF-16 encoding. It ensures that the entire string adheres to the rules of UTF-16 encoding, which is a standard way of representing a wide range of characters from different languages.

g1zzm0 avatar Mar 06 '24 19:03 g1zzm0

I think that the most important bit is missing: this constraint is required, so that the text can be represented in XML.

For example, you can not represent \x00 in XML text, not even with �.

mristin avatar Mar 06 '24 19:03 mristin

https://stackoverflow.com/questions/496321/utf-8-utf-16-and-utf-32

g1zzm0 avatar Mar 07 '24 09:03 g1zzm0

Here is a short example as illustration.

The smiley "😀" is represented as character code 128512. This code is encoded in UTF-32 as "\U0001F600". In UTF-16, this is encoded as "\ud83d\ude00".

Hence, if you want to match this smiley with an regex engine that uses UTF-16, you have to write the pattern "\ud83d\ude00" even though it is a single character. If your regex engine operates on UTF-32, you can simply write "\U0001F600".

mristin avatar Mar 08 '24 11:03 mristin

For example, this schema works on an on-line JSON schema tester:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"
}

This matches "😀" on https://www.jsonschemavalidator.net/.

The smiley does not test with:

{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}

mristin avatar Mar 08 '24 11:03 mristin

Maybe we should change the Constraint AASd-130 in the following way:

Update from the latest state of Part 1 V3.1.0: Description for AASd-130 is already extended: https://github.com/admin-shell-io/aas-specs/blob/c9d6c3beb85ee13680ac543f602fc1e96fc57f9c/documentation/IDTA-01001/modules/ROOT/pages/Spec/IDTA-01001_Metamodel_Constraints.adoc?plain=1#L73

sebbader-sap avatar Mar 12 '24 07:03 sebbader-sap

Proposal from a meeting of us (@mristin, @g1zzm0, and myself):

  1. AASd-130 shall be described using ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ --> no further change needed.
  2. This pattern is mapped to ^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ for the schemas. In particular, this is the case for the JSON Schema (which is already the case), OpenAPI Part 1 (also already the case, as Part 1 is a translation of the JSON Schema), and OpenAPI Part 2. It's not needed to introduce the pattern to the XML Schema, as XML is the one reason why AASd-130 is needed at first.

Therefore, the following activities are needed:

  • [ ] Replace the current pattern in OpenAPI Part 2 V3.0.2 to the one in 2. See https://github.com/admin-shell-io/aas-specs-api/pull/256
  • [ ] Add a design decision to the serialisation formats in Part 1 stating that the AASd-130 pattern is expressed by another one in the schemas, and that this is due to the case that the currently available tooling is generally operating on UTF-16 characters, therefore usability is significantly higher. Later, we may change the pattern to the original AASd-130 one.
  • [ ] Get formal approval of the AAS Workstream

sebbader-sap avatar Mar 12 '24 07:03 sebbader-sap

Side-effect: Depending on the implementation technology, developers must replace the pattern with the technology-matching regex variant of this pattern.

Example: ^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$ is the pattern in the official JSON schema.

  1. If the server is implemented in Python, which regex engine is already UTF-32 capable, the needed pattern is ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$ (the original one from AASd-130)
  2. If the server is coded with Java, the JSON Schema works out of the box.
  3. If it's in C#, even another pattern is needed: "^[\\u{9}\\u{a}\\u{d}\\u{20}-\\u{d7ff}\\u{e000}-\\u{fffd}\\u{10000}-\\u{10ffff}]*$"
  4. TypeScript: "^[\t\n\r -\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]*$"

sebbader-sap avatar Mar 12 '24 08:03 sebbader-sap

Decision Proposal TF Metamodel AAS 2024-03-27

Change formulation of Contraint AASd-130 from

Constraint AASd-130: an attribute with data type "string" shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

to Constraint AASd-130: an attribute with data type "string" shall be restricted to the characters as defined in XML Schema 1.0, i.e. the string shall consist of these characters only: ^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$.

Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. See https://www.w3.org/TR/xml/#charsets for more information on XML Schema 1.0 string handling.

  • updated and extended formulation regEx etc. to be added as well: https://github.com/admin-shell-io/aas-specs/pull/381

@g1zzm0 : please check

BirgitBoss avatar Mar 27 '24 14:03 BirgitBoss

Proposal from a meeting of us (@mristin, @g1zzm0, and myself): [...]

2. This pattern is mapped to `^([\\t\\n\\r -\ud7ff\ue000-\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$` 

[...]

This representation makes problems in swagger representation

image

How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning): "^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

@mristin may you please have a look whether the regex we are using is really ok? Thank you!

BirgitBoss avatar Apr 17 '24 08:04 BirgitBoss

@BirgitBoss wrote:

How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning): "^([\t\n\r-\ud7ff\ue000-\ufffd]|\ud800[\udc00-\udfff]|[\ud801-\udbfe][\udc00-\udfff]|\udbff[\udc00-\udfff])*$"

There is no single standard syntax for regular expressions. It all depends on the engine that you plan to use and support.

Best you fix the engine & test against it, and then also document somewhere why you picked that engine and not another one.

Whatever engine you pick, the particular syntax will be incompatible with some other engine.

mristin avatar Apr 17 '24 10:04 mristin

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange.

sebbader-sap avatar Apr 17 '24 12:04 sebbader-sap

But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\ud800") in the same pattern seems pretty strange.

Ah, I haven't even noticed that -- yes, it should be consistent.

mristin avatar Apr 17 '24 12:04 mristin

Thank you @sebbader-sap and @mristin for reviewing, so we change to "^([\\t\\n\\r-\\ud7ff\\ue000-\\ufffd]|\\ud800[\\udc00-\\udfff]|[\\ud801-\\udbfe][\\udc00-\\udfff]|\\udbff[\\udc00-\\udfff])*$"

to have a consistent way and it would also be supported by swagger.

BirgitBoss avatar Apr 17 '24 13:04 BirgitBoss

Workstream AAS Specs accepted

BirgitBoss avatar Jun 13 '24 08:06 BirgitBoss

solved in IDTA-01001-3-0-1: https://industrialdigitaltwin.org/wp-content/uploads/2024/06/IDTA-01001-3-0-1_SpecificationAssetAdministrationShell_Part1_Metamodel.pdf

BirgitBoss avatar Jul 16 '24 10:07 BirgitBoss