oscal-cli

How do you treat byte order marks (BOM) for Unicode?

Open mrjens14 opened this issue 5 months ago • 10 comments

We plan to fork and use this CLI tool in our DevSecOps pipeline and would like to understand how byte order marks (BOMs) are handled. For example, the byte sequence EF BB BF is the BOM for UTF-8. When a text file starts with these three bytes, it signals that the file is encoded in UTF-8. However, this sequence appears to be lost when converting OSCAL documents, regardless of whether the format is XML, JSON, or YAML. The resulting Git diffs can be difficult to interpret without a hex editor.
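To make the byte-level difference visible without a hex editor, a check along the following lines could be used. This is only a minimal sketch: the class name `BomCheck` and its methods are hypothetical and are not part of oscal-cli or liboscal-java. It relies solely on the standard `java.nio.file` API (Java 11 or later for `readNBytes`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of oscal-cli: reports whether data starts with EF BB BF.
final class BomCheck {
    // The UTF-8 encoding of U+FEFF, used as a BOM / UTF-8 signature.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the given bytes begin with the three-byte UTF-8 BOM.
    static boolean hasUtf8Bom(byte[] head) {
        if (head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads at most the first three bytes of the file and checks them.
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasUtf8Bom(in.readNBytes(UTF8_BOM.length));
        }
    }
}
```

Running such a check on the input and on the converted output would show whether a diff is caused only by the three leading bytes.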

mrjens14 avatar Aug 01 '25 05:08 mrjens14

oscal-cli uses liboscal-java, which imports the OSCAL metaschema definition files that the developer points to when building a new version of oscal-cli. The OSCAL metaschema files are defined in Metaschema, which is implemented in metaschema-java. The only supported data types are the ones defined in Metaschema, and you can find them here.

iMichaela avatar Aug 01 '25 16:08 iMichaela

@mrjens14 - if you want to further discuss your issue, please reach out to our team at [email protected]

iMichaela avatar Aug 01 '25 16:08 iMichaela

@iMichaela Thank you for your reply. With everything-as-code in mind, I think encoding strategies are worth discussing. I will write an email to [email protected]

mrjens14 avatar Aug 04 '25 06:08 mrjens14

Jens, did you look at the source code? Do you (1) have a code fix you'd like us to apply to the source, or (2) would you like us to figure out the fix based on your oscal-cli usage scenario?

If the former (option 1), we can review your code changes submitted as a PR and incorporate them into the oscal-cli tool.

If the latter (option 2), it would help us to have example input files (not necessarily the real ones, just a minimal example of your workflow), as well as a description of which inputs and outputs are currently produced versus expected in the pipeline you are running or trying to run. Unicode, especially UTF-8, is native to Java, so I would not expect the fix to be too difficult. It would also be really helpful for us to understand your overall workflow in detail.

JustKuzya avatar Aug 05 '25 14:08 JustKuzya

@mrjens14: adding some information in the issue, to have full context here.

oscal-cli encodes its output artifacts in UTF-8.

For UTF-8, the encoding scheme consists merely of the UTF-8 code units (= bytes) in sequence. Hence, there is no issue of big- versus little-endian byte order for data represented in UTF-8. I understand that the presence or absence of a BOM causes files with otherwise identical content to be reported as different by git diff.

Since the UTF-8 code units are 8 bits in size, the serialized order of the bytes must not depart from the order defined by the UTF-8 encoding form. According to The Unicode Standard 5.0, Section 2 (page 36), use of a BOM is neither required nor recommended for UTF-8, but a BOM may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

Our experiments indicated that OSCAL artifacts with a BOM are validated successfully by oscal-cli. When converted to another format using oscal-cli, an artifact no longer has the BOM because it was not deemed necessary for UTF-8. A BOM could be added by the serializer, but I still have concerns over the impact on current GRC tools that might not properly parse UTF-8 + BOM artifacts, since they did not expect a BOM in current and previous versions. If the community does not favor adding the BOM, we would like to discuss with you and your team the possibility of having your DevOps pipeline add the BOM to any OSCAL artifact that does not have it.
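A pipeline-side post-processing step of the kind suggested above could look roughly like the following. This is a sketch under assumptions, not oscal-cli functionality: the class `BomWriter` and its methods are hypothetical names, and the step is assumed to run on the converted file after oscal-cli has finished.

```java
import java.util.Arrays;

// Hypothetical post-processing helper for a pipeline; not part of oscal-cli.
final class BomWriter {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the content begins with EF BB BF.
    static boolean startsWithBom(byte[] content) {
        return content.length >= UTF8_BOM.length
                && content[0] == UTF8_BOM[0]
                && content[1] == UTF8_BOM[1]
                && content[2] == UTF8_BOM[2];
    }

    // Returns the content with a UTF-8 BOM prepended, unless one is already present.
    static byte[] withBom(byte[] content) {
        if (startsWithBom(content)) {
            return content;
        }
        byte[] out = new byte[UTF8_BOM.length + content.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(content, 0, out, UTF8_BOM.length, content.length);
        return out;
    }

    // Returns the content without a leading UTF-8 BOM, if it has one.
    static byte[] withoutBom(byte[] content) {
        return startsWithBom(content)
                ? Arrays.copyOfRange(content, UTF8_BOM.length, content.length)
                : content;
    }
}
```

In a pipeline this could be applied as `Files.write(path, BomWriter.withBom(Files.readAllBytes(path)))`. The inverse, `withoutBom`, could instead normalize inputs so that diffs are not affected by the three leading bytes.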

I created a branch with example files with a BOM in src/test/resources/cli, derived from the examples without a BOM, so we can discuss this issue further. We will follow up via email to coordinate a call.

iMichaela avatar Aug 05 '25 22:08 iMichaela

Thank you very much for offering a call. You’ve correctly identified the nature of the issue we’re facing, and we should discuss it before implementing any breaking changes. I will contact you by email to provide more details and arrange a call.

mrjens14 avatar Aug 06 '25 08:08 mrjens14

@iMichaela

but I still have concerns over the impact to current GRC tools that might not properly parse the UTF-8 +BOM artifacts since they did not expect it in current and previous versions.

Thank you, I see it the same way, as there is technically no need for a BOM.

  • Not RFC 3629 (page 7) compliant (should not be used as a signature)
  • Technically unnecessary, because the encoding is byte-oriented
  • No automatic handling in Java, Rust, Go, C, C++, JavaScript, PHP, or Bash/shell scripts
  • Interoperability suffers from the deviation from the standard
  • Problems with GRC solutions (depending on their implementation and programming language)
  • Further follow-on problems are unclear (a surprise package)
  • Contradicts the basic idea of why international standards exist at all

Rusty-Weasel avatar Oct 23 '25 10:10 Rusty-Weasel

@Rusty-Weasel - Thank you for your support and well-documented perspective regarding @mrjens14 's request.

I will close the issue in a few days, after we give @mrjens14 a chance to respond. The issue can be reopened if the problem surfaces again.

iMichaela avatar Nov 13 '25 18:11 iMichaela

The latest Unicode Standard (17.0) includes a flexibility clause for software developers on page 1166 of the PDF version. It says, "if consuming UTF-8, recognize and discard a BOM". In my view, this is a powerful statement that opens the door to future development. @iMichaela Thanks for your notification. It is up to you to close this issue. We will discuss the question further in the BSI working group "Werkzeugbox". Please let me know if you would like to join. As an editor, I can send you an invitation, and I look forward to promoting the OSCAL idea. All issues in which you are directly involved would be written in English.
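For a consumer written in Java, "recognize and discard" could be sketched as follows. Java's built-in UTF-8 decoder leaves a leading BOM in the decoded text as the character U+FEFF, so the consumer has to drop it explicitly. The class name `BomAwareDecoder` is hypothetical and does not describe how oscal-cli or any GRC tool actually reads its input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical consumer-side helper: decodes UTF-8 and discards a leading BOM.
final class BomAwareDecoder {
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        // The standard decoder does not strip the BOM; it shows up as U+FEFF.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

With such a step in place, a file with a BOM and the same file without one decode to the same string.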

mrjens14 avatar Nov 13 '25 20:11 mrjens14

@mrjens14 Page 1166 of the PDF version also says:

  • The UTF-8 encoding scheme permits, but does not require, a BOM to be present.
  • Some text processing tools fail to handle BOMs correctly...
  • A text processing tool must maintain additional state...
  • Concatenation of text containing a BOM requires care.
  • In situations where text is known to be encoded as UTF-8, a BOM consumes storage space unnecessarily

Furthermore, and this is your main use case, since your goal is to generate a document that others (like me) will later use in a GRC application:

  • If producing UTF-8, include a BOM only if explicitly directed to do so, or if a BOM is known to be required by a protocol. (In my view, this case does not apply here.)

What you want is also mentioned, on page 1167:

  • Otherwise, include a BOM when authoring a UTF-8 text file that contains non-ASCII characters, is not targeting a specific protocol, but which may be opened by applications that will not assume UTF-8 by default. (This is useful on systems like Microsoft Windows...)

Please also note the very next point, which says:

  • Otherwise, do not include a BOM

Source: https://www.unicode.org/versions/Unicode17.0.0/UnicodeStandard-17.0.pdf

Therefore, I would finally suggest that you get a UTF-8-capable editor, and the problem is solved. (There are plenty of capable editors, and for MS Visual Studio I already gave you an FAQ in the other discussion.)

So I see no reason to change anything in OSCAL.

Rusty-Weasel avatar Nov 13 '25 22:11 Rusty-Weasel