daffodil icon indicating copy to clipboard operation
daffodil copied to clipboard

Rebased - so github testing will run - New Layer API & Improved Implementation

Open mbeckerle opened this issue 11 months ago • 3 comments

This commit fixes the remaining issues open about Daffodil's Layers extension/plugins. Partial progress was in commit ff9ba4. This finishes that work.

  • DAFFODIL-2845 - Layer Improvements (umbrella ticket)
  • DAFFODIL-2844 - LayerNotEnoughDataException ...
  • DAFFODIL-2825 - Define Supported Layering API
  • DAFFODIL-2841 - Layers do not support created SDE or PE/UE Errors
  • DAFFODIL-2843 - How to create PE/UE from a layer

The layer API is all defined in Java clases and interfaces now. One of the built-in layers (Gzip) is written in Java now.

There is only one layer property now: dfdlx:layer which takes a QName of the layer class.

Layer classes are derived from

org.apache.daffodil.runtime1.layers.api.Layer

A layer a name and target namespace, and hence a QName is required to identify it. This namespace is OWNED by the layer. All DFDL variables defined in that namespace are either used to pass parameters to the layer code, or receive results (such as a checksum) back from the layer code. This is enforced. A layer that has no DFDL variables does not have to define a DFDL schema that defines the layer's target namespace, but any layer that uses DFDL variables must define a schema with the layer's namespace as its target namespace, with the variables declared in it (using dfdl:defineVariable).

There is also an abstract base class for defining checksum layers called,

org.apache.daffodil.runtime1.layers.api.ChecksumLayer

All the layer properties about layer length kinds etc. are gone.

dfdlx:layerLengthKind='boundaryMark' and dfdlx:boundaryMark properties are replaced by a layer named 'boundaryMark'.

dfdlx:layerLengthKind='explicit' and dfdlx:layerLength properties are replaced by a layer named 'fixedLength'.

dfdlx:layerLengthKind='implicit' is replaced by just not using one of the two above layers.

dfdlx:layerTransform is replaced by just dfdlx:layer and is now a QName. Layers no longer have both transformers and length-limiters mixed together.

There are two ways to do explicit layer limiting, that differ in an important way.

If you use 'implicit' layer length (no restriction layer), and constrain the layer length by a surrounding element of specified length, then the layer length is controlled when parsing, but NOT controlled when unparsing.

If you use the layer "fixedLength", that dictates the parse and unparse length to be the value of the layerLength variable. Similarly if you use a checksum layer built with the ChecksumLayer base class, the length is controlled for both parsing and unparsing as it works like the fixedLength layer.

Layer Variables:

Each variable is either a parameter passed to a special setter named setLayerVariableParameters, or is a return variable which is populated from a special getter. The special setter args and getters taken together must use up all the DFDL variables defined in the layer's namespace. (See the javadoc for the Layer base class.)

A special getter example might be:

int getLayerVariableResult_checksum() { .... }

This corresponds to, and returns the value that populates, the DFDL variable named 'checksum' in the layer's namespace. That DFDL variable must have a DFDL type corresponding to the int type (xs:unsignedShort or xs:int).

Layers could have multiple result variables, but we have no actual examples of more than one value being returned. (Returning multiple variables is tested. We just have no real use cases for it at this time.)

Layer Exception Handling:

Added setProcessingErrorException(...) method to Layer

This allows the layer to specify that if the layer throws specific exceptions or runtime exceptions that they are converted into processing errors. This eliminates most need for layers to contain try-catches.

Additional Fixes in this Commit:

Fixed gzip layer which was not constructing on Java 17 or 21 due to static initializer differences of some sort.

Removed restriction that a layer can only have a single term inside it.

Added close() to InputSourceDataInputStream, as we weren't closing these. They now implement java.io.Closeable so can use try-with-resources for auto closing in Java.

There are a large number of simplifications in the "layering" of the various layer-related implementation classes.

LayerRegistry is gone, merged into LayerFactory, which is gone, merged into LayerDriver.

In the examples/tests for layering: Changed check digit bad digit style to construct an always-invalid element instead of an assertion failure.

About Testing:

The testing is quite extensive now. There are tests for every way that a user can goof up the definition of a layer class, and there are tests for processingError, runtimeSchemaDefinitionError, and throwing an Exception from every place a user-defined Layer could cause these. Parse errors cause backtracking in all sensible cases. Nothing aborts.

Protecting every place that layer code is called from any sort of error/throw requires quite a number of try/catch blocks. This may have performance implications.

Add some fuzz testing around gzip layer

GZip layer converts IOExceptions to PE explicitly.

Fuzz tests show that gzip does a good job converting various bad data issues into IOException, so really that's the only one the GzipLayer has to deal with.

Changed boundaryMarkLayer and gzip layer and AIS (test) layer to use setProcessingErrorException to handle IOExceptions.

(and the test SimpleBombOutLayer.scala used to drive negative tests.)

Added bug-related comments to other files about bugs discovered while fuzz-testing of other non-daffodil layer schemas.

DEPRECATION/COMPATABILITY

This new Layer API is 100% incompatible with schemas or layer code from Daffodil 3.6.0 and all prior versions of Daffodil. The layer feature was just an experimental feature in Daffodil, so we reserved the right to change it.

However, it is our hope that the new Layer API introduced by this commit will prove to be stable and supportable indefinitely.

DAFFODIL-2845, DAFFODIL-2844, DAFFODIL-2825, DAFFODIL-2841, DAFFODIL-2843

mbeckerle avatar Mar 24 '24 18:03 mbeckerle

I started reviewing yesterday and got through 33 of the 163 files before resuming my project work. I have only 2 very minor comments so far but I will review the rest of the files over time. (I just noticed I was reviewing an older PR, will review this PR instead.)

tuxji avatar Mar 26 '24 17:03 tuxji

I started reviewing yesterday and got through 33 of the 163 files before resuming my project work. I have only 2 very minor comments so far but I will review the rest of the files over time. (I just noticed I was reviewing an older PR, will review this PR instead.)

Note that both PRs should be identical. I just squashed/rebased this one to get github to re-run testing for me and to get up-to-date codecov results. So you can continue on the other if that's where you started.

mbeckerle avatar Mar 26 '24 17:03 mbeckerle

+1 with a few suggestions

I wonder if layers will become so important and widely used that Daffodil code generators for the smallest possible deterministic subset of DFDL will have to implement layers? Probably not, but speculate away.

They will. Checksums and CRCs are important. Decompression?, base64 decoding? Maybe not.

Though important wire formats that uses say, mil-std-2045 headers, have a 2 big field in that header that indicates the body is compressed via one of 3 algorithms, or none. That said, I've never seen data with that header that uses the compression feature. It's just theoretically possible.

mbeckerle avatar Mar 27 '24 23:03 mbeckerle