Add support for ingesting/synthesizing custom binary data file
Required Functionality
While binary data can come in many shapes and forms, the particular format I'm after is unencoded, uncompressed binary data in which fields of different types are packed next to each other. The file begins with a header and ends with a footer. In between is the payload, which consists of the same entry layout repeated many times.
Here is a pictorial of such a format:
Header |
---|
Entry 1 |
Entry 2 |
... |
Entry N |
Footer |
Each entry is of fixed size, and can have multiple fields of different data types occupying a different amount of bytes. Example:
| timestamp (8 bytes) | my_u32 (4 bytes) | my_bool (1 byte) | my_string (24 bytes) |
|---|---|---|---|
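To make the layout concrete, here is a minimal Rust sketch that derives each field's inclusive byte range from the field sizes in the example entry. `FieldRange` and `layout` are hypothetical names used only for illustration; they are not part of synth.

```rust
/// Inclusive byte range of one field inside a fixed-size entry.
/// (Hypothetical helper type, not part of synth.)
#[derive(Debug, PartialEq)]
pub struct FieldRange {
    pub name: &'static str,
    pub byte_start: usize,
    pub byte_end: usize, // inclusive
}

/// Packs the fields back to back, in order, and returns their byte ranges
/// together with the total entry size in bytes.
pub fn layout(fields: &[(&'static str, usize)]) -> (Vec<FieldRange>, usize) {
    let mut offset = 0;
    let mut ranges = Vec::with_capacity(fields.len());
    for &(name, size) in fields {
        assert!(size > 0, "a field must occupy at least one byte");
        ranges.push(FieldRange {
            name,
            byte_start: offset,
            byte_end: offset + size - 1,
        });
        offset += size;
    }
    (ranges, offset)
}
```

Calling `layout` with the four example fields (8, 4, 1 and 24 bytes) gives a total entry size of 37 bytes, with `my_string` occupying bytes 13 through 36.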
Proposed Solution
The user will be required to supply additional schema info to tell synth how to parse the fields. A possible format may look something like this:
"binary_schema": {
"entry_size_bytes": 37,
"is_little_endian": true,
"payload_start_offset_bytes": 4096,
"payload_end_offset_bytes": 1024, -> this will be bytes from the end of the file
"fields": [
{
"name": "timestamp",
"type": "u64",
"byte_start": 0,
"byte_end": 7
},
{
"name": "my_u32",
"type": "u32",
"byte_start": 8,
"byte_end": 11
},
{
"name": "my_bool",
"type": "bool",
"byte_start": 12,
"byte_end": 12
},
{
"name": "my_string",
"type": "string",
"byte_start": 13,
"byte_end": 36
}
]
}
Such a binary schema can also be used to define extensions in the future, like encoding, var-length data etc.
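As a rough illustration of what parsing with this schema could look like, the sketch below decodes a single 37-byte little-endian entry using the byte ranges listed above. The `Entry` struct and `decode_entry` function are hypothetical, and the string handling (NUL-padded, lossy UTF-8) is an assumption; the proposal does not specify how `my_string` is padded or encoded.

```rust
use std::convert::TryInto;

/// One decoded entry, mirroring the example schema.
/// (Hypothetical type, not part of synth.)
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub timestamp: u64,
    pub my_u32: u32,
    pub my_bool: bool,
    pub my_string: String,
}

/// Matches "entry_size_bytes" in the example schema.
pub const ENTRY_SIZE: usize = 37;

/// Decodes one little-endian entry; returns None if the slice is not
/// exactly ENTRY_SIZE bytes long.
pub fn decode_entry(bytes: &[u8]) -> Option<Entry> {
    if bytes.len() != ENTRY_SIZE {
        return None;
    }
    // Bytes 0..=7: u64 timestamp.
    let timestamp = u64::from_le_bytes(bytes[0..8].try_into().ok()?);
    // Bytes 8..=11: u32.
    let my_u32 = u32::from_le_bytes(bytes[8..12].try_into().ok()?);
    // Byte 12: bool, any non-zero byte is treated as true.
    let my_bool = bytes[12] != 0;
    // Bytes 13..=36: string. Assumption: NUL-padded, decoded lossily as UTF-8.
    let raw = &bytes[13..37];
    let end = raw.iter().position(|&b| b == 0).unwrap_or(raw.len());
    let my_string = String::from_utf8_lossy(&raw[..end]).into_owned();
    Some(Entry {
        timestamp,
        my_u32,
        my_bool,
        my_string,
    })
}
```

A schema-driven implementation would read the offsets and types from `"fields"` instead of hard-coding them; the hard-coded version just shows the byte arithmetic.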
Synth should be able to take such a schema and data file, infer from it, and output a variant of the fields. A nice to have would be to take the original data file's header and footer, and stuff it into the generated file as is.
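The header and footer pass-through could be sketched roughly as follows: split the input into header, fixed-size entries, and footer using the two offsets from the schema, then reassemble the output around newly generated entries. `Split`, `split_file`, and `assemble` are hypothetical names, and rejecting a payload that is not a whole number of entries is an assumption about the desired behavior.

```rust
/// Borrowed view of a file split into its three regions.
/// (Hypothetical type, not part of synth.)
pub struct Split<'a> {
    pub header: &'a [u8],
    pub entries: Vec<&'a [u8]>,
    pub footer: &'a [u8],
}

/// Splits `data` using the payload start offset (from the start of the file)
/// and the footer length (counted from the end of the file).
/// Returns None if the offsets do not fit or the payload is not a whole
/// number of entries.
pub fn split_file<'a>(
    data: &'a [u8],
    payload_start: usize,
    footer_len: usize,
    entry_size: usize,
) -> Option<Split<'a>> {
    if entry_size == 0 {
        return None;
    }
    let payload_end = data.len().checked_sub(footer_len)?;
    if payload_start > payload_end {
        return None;
    }
    let payload = &data[payload_start..payload_end];
    if payload.len() % entry_size != 0 {
        return None;
    }
    Some(Split {
        header: &data[..payload_start],
        entries: payload.chunks_exact(entry_size).collect(),
        footer: &data[payload_end..],
    })
}

/// Writes the original header, the new entries, and the original footer
/// back to back into one buffer.
pub fn assemble(header: &[u8], entries: &[Vec<u8>], footer: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(header.len() + footer.len());
    out.extend_from_slice(header);
    for entry in entries {
        out.extend_from_slice(entry);
    }
    out.extend_from_slice(footer);
    out
}
```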
Use case
The use case pertains to protocol data files used in the storage industry. NVMe is one example. Other storage and networking protocols typically follow such a format to some degree as well.
I just talked to a former colleague who works in statistical data processing. For interop reasons they work with binary files containing the data as fixed-width rows of little-endian 16-bit integers; sometimes 32 bit integers for larger value ranges.
They could also make use of such a feature.
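For the fixed-width integer rows described here, a minimal decoding sketch could look like this. The function name `decode_i16_rows` is hypothetical, and treating the values as signed (`i16`) is an assumption, since the description only says "16-bit integers".

```rust
/// Decodes a buffer of fixed-width rows, each holding `columns`
/// little-endian 16-bit integers. Returns None if `columns` is zero or the
/// buffer is not a whole number of rows.
/// Assumption: the values are signed.
pub fn decode_i16_rows(data: &[u8], columns: usize) -> Option<Vec<Vec<i16>>> {
    let row_bytes = columns.checked_mul(2)?;
    if row_bytes == 0 || data.len() % row_bytes != 0 {
        return None;
    }
    let rows = data
        .chunks_exact(row_bytes)
        .map(|row| {
            row.chunks_exact(2)
                .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
                .collect()
        })
        .collect();
    Some(rows)
}
```

The 32-bit variant would be the same shape with 4-byte chunks and `i32::from_le_bytes`.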
@fretz12 thanks for this.
This is a really interesting use case that requires some additional core features to be introduced to synth.
Some notes:
1. We probably need to add a new variant to `synth_core::graph::Value` which represents binary data. We could use the bytes crate.
2. How would this data be serialized? Given that synth currently outputs (primarily) JSON, would we need to run this binary data through some encoding? Or does your use case require it to be written directly to a file?
@christoshadjiaslanis -
Regarding 2, I only need the synthesized data to be written to a file.
Bear with me if I'm making naive suggestions, but I'm thinking there would be a `BinaryFileExportStrategy` (`impl ExportStrategy`) that would still extract the synthesized JSON `Value`. Unlike the other export strategies, which insert the synthesized fields from `Value` into a DB, it would look at the user-supplied "binary_schema" I mentioned above and serialize the fields out of `Value` into a file in the correct order.
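The serialization step of such an export strategy might be sketched as below. This does not use synth's real `Value` or `ExportStrategy` types; `FieldValue`, `encode_field`, and `encode_entry` are hypothetical stand-ins, and NUL-padding strings to the field width is an assumption.

```rust
/// Hypothetical stand-in for a synthesized field value.
/// This is NOT synth's real `Value` type.
pub enum FieldValue {
    U64(u64),
    U32(u32),
    Bool(bool),
    Str(String),
}

/// Appends one field to `out` as little-endian bytes occupying exactly
/// `width` bytes. Strings are NUL-padded to the field width (assumption).
pub fn encode_field(out: &mut Vec<u8>, value: &FieldValue, width: usize) -> Result<(), String> {
    match value {
        FieldValue::U64(v) if width == 8 => out.extend_from_slice(&v.to_le_bytes()),
        FieldValue::U32(v) if width == 4 => out.extend_from_slice(&v.to_le_bytes()),
        FieldValue::Bool(v) if width == 1 => out.push(*v as u8),
        FieldValue::Str(s) if s.len() <= width => {
            out.extend_from_slice(s.as_bytes());
            out.resize(out.len() + (width - s.len()), 0);
        }
        _ => return Err(format!("value does not fit a {}-byte field", width)),
    }
    Ok(())
}

/// Serializes the fields of one entry in schema order.
pub fn encode_entry(fields: &[(FieldValue, usize)]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for (value, width) in fields {
        encode_field(&mut out, value, *width)?;
    }
    Ok(out)
}
```

Writing the file would then amount to emitting the encoded entries one after another, for example through `std::fs::File` and `std::io::Write::write_all`.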
Though I think our binary format is fairly simple, binary formats in general vary wildly. One thought for hard-to-customize things like SerDes: perhaps use a plugin interface where users write their own (de)serializers and build a .so or .dylib out of them, and synth loads those dynamic libs to execute the serdes, with APIs binding binary -> `Value` and `Value` -> binary.
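One possible shape for such a plugin interface, sketched in-process with a trait object rather than a dynamically loaded library, is shown below. Everything here is hypothetical: `BinaryCodec`, `Record`, `U16Pair`, and `CodecRegistry` are illustrative names, `Record` is a simplified stand-in for synth's `Value`, and loading from a .so or .dylib is left out entirely.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a synthesized record: (field name, value) pairs.
/// This is NOT synth's real `Value` type.
pub type Record = Vec<(String, u64)>;

/// Hypothetical plugin interface: binary -> record and record -> binary.
pub trait BinaryCodec {
    fn decode(&self, bytes: &[u8]) -> Option<Record>;
    fn encode(&self, record: &Record) -> Option<Vec<u8>>;
}

/// Example user-defined codec: two little-endian u16 fields, "a" and "b".
pub struct U16Pair;

impl BinaryCodec for U16Pair {
    fn decode(&self, bytes: &[u8]) -> Option<Record> {
        if bytes.len() != 4 {
            return None;
        }
        let a = u16::from_le_bytes([bytes[0], bytes[1]]);
        let b = u16::from_le_bytes([bytes[2], bytes[3]]);
        Some(vec![
            ("a".to_string(), u64::from(a)),
            ("b".to_string(), u64::from(b)),
        ])
    }

    fn encode(&self, record: &Record) -> Option<Vec<u8>> {
        if record.len() != 2 {
            return None;
        }
        let mut out = Vec::with_capacity(4);
        for (_, value) in record {
            // Reject values that do not fit into 16 bits.
            if *value > 0xFFFF {
                return None;
            }
            out.extend_from_slice(&(*value as u16).to_le_bytes());
        }
        Some(out)
    }
}

/// Looks codecs up by name, the way a host might after registering plugins.
pub struct CodecRegistry {
    codecs: HashMap<String, Box<dyn BinaryCodec>>,
}

impl CodecRegistry {
    pub fn new() -> Self {
        CodecRegistry { codecs: HashMap::new() }
    }

    pub fn register(&mut self, name: &str, codec: Box<dyn BinaryCodec>) {
        self.codecs.insert(name.to_string(), codec);
    }

    pub fn get(&self, name: &str) -> Option<&dyn BinaryCodec> {
        self.codecs.get(name).map(|b| b.as_ref())
    }
}
```

A dynamic-library version would need a stable ABI at the boundary, since Rust trait objects are not guaranteed to be ABI-compatible across separately compiled libraries.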
@llogiq - thx!
Let's try that again:
@all-contributors please add @fretz12 for awesome ideas.
@fretz12 yeah I think that for binary serializers we need user-defined serializers.
It's not a fully formed thought yet, but roughly speaking we have our existing schema which defines how data is generated, and a second piece of config which needs to dictate how that data is mapped to a binary serialization format.
I'm not sure how this would work exactly. Perhaps we can create an RFC for this and try to design something that makes sense.