python-opentimestamps
python-opentimestamps copied to clipboard
Question about data length encoding
I’m trying to understand the serialisation format of an ots file. As I understand it commitment operations are encoded like
1 Byte Tag | variable bytes data length | variable bytes operation payload
For example
F0 10 0e 9b 49 ab 01 34 50 80 35 ce 26 97 f1 ad 09 85
can be read as append (F0) the next sixteen (10) bytes.
As I can see in your code data length (10 in this case) is encoded as a signed integer stored in big endian encoding. When I compare this to the Bitcoin wiki I see that there it is the exact opposite, unsigned integer stored in little endian.
https://en.bitcoin.it/wiki/Protocol_documentation#Variable_length_integer
Of course, both ways are valid, but as your implementation is very close to the Bitcoin protocol I want to understand the reason why you are deviating here.
The same is true for the block height of the Bitcoin Attestation. But in this case, there seems to be no “official” way to encode it.
@petertodd I bet you have good reasons. Could you explain? :)
Sorry! Missed this.
Honestly, I think the scheme I chose was probably a mistake. Easier to just use fixed-width integers, and more likely to have compatibility across implementations. As for exactly why I chose that particular scheme, I guess it seemed efficient and rational?
For future work I'm doing I'm sticking to fixed-width, little-endian, encoding integers wherever possible. The encoding scheme Cap'n Proto has a decent rational for that kind of design, and notable also suggests using a separate simple compression scheme to get both simplicity and compactness: https://capnproto.org/
Is there any white paper explaining:
- the
otsformat (Also, why don't you just use a simple json (potentially compessed with tar.gz or alike) where binary are encoded as base64 or in base 16 (like sha256sum I guess)? For such small files readability is preferable over file size in my opinion, especially when people may not trust the decoding algorihm.) - the precise mechanism used to do the timestamp, so that I can manually check if an
.otsis valid? For instance, do you create one transaction per timestamp (costly), or do you aggregate multiple timestamps into a single request?
Having these information would allow people to actually verify themself that a file is valid, without trusting the code used by the platform.
A couple of years ago I dug through the code and documented this for myself. I'm pasting what I documented below, maybe it is of use to you. Basically the .ots file is a variable length byte stream that you need to parse bit by bit and decide how to decode the next byte block depending on byte tag and byte length.
OTS Proof File Format
The content of an ots file is a hex encoded byte stream that contains not only the hash of a claim but also a tree of commitment operations that when executed proof that the claim is linked to the merkle root of a blockchain transaction and can hence attest the claims existence at a given point of time.
File Format
The beginning of the file is static whereas the payload can vary in length and it might also change in case an timestamp is upgraded.
| Field | Description | Size |
|---|---|---|
| File Header (Magic Number) | \x00 O p e n T i m e s t a m p s \x00 \x00 P r o o f \x00 \xbf \x89 \xe2 \xe8 \x84 \xe8 \x92 \x94 or in pure hex \x00 \x4f \x70 \x65 \x6e \x54 \x69 \x6d \x65 \x73 \x74 \x61 \x6d \x70 \x73 \x00 \x00 \x50 \x72 \x6f \x6f \x66 \x00 \xbf \x89 \xe2 \xe8 \x84 \xe8 \x92 \x94 | 31 Bytes |
| Major Version | Defaults to \x01 - which can be read as version one | 1 Byte |
| File Hash | Length of the hash is deducted from the File Hash Algorithm, e.g. 32 Bytes when using SHA256 | variable |
| Payload | Merkle Path(s) and Attestation(s) | variable |
| File Hash Algorithm | Operation is determined by the Tag of a Unary Operation, e.g. 0x08 for SHA256 | 1 Byte |
Payload
The payload of an ots file consists of the commitment operations that are necessary to reach the merkle root of the Bitcoin block header. An operation tree always ends in an attestation. The payload always starts with an tag, a tag is a single byte marker that describes how the byte stream should be processed next. If the tag is a binary (f0, f1) or unary (02, 03, 08, 67) operation, it is followed by one or more bytes encoding an signed integer that describes the length of the data to which the operation should be applied, followed by the data (payload).
Operation block
| Tag | Storage length | Payload |
|---|---|---|
| 1 Byte | variable | variable |
List of marker tags
| Type | Tag | Result length |
|---|---|---|
| Binary Operations | ||
| OpAppend | \xf0 | variable |
| OpPrepend | \xf1 | variable |
| Unary Operations | ||
| OpSha1 | \x02 | 20 Bytes |
| OpRIPEMD160 | \x03 | 20 Bytes |
| OpSha256 | \x08 | 32 Bytes |
| OpKECCAK256 | \x67 | 32 Bytes |
| Attestation | ||
| BlockHeaderAttestation | \x00 | variable |
| BlockHeaderAttestation | \x00FF | variable |
Attestation Block
If the tag is 00 or 00FF the operation tag is followed by an attestation tag.
| Tag | Attestation Tag | Storage length | Payload |
|---|---|---|---|
| 1 Byte | 8 Bytes | variable | variable |
List of attestation tags
| Type | Attestation Tag | Attestation Length |
|---|---|---|
| BitcoinBlockHeaderAttestation | \x05 \x88 \x96 \x0d \x73 \xd7 \x19 \x01 | Block height as signed integer |
| EthereumBlockHeaderAttestation | \x30 \xfe \x80 \x87 \xb5 \xc7 \xea \xd7 | Block height as signed integer |
| LitecoinBlockHeaderAttestation | \x06 \x86 \x9a \x0d \x73 \xd7 \x1b \x45 | Block height as signed integer |
| PendingAttestation | \x83 \xdf \xe3 \x0d \x2e \xf9 \x0c \x8e | Calendar Server URI Length |
If there is more than one attestation in the file the attestation is followed by a unary or binary operation again.
Example - Append Operation
| Tag | Data Length | Data |
|---|---|---|
| F0 | 10 | 0e 9b 49 ab 01 34 50 80 35 ce 26 97 f1 ad 09 85 |
Example - Bitcoin Attestation
| Tag | Attestation Tag | Data Length | Data |
|---|---|---|---|
| 00 | 05 88 96 0D 73 D7 19 01 | 03 | F0 9D 1A |
Data Length Encoding
Bitcoin encodes Storage length in an unsigned int, to quote the Bitcoin wiki.
Integer can be encoded depending on the represented value to save space. Variable length integers always precede an array/vector of a type of data that may vary in length. Longer numbers are encoded in little endian.
| Value | Storage length | Format |
|---|---|---|
| < 0xFD | 1 | uint8_t |
| <= 0xFFFF | 3 | 0xFD followed by the length as uint16_t |
| <= 0xFFFF FFFF | 4 | 0xFE followed by the length as uint32_t |
| - | 9 | 0xFF followed by the length as uint64_t |
Whereas OpenTimestamps encodes Storage length signed, therefore all bytes need to be truncated to 7 bits to get the actual values.
The same is true for the encoding of the height of the Bitcoin Attestation, but there seems to be no "official" way to do it. The only reference can be found in the book "Mastering Bitcoin":
https://www.oreilly.com/library/view/mastering-bitcoin/9781491902639/ch08.html
The first byte, 03, instructs the script execution engine to push the next three bytes onto the script stack (see Table A-1). The next three bytes, 0x443b04, are the block height encoded in little-endian format (backward, least significant byte first). Reverse the order of the bytes and the result is 0x043b44, which is 277,316 in decimal.
Raw Transaction
To link the local merkle tree to an Bitcoin transaction the raw transaction is made a part of the file structure as well. Which gives us the following structure:
Local Merkle tree - Ending in the local Merkle root - prepend raw transaction (until OP_RETURN 0x6a) - append rest of the raw transaction - hash sha 256 - hash sha 256 → which gives us the Bitcoin Transaction ID - rest of the bitcoin merkle tree - bitcoin attestation represented by the block height.
Therefore only the block height is needed to verify the timestamp.
Whoo, thanks a lot for this precise documentation! This should be made public IMHO.
Whoo, thanks a lot for this precise documentation! This should be made public IMHO.
Note that the documentation in the above comment has a number of errors. For instance, the order of fields in the file header is wrong: https://github.com/opentimestamps/python-opentimestamps/blob/5ac3b0642b29e2092c75d164561d37842cd50d29/opentimestamps/core/timestamp.py#L316
For the most precise documentation, you really need executable code. There's no substitute and that's why the OTS spec will continue to be in the form of code. In the future, most likely a Rust library to make types and error conditions more clear.
Is there any white paper explaining:
1. the `ots` format (Also, why don't you just use a simple json (potentially compessed with tar.gz or alike) where binary are encoded as base64 or in base 16 (like sha256sum I guess)? For such small files readability is preferable over file size in my opinion, especially when people may not trust the decoding algorihm.)
The reason why OTS uses a binary format is because one way or another, a format needs to be parsed. A textual format simply means there are two parsing steps rather than one.
2. the precise mechanism used to do the timestamp, so that I can manually check if an `.ots` is valid? For instance, do you create one transaction per timestamp (costly), or do you aggregate multiple timestamps into a single request?
You can use the ots info command to dump information on what exactly is in an OTS proof. That information is sufficient to evaluate a fully upgraded proof yourself by hand. With the -v verbose flag, it'll even show you, step by step, what the result of each opcode is to check your work.