python-opentimestamps icon indicating copy to clipboard operation
python-opentimestamps copied to clipboard

Question about data length encoding

Open etschelp opened this issue 6 years ago • 7 comments

I’m trying to understand the serialisation format of an ots file. As I understand it commitment operations are encoded like

1 Byte Tag | variable bytes data length | variable bytes operation payload

For example

F0 10 0e 9b 49 ab 01 34 50 80 35 ce 26 97 f1 ad 09 85

can be read as append (F0) the next sixteen (10) bytes.

As I can see in your code data length (10 in this case) is encoded as a signed integer stored in big endian encoding. When I compare this to the Bitcoin wiki I see that there it is the exact opposite, unsigned integer stored in little endian.

https://en.bitcoin.it/wiki/Protocol_documentation#Variable_length_integer

Of course, both ways are valid, but as your implementation is very close to the Bitcoin protocol I want to understand the reason why you are deviating here.

The same is true for the block height of the Bitcoin Attestation. But in this case, there seems to be no “official” way to encode it.

etschelp avatar Feb 06 '19 10:02 etschelp

@petertodd I bet you have good reasons. Could you explain? :)

domwoe avatar Feb 22 '19 13:02 domwoe

Sorry! Missed this.

Honestly, I think the scheme I chose was probably a mistake. Easier to just use fixed-width integers, and more likely to have compatibility across implementations. As for exactly why I chose that particular scheme, I guess it seemed efficient and rational?

For future work I'm doing I'm sticking to fixed-width, little-endian, encoding integers wherever possible. The encoding scheme Cap'n Proto has a decent rational for that kind of design, and notable also suggests using a separate simple compression scheme to get both simplicity and compactness: https://capnproto.org/

petertodd avatar Feb 23 '19 00:02 petertodd

Is there any white paper explaining:

  1. the ots format (Also, why don't you just use a simple json (potentially compessed with tar.gz or alike) where binary are encoded as base64 or in base 16 (like sha256sum I guess)? For such small files readability is preferable over file size in my opinion, especially when people may not trust the decoding algorihm.)
  2. the precise mechanism used to do the timestamp, so that I can manually check if an .ots is valid? For instance, do you create one transaction per timestamp (costly), or do you aggregate multiple timestamps into a single request?

Having these information would allow people to actually verify themself that a file is valid, without trusting the code used by the platform.

tobiasBora avatar Jul 06 '22 09:07 tobiasBora

A couple of years ago I dug through the code and documented this for myself. I'm pasting what I documented below, maybe it is of use to you. Basically the .ots file is a variable length byte stream that you need to parse bit by bit and decide how to decode the next byte block depending on byte tag and byte length.

OTS Proof File Format

The content of an ots file is a hex encoded byte stream that contains not only the hash of a claim but also a tree of commitment operations that when executed proof that the claim is linked to the merkle root of a blockchain transaction and can hence attest the claims existence at a given point of time.

File Format

The beginning of the file is static whereas the payload can vary in length and it might also change in case an timestamp is upgraded.

Field Description Size
File Header (Magic Number) \x00 O p e n T i m e s t a m p s \x00 \x00 P r o o f \x00 \xbf \x89 \xe2 \xe8 \x84 \xe8 \x92 \x94 or in pure hex \x00 \x4f \x70 \x65 \x6e \x54 \x69 \x6d \x65 \x73 \x74 \x61 \x6d \x70 \x73 \x00 \x00 \x50 \x72 \x6f \x6f \x66 \x00 \xbf \x89 \xe2 \xe8 \x84 \xe8 \x92 \x94 31 Bytes
Major Version Defaults to \x01 - which can be read as version one 1 Byte
File Hash Length of the hash is deducted from the File Hash Algorithm, e.g. 32 Bytes when using SHA256 variable
Payload Merkle Path(s) and Attestation(s) variable
File Hash Algorithm Operation is determined by the Tag of a Unary Operation, e.g. 0x08 for SHA256 1 Byte

Payload

The payload of an ots file consists of the commitment operations that are necessary to reach the merkle root of the Bitcoin block header. An operation tree always ends in an attestation. The payload always starts with an tag, a tag is a single byte marker that describes how the byte stream should be processed next. If the tag is a binary (f0, f1) or unary (02, 03, 08, 67) operation, it is followed by one or more bytes encoding an signed integer that describes the length of the data to which the operation should be applied, followed by the data (payload).

Operation block

Tag Storage length Payload
1 Byte variable variable

List of marker tags

Type Tag Result length
Binary Operations
OpAppend \xf0 variable
OpPrepend \xf1 variable
Unary Operations
OpSha1 \x02 20 Bytes
OpRIPEMD160 \x03 20 Bytes
OpSha256 \x08 32 Bytes
OpKECCAK256 \x67 32 Bytes
Attestation
BlockHeaderAttestation \x00 variable
BlockHeaderAttestation \x00FF variable

Attestation Block

If the tag is 00 or 00FF the operation tag is followed by an attestation tag.

Tag Attestation Tag Storage length Payload
1 Byte 8 Bytes variable variable

List of attestation tags

Type Attestation Tag Attestation Length
BitcoinBlockHeaderAttestation \x05 \x88 \x96 \x0d \x73 \xd7 \x19 \x01 Block height as signed integer
EthereumBlockHeaderAttestation \x30 \xfe \x80 \x87 \xb5 \xc7 \xea \xd7 Block height as signed integer
LitecoinBlockHeaderAttestation \x06 \x86 \x9a \x0d \x73 \xd7 \x1b \x45 Block height as signed integer
PendingAttestation \x83 \xdf \xe3 \x0d \x2e \xf9 \x0c \x8e Calendar Server URI Length

If there is more than one attestation in the file the attestation is followed by a unary or binary operation again.

Example - Append Operation

Tag Data Length Data
F0 10 0e 9b 49 ab 01 34 50 80 35 ce 26 97 f1 ad 09 85

Example - Bitcoin Attestation

Tag Attestation Tag Data Length Data
00 05 88 96 0D 73 D7 19 01 03 F0 9D 1A

Data Length Encoding

Bitcoin encodes Storage length in an unsigned int, to quote the Bitcoin wiki.

Wiki: Variable_length_integer

Integer can be encoded depending on the represented value to save space. Variable length integers always precede an array/vector of a type of data that may vary in length. Longer numbers are encoded in little endian.

Value Storage length Format
< 0xFD 1 uint8_t
<= 0xFFFF 3 0xFD followed by the length as uint16_t
<= 0xFFFF FFFF 4 0xFE followed by the length as uint32_t
- 9 0xFF followed by the length as uint64_t

Whereas OpenTimestamps encodes Storage length signed, therefore all bytes need to be truncated to 7 bits to get the actual values.

The same is true for the encoding of the height of the Bitcoin Attestation, but there seems to be no "official" way to do it. The only reference can be found in the book "Mastering Bitcoin":

https://www.oreilly.com/library/view/mastering-bitcoin/9781491902639/ch08.html

The first byte, 03, instructs the script execution engine to push the next three bytes onto the script stack (see Table A-1). The next three bytes, 0x443b04, are the block height encoded in little-endian format (backward, least significant byte first). Reverse the order of the bytes and the result is 0x043b44, which is 277,316 in decimal.

Raw Transaction

To link the local merkle tree to an Bitcoin transaction the raw transaction is made a part of the file structure as well. Which gives us the following structure:

Local Merkle tree - Ending in the local Merkle root - prepend raw transaction (until OP_RETURN 0x6a) - append rest of the raw transaction - hash sha 256 - hash sha 256 → which gives us the Bitcoin Transaction ID - rest of the bitcoin merkle tree - bitcoin attestation represented by the block height.

Therefore only the block height is needed to verify the timestamp.

etschelp avatar Jul 07 '22 09:07 etschelp

Whoo, thanks a lot for this precise documentation! This should be made public IMHO.

tobiasBora avatar Jul 07 '22 11:07 tobiasBora

Whoo, thanks a lot for this precise documentation! This should be made public IMHO.

Note that the documentation in the above comment has a number of errors. For instance, the order of fields in the file header is wrong: https://github.com/opentimestamps/python-opentimestamps/blob/5ac3b0642b29e2092c75d164561d37842cd50d29/opentimestamps/core/timestamp.py#L316

For the most precise documentation, you really need executable code. There's no substitute and that's why the OTS spec will continue to be in the form of code. In the future, most likely a Rust library to make types and error conditions more clear.

petertodd avatar Jul 07 '22 16:07 petertodd

Is there any white paper explaining:

1. the `ots` format (Also, why don't you just use a simple json (potentially compessed with tar.gz or alike) where binary are encoded as base64 or in base 16 (like sha256sum I guess)? For such small files readability is preferable over file size in my opinion, especially when people may not trust the decoding algorihm.)

The reason why OTS uses a binary format is because one way or another, a format needs to be parsed. A textual format simply means there are two parsing steps rather than one.

2. the precise mechanism used to do the timestamp, so that I can manually check if an `.ots` is valid? For instance, do you create one transaction per timestamp (costly), or do you aggregate multiple timestamps into a single request?

You can use the ots info command to dump information on what exactly is in an OTS proof. That information is sufficient to evaluate a fully upgraded proof yourself by hand. With the -v verbose flag, it'll even show you, step by step, what the result of each opcode is to check your work.

petertodd avatar Jul 07 '22 16:07 petertodd