simdjson icon indicating copy to clipboard operation
simdjson copied to clipboard

Discussion regarding a new 16-byte tape (comments invited)

Open lemire opened this issue 5 years ago • 11 comments
trafficstars

Here is the first sketch of a proposal to have a (wider and flatter) 16-byte tape up from our current hybrid 8-byte tape. (This is issue https://github.com/lemire/simdjson/issues/361) In this new tape, all nodes would be represented with a single 16-byte value. Except for scopes (arrays and objects) that would contain two 16-byte values. The tape could be iterated efficiently in either order (forward and backward).

The tape would have an accompanying string tape, as currently, but the string tape would contain only UTF-8 string content, the strings would be appended one after the other. Currently, the string tape contains 32-bit length-prefixed strings. In the new model, the string length would be part of the tape.

Having exactly 16-byte values all the way ensures that we could navigate the document backward and forward without any problem. (Issue https://github.com/lemire/simdjson/issues/292 ) It can also improve performance when navigating in forward order since there is no possible misprediction when loading the next tape element.

We would support arbitrarily large JSON documents containing arbitrarily large strings.

The new string tape would be made of just valid UTF-8 string content, so we could do UTF-8 validation there, instead of doing it in stage 1 as is done currently. This could potentially improve the performance drastically. This is issue https://github.com/lemire/simdjson/issues/185

General formal of the tape elements

We would reserve a byte out of each tape element for an ASCII character identifying the nature of the node 't', 'f', 'n', 'l', 'u', 'd', '"', '{', '}', '[', ']' ,'r' (where 'r' stands for root)

Simple JSON values

Simple JSON nodes are represented with one tape element containing the type 'n', 't', 'f' and nothing else. This is massively wasteful of memory since we use one byte out of 16. But it is not clear what else to do without wasting performance.

Both 64-bit ARM and x64 can turn a 16-byte write into a single instruction, if the write to the tape is a constant value.

Integer and Double values

Numbers are already represented as 16-byte values on the tape, so that they would not change.

Integer values are represented as two 64-bit tape elements:

  • The 64-bit value ('l' << 56) followed by the 64-bit integer value litterally. Integer values are assumed to be signed 64-bit values, using two's complement notation.
  • The 64-bit value ('u' << 56) followed by the 64-bit integer value litterally. Integer values are assumed to be unsigned 64-bit values.

Float values are represented as two 64-bit tape elements:

  • The 64-bit value ('d' << 56) followed by the 64-bit double value litterally in standard IEEE 754 notation.

We have 7 bytes of free space, but it is not clear what to do with it.

There are demands to support big integers and arbitrary-precision values. We could engineer a special secondary tape where such values are coded, and we could refer to this off-tape value.

Performance consideration: We store numbers of the main tape because we believe that locality of reference is helpful for performance.

Root node

Each JSON document will have two special tape elements representing a root node, one at the beginning and one at the end.

  • The first 16-byte tape element contains the marker value 'r' and the location on the tape of the last root element as a 64-bit value.
  • The last 16-byte tape element contains the value 'r'.

All of the parsed document is located between these two tape elements.

It might be possible to use the extra space leftover to store other useful information.

Hint: We can read the first tape element to determine the length of the tape.

Strings

We store string values using UTF-8 encoding with null termination on a separate tape. A string value is represented on the main tape as the 16-byte tape element with the marker '"' and a 64-bit pointer to the string tape. We use the remaining 7 bytes to store the length of the string. Thus we would limit strings within JSON documents to 2^58 bytes, but that's truly enormous.

Short strings (proposal)

We would introduce a "short string" type. It would be made of short strings, containing fewer than 16 bytes and not null-terminated. We would use null padding.

The benefit of these short strings is that they would fit in the main tape, so that they could drastically improve the performance when querying the documents. Many JSON documents are filled with short strings.

Arrays

JSON arrays are represented using two 16-byte tape elements.

  • The first 16-byte tape element contains the value marker '['. We store the number of elements in the array. We store a pointer to second 16-byte tape element.
  • The second 16-byte tape element contains the value ']'. We store the number of elements in the array. e a pointer to first 16-byte tape element.

All the content of the array is located between these two tape elements, including arrays and objects.

Knowing up front how many elements are in the array would solve issue https://github.com/lemire/simdjson/issues/308

Explanation: we need a first and last tape element if we are to be able to navigate the document in backward order.

Performance consideration: We can skip the content of an array entirely by accessing the first tape element, reading the payload and moving to the corresponding index on the tape.

Objects

JSON objects are represented using two 16-byte tape elements.

  • The first 16-byte tape element contains the marker value '{'. We store the number of keys in the object. We store a pointer to second 16-byte tape element.
  • The second 16-byte t tape element contains the marker value }'. We store the number of keys in the object. We store a pointer to first 16-byte tape element.

In-between these two tape elements, we alternate between key (which must be strings) and values. A value could be an object or an array.

All the content of the object is located between these two tape elements, including arrays and objects.

Performance consideration: We can skip the content of an object entirely by accessing the first tape element, reading the payload and moving to the corresponding index on the tape.

Trade-offs

A 16-byte tape would be able to support very large files whereas our 8-byte tape is limited to 4GB files. However, somewhat ironically, a 16-byte tape might use more memory and thus make page allocation more expensive, a bottleneck when processing large files (Doubling the size of the tape would make page allocation more expensive and could make the processing of large files much more expensive: https://github.com/lemire/simdjson/pull/443).

On some systems, memory allocation runs far slower than we can parse (e.g., 1.4GB/s), especially when using small pages.

cc @jkeiser

lemire avatar Dec 31 '19 16:12 lemire

Backwards and forwards iteration is a real thing, and seems worthwhile. Local 15-byte strings would be awesome!

One thought: if we broke up the storage across two separate tapes--each with the same size and number of elements, and each element in the same position--it would be more cache-friendly, since you're not typically going both forward and backward through the tape. Unfortunately, that isn't compatible with 15-byte strings (max "short string" would be 7).

jkeiser avatar Jan 03 '20 23:01 jkeiser

if we broke up the storage across two separate tapes

I am concerned that we may pay a price during parsing and tape generation.

lemire avatar Jan 04 '20 03:01 lemire

Before moving this forward with this deep redesign, I would like to know how to do stage 1 only on small subblocks... Indeed, this wide tape would allow us to process 8 GB files if needed, but it does not make sense if stage 1 has to be over the whole file. I have an ugly attempt, and Jérémie did some work, but this needs to be revisited in light of @jkeiser new engineering.

lemire avatar Jan 04 '20 17:01 lemire

+1 :)

  1. The advice if I may: be very careful not to get carried away with C++20 claims about UTF8.
  2. the "tape" is in essence simdjson json container. making it such + the iterator into behaving like C++17 container and iterator, might be very welcome by users.
    1. this is all in the memory right?
  3. I still can not connect the design and the implementation, a new design diagram would help? Perhaps I should do it.
  4. Before moving forward. Ideally, I would like to have namespace simdjson::tape with all the abstractions related to the Tape concept. I would like to do that too.

DBJDBJ avatar Jan 09 '20 14:01 DBJDBJ

Doubling the size of the tape would make page allocation more expensive and could make the processing of large files much more expensive: https://github.com/lemire/simdjson/pull/443

lemire avatar Jan 14 '20 15:01 lemire

I'm mentally moving towards either a one-pass compilation, or an n-pass compilation that compiles a c chunk at a time.

jkeiser avatar Jan 17 '20 01:01 jkeiser

Can you please define the "chunk" in this context?

DBJDBJ avatar Jan 17 '20 15:01 DBJDBJ

In general I just mean "a contiguous series of bytes from the input."

jkeiser avatar Jan 17 '20 23:01 jkeiser

@jkeiser I meant logically.

JSON is a hierarchical structure. Is one chunk, one node or 1..N nodes? How big is the N? It seems the subtree is the chunk ... inside a certain size ...? Etc ...

DBJDBJ avatar Jan 18 '20 00:01 DBJDBJ

Chunk of text. Json is a hierarchical structure, but there's nothing saying you can't stop in the middle.

I'm being vague on purpose because I don't really know the numbers. The main point is to save enough parser state to pick up where you left off.

The size would be small enough to fit nicely in cache but big enough to get some nice warm loops running. I don't really know how large until it's actually in practice.

jkeiser avatar Jan 18 '20 08:01 jkeiser

It may be useful to limit short strings to 14 bytes or less, with no embedded nulls (not just terminating nulls), so that all "short strings" are guaranteed to be null-terminated. Lets people use the usual suspects like strlen if they need to, since we're not storing the length directly anymore.

jkeiser avatar Feb 05 '20 22:02 jkeiser