Consider encoding aggregate attribution reports with CBOR
Currently, aggregate attribution reports are encoded with JSON. Given that they contain binary data (the encrypted payloads) and large integers (~2^41, for timestamps), it may be preferable to use CBOR. This would avoid having to base64-encode the binary data and represent the large integers as strings, simplifying the format. It would also allow a more concise encoding, mainly because base64 encoding adds ~33% overhead to the large payloads (although this could be mitigated by gzipping the entire report). CBOR may also reduce processing costs on the data. CBOR is already an Internet Standard (RFC 8949), and implementations exist for many different languages.
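To make the size argument above concrete, here is a minimal sketch that encodes the same sample report both ways. It uses a hand-written encoder covering only a small subset of CBOR (unsigned integers, byte strings, text strings, maps) so that it runs on the Python standard library alone; a real deployment would presumably use an existing CBOR library. The field names and the 1024-byte stand-in payload are hypothetical and not taken from the actual report format.

```python
import base64
import json
import struct


def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (RFC 8949, section 3)."""
    if value < 24:
        return bytes([(major << 5) | value])
    if value < 2**8:
        return bytes([(major << 5) | 24]) + struct.pack(">B", value)
    if value < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    if value < 2**64:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", value)
    raise ValueError("integer too large for this sketch")


def cbor_encode(obj) -> bytes:
    """Encode a small CBOR subset: unsigned ints, bytes, str, dict."""
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        return cbor_head(0, obj)  # major type 0: unsigned integer
    if isinstance(obj, bytes):
        return cbor_head(2, len(obj)) + obj  # major type 2: byte string
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return cbor_head(3, len(data)) + data  # major type 3: text string
    if isinstance(obj, dict):
        out = cbor_head(5, len(obj))  # major type 5: map
        for key, value in obj.items():
            out += cbor_encode(key) + cbor_encode(value)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


# Hypothetical sample report: 1024 bytes standing in for an encrypted
# payload, and a timestamp of the magnitude mentioned above.
payload = bytes(range(256)) * 4
scheduled_time = 2**41

# JSON needs base64 for the bytes and a string for the large integer.
json_report = json.dumps({
    "scheduled_report_time": str(scheduled_time),
    "payload": base64.b64encode(payload).decode("ascii"),
}).encode("utf-8")

# CBOR carries both values natively.
cbor_report = cbor_encode({
    "scheduled_report_time": scheduled_time,
    "payload": payload,
})

print("JSON bytes:", len(json_report))
print("CBOR bytes:", len(cbor_report))
```

The difference comes almost entirely from the payload: base64 turns every 3 bytes into 4 characters, while the CBOR byte string adds only a short length prefix. This sketch does not account for gzip, which would narrow the gap.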
Additionally, some information in the report needs to be provided to both the helper and the reporting endpoint (e.g. scheduled report time, privacy budget key, version). Instead of duplicating this information both inside and outside the encrypted payload, we could remove it from the payload and have the reporting endpoint send it along with each report. To ensure the integrity of this information, we could use AEAD (Authenticated Encryption with Associated Data). Using CBOR to encode the report would allow a section of the report (e.g. a ‘shared info’ map containing this data) to be interpreted directly as (CBOR-encoded) bytes. Because the authentication requires byte-for-byte identical data, this may reduce the risk of incorrect rejection: parsing and re-serializing could otherwise cause small changes (e.g. JSON with different spacing).
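The byte-for-byte concern can be illustrated with a small sketch. Python's standard library has no AEAD primitive, so this uses HMAC-SHA256 purely as a stand-in for the associated-data check; it is not AEAD and not the proposed design. The key, field names, and values are hypothetical.

```python
import hashlib
import hmac
import json

KEY = b"demo-key-not-for-production"  # hypothetical key, illustration only


def tag(associated_data: bytes) -> bytes:
    """Authentication tag over the exact bytes of the associated data.

    HMAC stands in here for the integrity check an AEAD scheme would
    perform on its associated data.
    """
    return hmac.new(KEY, associated_data, hashlib.sha256).digest()


# Shared info as the sender serialized it (compact separators).
shared_info = {"scheduled_report_time": "1628500000", "version": "1"}
sent_bytes = json.dumps(shared_info, separators=(",", ":")).encode("utf-8")
sent_tag = tag(sent_bytes)

# An intermediary parses the JSON and re-serializes it with default spacing.
reserialized = json.dumps(json.loads(sent_bytes)).encode("utf-8")

# The two serializations mean the same thing...
same_meaning = json.loads(reserialized) == json.loads(sent_bytes)

# ...but only the untouched bytes still match the tag.
passes_unchanged = hmac.compare_digest(tag(sent_bytes), sent_tag)
passes_reserialized = hmac.compare_digest(tag(reserialized), sent_tag)

print("same meaning:", same_meaning)
print("unchanged bytes verify:", passes_unchanged)
print("re-serialized bytes verify:", passes_reserialized)
```

Carrying the shared info as an opaque byte string and forwarding it untouched avoids the re-serialization step entirely, which is the property the paragraph above is after.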
JSON, however, is human-readable, and therefore simpler to debug and understand. It is also a more common format, is more familiar to developers, and has more tooling support. Using CBOR would also introduce an inconsistency between the event-level and aggregate report formats.
Note that we also need to decide whether to use CBOR or JSON for the (unencrypted) payload. Similar considerations apply, but the only parties that need to read these payloads are the helper servers, which may have different performance and tooling considerations.
About CBOR
Summary of the meeting discussion
8/9 WICG meeting (see Meeting Notes 8/9)
Pros of using CBOR:
- Avoid the necessity of base64 encoding the binary and using strings for the large integers, simplifying the format
- Allows for a more concise encoding, mainly due to base64 encoding adding ~33% overhead to the large payloads (although this could be mitigated by gzipping the entire report).
- Reduces processing costs on the data
- From @csharrison: One benefit of using a binary encoding is that we’ll be able to remove a possible class of bugs. In the current design there is a bunch of data that is replicated 3 times in the report.
Cons of using CBOR:
- Developer experience: CBOR needs specific tooling and libraries
Open questions / thoughts:
- By several on the call (see Meeting Notes 8/9): how about starting simple (JSON) and then as the system scales / if concerns arise around processing costs, consider CBOR?
- By @johnivdel: We could keep things simple for the ad-tech, but we still need to decide what the helper sees (the encoding of the encrypted payload), i.e. maybe serve different report formats for different use cases / recipients?
My 2c (@maudnals)
Cons:
- +1 that CBOR adds complexity and in turn risks making the barrier to experimentation higher; IMO CBOR should be picked only if the costs/encoding wins are undoubtedly worth it (and maybe they are!)
- This also means that browser tooling will have to be implemented early
Other observations (neutral):
- Data point: CBOR is also used in the Trust Tokens API and in the FIDO spec
- A number of implementations are available: https://cbor.io/impls.html
About AEAD
See Meeting Notes 8/9
To better understand the processing cost differences, we ran a simple benchmark comparing CBOR and JSON for a sample report. We found similar serialization times for JSON and CBOR (both ~0.5 ms), but CBOR deserialization was almost 4x faster than JSON (~0.3 ms vs ~1.2 ms).