neon icon indicating copy to clipboard operation
neon copied to clipboard

Include layer file sizes in `index_part.json`

Open hlinnaka opened this issue 3 years ago • 2 comments

Once we implement on-demand download of layer files, the "physical size" of a project that we report will no longer include remote layers that are not present in the pageserver. But the console would need to get the total size of the project on S3.

Include the size of each layer file in the index.json file, so that we know the size of remote layers in the pageserver. Must be careful to make the change in a backwards-compatible fashion, so that the new pageserver version can still read old index_part.json files that are missing the layer sizes.

hlinnaka avatar Sep 15 '22 14:09 hlinnaka

If I understood the proposal right, we're adding a new file? Could name it timeline.json, for better clarity. Might also extend the name with version number, or similar metadata.

We can consider adding two more things into the updated json file:

  • versioning info, that's currently missing (metadata is embedded inside as bytes, so that version has to be taken into account or the approach reworked)
  • file checksums (at least, as a placeholder to do the format changes once) to later fill with the values, compatible with what we have in AWS S3

SomeoneToIgnore avatar Sep 15 '22 17:09 SomeoneToIgnore

If I understood the proposal right, we're adding a new file?

I was thinking of just adding new fields to the the old file.

Could name it timeline.json, for better clarity.

Sure, we could rename it. Just have to deal with backwards-compatibility.

Might also extend the name with version number, or similar metadata.

We can consider adding two more things into the updated json file:

  • versioning info, that's currently missing (metadata is embedded inside as bytes, so that version has to be taken into account or the approach reworked)

  • file checksums (at least, as a placeholder to do the format changes once) to later fill with the values, compatible with what we have in AWS S3

+1

hlinnaka avatar Sep 19 '22 20:09 hlinnaka

versioning info, that's currently missing (metadata is embedded inside as bytes, so that version has to be taken into account or the approach reworked)

During the PR I failed to consider this at all and was thinking the two were allowed to evolve separately. If they need to be in sync, the PR can be reworked to receive the version tag from the TimelineMetadataHeader instead of index_json.json "internal tagging", requiring trying out different paths to read the TimelineMetadata.

file checksums (at least, as a placeholder to do the format changes once) to later fill with the values, compatible with what we have in AWS S3)

I understand that this was supposed to be SHA-256 but as far as I can see, there's none at the moment, looking at libs/remote_storage. This is probably because trailer headers are not fully supported for http/1.1 nor does rusoto support sending them?

EDIT: confirmed with https://github.com/neondatabase/neon/issues/987#issuecomment-1146778871

koivunej avatar Oct 05 '22 10:10 koivunej

Having now taken 1.5 stabs at this and starting to understand the context in code, I think the best way forwards would be to lean on the softness of JSON for future index_part.json, but I still haven't considered renaming the file. The current name is confusing but I am starting to get used to it :)

The softness of JSON, as in we can add fields without any trouble. To add layer sizes into index_part.json, the most effortless non-destructive add would be to introduce a new field in IndexPart, like layer_metadata: HashMap<RelativePath, IndexLayerMetadata>. This field would hold metadata for either timeline_layers or missing_layers. Schema for the two fields would not be changed. (On RemoteIndex similar change would be to change HashSet<PathBuf> fields into HashMap<PathBuf, LayerFileMetadata>.)

Because this is just adding a new non-critical field into the document, rolling back to previous release is fine. If rolled back version does updates to the index_part.json it will lose all of the additional information, which is acceptable at least in the context of this issue (layer sizes).

I intend to close #2561 and open up a new one to illustrate the above idea once I'm done with testing it. The ergonomics with the #2561 approach are quite poor for no real gain I can think of now.

Alternative to above would be a less soft approach with two PRs:

  1. allow reading {timeline_layers,missing_layers} from current [path, ...] as well as future {path: { metadata }, ...} schema
  2. start using the {path: { metadata }, ...} schema

Assuming a deploy for each of the two PRs, it would be possible to rollback from 2nd to 1st, but not 2nd to original binary. Another downside is adding more custom deserialization code, which will be tough to get rid of.

Neither of the approaches really need the version field, and it is mostly informative for debugging. It could also be the $GIT_VERSION, or perhaps "created_by": $GIT_VERSION and "updated_by": $GIT_VERSION could be maintained.

koivunej avatar Oct 07 '22 13:10 koivunej

Oops, reopening as this was the only issue with checksums.

koivunej avatar Oct 17 '22 09:10 koivunej

Oops, reopening as this was the only issue with checksums.

Can you create a new issue for that, please? This one is complete, we have the sizes in 'index_part.json` now, right?

hlinnaka avatar Oct 21 '22 17:10 hlinnaka