
Improve JSON Loading Performance

apainintheneck opened this issue 6 months ago • 45 comments

Verification

  • [X] This issue's title and/or description do not reference a single formula e.g. brew install wget. If they do, open an issue at https://github.com/Homebrew/homebrew-core/issues/new/choose instead.

Provide a detailed description of the proposed feature

We should try to improve the loading times for JSON data. This is especially relevant because the JSON API has large files that get read into memory whenever a formula or cask is loaded from the API.

$ cd $(brew --cache)
$ ls -FlaSh api/*.json | choose -5:-1
26M Dec 29 21:59 api/formula.jws.json
6.4M Dec 29 21:59 api/cask.jws.json
2.3K Dec 29 21:59 api/cask_tap_migrations.jws.json
2.0K Dec 29 21:59 api/formula_tap_migrations.jws.json

Off the top of my head, I see a few ways that we could potentially improve performance.

  1. Move to a faster JSON library

After @Bo98's excellent work, we're living in the future and everybody's running Ruby 3.1. A side effect of this is that we could potentially include native libraries and ship them with the Ruby 3.1 bottle. This should allow us to explore using libraries like oj which have better JSON parsing performance.
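As a rough illustration of how such a comparison could be run against the real payload (a sketch only; it assumes the oj gem is installed and that the API cache is already populated, and the numbers will vary by machine):

require "benchmark"
require "json"
require "oj"

cache = ENV["HOMEBREW_CACHE"] || `brew --cache`.strip
data = File.read("#{cache}/api/formula.jws.json")

Benchmark.bm(6) do |x|
  x.report("json") { JSON.parse(data) } # stdlib json gem
  x.report("oj")   { Oj.load(data) }    # native-extension parser
end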

  2. Reduce the size of the JSON files

It's easier to parse smaller files, and smaller files mean that fewer objects get allocated during parsing. Currently, we don't purge nil values from hashes before creating the JSON files. This leads to a large number of extra hash keys, which are all strings that have to be allocated. It also increases the size of the files, which slows down parsing and means that more data has to be shipped across the network.

$ grep -Eo ':null[{,]'  api/formula.jws.json | wc -l
   93785
$ grep -Eo ':null[{,]'  api/cask.jws.json | wc -l
   46265

We're allocating an extra 140k hash keys by not eliminating null values from hashes.

Note: I looked into this a few months ago to see if it would make a big difference from a network file size point of view but it only seemed to change the file size by 100s of kilobytes after getting compressed so I don't expect to see a large benefit there.
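For illustration, pruning could happen at generation time with something like the following (deep_compact is a hypothetical helper, not existing brew code):

# Recursively drop nil values so consumers never allocate the corresponding keys.
def deep_compact(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, nested), out|
      compacted = deep_compact(nested)
      out[key] = compacted unless compacted.nil?
    end
  when Array
    value.map { |nested| deep_compact(nested) }
  else
    value
  end
end

slim_json = JSON.generate(deep_compact(formulae_payload)) # formulae_payload assumed in scope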

  3. Stop parsing the API JSON files twice

Currently we essentially parse both of the API JSON files twice: the first pass validates the JSON web signature and the second parses the content. Beyond removing duplicate work, combining these two passes somehow would mean not having to allocate additional large strings, which end up being around the same size as the JSON files since the signature header takes up only a small part of the file anyway.

It would also potentially make the JSON a bit smaller since we'd be able to avoid escaping all of the quotes in the stringified JSON.

JSON API payload string:

{"payload":"[\n  {\"name\":\"a2ps\",\"full_name\":\"a2ps\",\"tap\":\"homebrew/core\",...

Escaped quotes:

$ grep -oF '\"'  api/cask.jws.json | head -n 5
\"
\"
\"
\"
\"
$ grep -oF '\"'  api/formula.jws.json | wc -l
 1970788
$ grep -oF '\"'  api/cask.jws.json | wc -l
  537182
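One conceptual way to avoid both the double parse and the escaped quotes would be a detached signature alongside a plain JSON payload; the sketch below is purely illustrative (the file names, key handling and layout are assumptions, not the current format):

require "json"
require "openssl"

payload = File.binread("formula.json")       # plain, unescaped JSON payload
signature = File.binread("formula.json.sig") # detached signature (hypothetical layout)

public_key = OpenSSL::PKey.read(File.read("homebrew_api.pub.pem")) # hypothetical key file
unless public_key.verify(OpenSSL::Digest.new("SHA256"), signature, payload)
  raise "API payload signature verification failed"
end

formulae = JSON.parse(payload) # the payload is parsed exactly once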

What is the motivation for the feature?

Currently there is a fixed cost whenever the API JSON gets loaded into memory that affects all commands that load formulae or casks from the API. On my M1 MacBook it takes around 3/10ths of a second to load the formulae API JSON by itself.

How will the feature be relevant to at least 90% of Homebrew users?

It will make most commands snappier because most users use the JSON API. This is an especially big deal with the work to move brew desc to use the JSON API in https://github.com/Homebrew/brew/issues/16237.

What alternatives to the feature have been considered?

Doing nothing...

apainintheneck avatar Dec 30 '23 06:12 apainintheneck

The problem is somewhat twofold. First is the parser speed, as you mention, but second is the GC impact. The end result of the JSON parse doesn't just add a few thousand objects; it has a footprint of hundreds of thousands of objects. This makes subsequent major GC sweeps quite slow. We can tweak the GC a bit to reduce the frequency of sweeps, but ultimately having that large an object footprint is not ideal.

  1. Move to a faster JSON library

I've looked at oj a bit but don't have numbers yet. It will indeed likely improve parse speed. Not sure about the object problem though.

We do have the problem of native gems here. Realistically it's difficult to enforce native gem installs as there are a million things that can go wrong, as we saw when trying to enable Bootsnap for all and as I continue to sometimes see in the occasional discussion post (e.g. some people have OOM issues running rm during a compile for some reason?).

The thing with alternative parsers, however, particularly since the default JSON gem isn't that bad, is that they can be optional: we can bundle one with Portable Ruby, which 99% of users use, then use it if available and fall back if not. This is trivial to do.
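A minimal sketch of that optional-dependency pattern, assuming Oj is the bundled parser:

begin
  require "oj"
  Oj.mimic_JSON # route the JSON module's parse/generate calls through Oj
rescue LoadError
  require "json" # stdlib fallback when the native gem isn't available
end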

  3. Stop parsing the API JSON files twice

There is definitely a difference in speed between the two passes. The first pass is much cheaper than the second in both time taken and objects created. With that said I do acknowledge it's not the most ideal.

The payload being a string was intentional given we were following the JWS standard which defines the payload to be the string. Technically we could roll our own format and not follow the standard that closely. I rolled out the feature very quickly (developed largely in one day) so it was useful at the time to point to the standard in PRs as proof of something that is used and tested in real world applications rather than being something I pulled out of a hat.

It can definitely be compressed to remove spaces however. At the time it was awkward due to Jekyll but this has been made easier since now we control the build more and use v4.

A bigger change is potentially even something not-JSON. In particular, something that doesn't require parsing the whole file by allowing us to seek to the relevant section for a particular formula. Such a format may solve most problems, as parsing only a part of the data has the potential to be cheaper in both time taken and in the number of objects created. This isn't necessarily easy, and there are several other factors such as whether it's possible to do so with reasonable speed in pure Ruby or if a native gem is necessary. But it's at least an option that I've reflected on a bit.

Bo98 avatar Dec 30 '23 12:12 Bo98

The problem is somewhat twofold. First is the parser speed, as you mention, but second is the GC impact. The end result of the JSON parse doesn't just add a few thousand objects; it has a footprint of hundreds of thousands of objects. This makes subsequent major GC sweeps quite slow. We can tweak the GC a bit to reduce the frequency of sweeps, but ultimately having that large an object footprint is not ideal.

I hadn't realized this but in hindsight it seems obvious. That reminds me that it'd be nice to have memory profiling get added to brew prof at some point.

The thing with alternative parsers, however, particularly since the default JSON gem isn't that bad, is that they can be optional: we can bundle one with Portable Ruby, which 99% of users use, then use it if available and fall back if not. This is trivial to do.

That makes sense to me though I'm not sure how complex that bundling would be. At least oj seems to be compatible with the default json library.

There is definitely a difference in speed between the two passes. The first pass is much cheaper than the second in both time taken and objects created. With that said I do acknowledge it's not the most ideal.

Very true. This shows up clearly in the profiling output.

(Screenshot: profiling output, 2023-12-30)

A bigger change is potentially even something not-JSON. In particular, something that doesn't require parsing the whole file by allowing us to seek to the relevant section for a particular formula. Such a format may solve most problems, as parsing only a part of the data has the potential to be cheaper in both time taken and in the number of objects created. This isn't necessarily easy, and there are several other factors such as whether it's possible to do so with reasonable speed in pure Ruby or if a native gem is necessary. But it's at least an option that I've reflected on a bit.

Do you have an existing serialization format in mind here? I can't think of any off-the-shelf format that fits those characteristics though I agree that it would obviously be nice to have the ability to not have to load the entire file into memory to just load one formula or cask.

apainintheneck avatar Dec 30 '23 19:12 apainintheneck

I hadn't realized this but in hindsight it seems obvious. That reminds me that it'd be nice to have memory profiling get added to brew prof at some point.

I did try this at some point but never sent a PR with it. Might take a look again.

The garbage collector bar in --stackprof output does point to the time taken by the GC and you can tell the difference between a run with and without JSON parsing.

That makes sense to me though I'm not sure how complex that bundling would be. At least oj seems to be compatible with the default json library.

Ruby does have its own gem bundling system that we could piggyback off of by adding a line to https://github.com/ruby/ruby/blob/master/gems/bundled_gems.
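For reference, that file lists one gem per line as roughly name, version and upstream URL, so an oj entry might look something like the following (version and exact column layout are illustrative):

oj 3.16.3 https://github.com/ohler55/oj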

There's also the vendor_ruby directory system, where what we use it for is left entirely up to us.

Do you have an existing serialization format in mind here? I can't think of any off-the-shelf format that fits those characteristics though I agree that it would obviously be nice to have the ability to not have to load the entire file into memory to just load one formula or cask.

Not currently, because I've not researched it much - I was talking conceptually. In theory we could have a TOC listing all formula names and their offsets that can be read very quickly, followed by the actual payload. I know stuff like protobuf has lazy properties, but that's not to suggest it's necessarily the perfect fit.
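A rough sketch of what reading such a format could look like, assuming a small JSON TOC of name-to-offset/length pairs prefixed by a fixed-width length field (purely illustrative, not a format anyone has committed to):

require "json"

File.open("formula.dat", "rb") do |f|
  toc_length = f.read(8).to_i          # fixed-width header giving the TOC size in bytes
  toc = JSON.parse(f.read(toc_length)) # e.g. { "a2ps" => [offset, length], ... }

  offset, length = toc.fetch("a2ps")
  f.seek(8 + toc_length + offset)      # jump straight to the one formula we need
  a2ps = JSON.parse(f.read(length))    # parse only that slice
end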

It's an option, but there's maybe not too much appetite for change this soon.

Bo98 avatar Dec 30 '23 21:12 Bo98

2. Reduce the size of the JSON files

We should do this regardless. There are a lot of keys that are literally never used by us yet are downloaded by everyone.

We're allocating an extra 140k hash keys by not eliminating null values from hashes.

This seems like a good idea.

3. Stop parsing the API JSON files twice

If possible: this seems like a good/nice idea.

We do have the problem of native gems here.

We can't/shouldn't really use native gems for anything except developer commands.

The payload being a string was intentional given we were following the JWS standard which defines the payload to be the string.

I wonder about making the payload e.g. Gzipped data?

It's an option, but there's maybe not too much appetite for change this soon.

I don't really have an appetite for this until we try the above ideas and find them insufficient.

I still think the biggest impact will come from reducing the number of keys we actually include in the JSON payload that we use.

MikeMcQuaid avatar Dec 31 '23 17:12 MikeMcQuaid

I wonder about making the payload e.g. Gzipped data?

It can be, though the JSON parse of the actual payload is the slower bit and this won't really help there - only the initial parse.

Bo98 avatar Dec 31 '23 17:12 Bo98

We do have the problem of native gems here.

We can't/shouldn't really use native gems for anything except developer commands.

I assumed that the reason we don't include native gems outside of dev commands was to reduce install time when they need to be updated and avoid problems where native gems refuse to build on some systems. If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right? I agree that it should probably not be the first option though since it adds more complexity.

apainintheneck avatar Dec 31 '23 19:12 apainintheneck

Thoughts on Reducing JSON Size

Are we okay with having two different JSON representations of casks/formulae? One would be for loading packages internally and the other for brew info --json=v2 that will also get shown on formulae.brew.sh. This would require some adjustment to the brew generate-*-api commands.

We could probably remove all top-level hash keys that point to blank data. This would be more automatic. At a glance, it looks like most hash keys that we would expect not to be used would be blank anyway, like pinned, outdated, etc. The load_formula_from_api method would have to be updated to expect more missing hash keys.

Maybe we should store the package data as a hash instead of an array. Internally, we turn the array of package info hashes into a hash of package name to package info every time we load it into memory. I doubt that is slow, but it would require less fiddling.
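As a sketch, that conversion could happen once at generation time instead of on every load (file names are illustrative; the "name" key is taken from the existing payload):

formulae_array = JSON.parse(File.read("formula.json"))        # current shape: Array<Hash>
formulae_by_name = formulae_array.to_h { |f| [f["name"], f] } # proposed shape: Hash keyed by name
File.write("formula_by_name.json", JSON.generate(formulae_by_name))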

apainintheneck avatar Dec 31 '23 20:12 apainintheneck

If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right?

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

Are we okay with having two different JSON representations of casks/formulae? One would be for loading packages internally and the other for brew info --json=v2 that will also get shown on formulae.brew.sh.

Not just OK but: this seems actively desirable.

We could probably remove all top-level hash keys that point to blank data. This would be more automatic. At a glance, it looks like most hash keys that we would expect not to be used would be blank anyway, like pinned, outdated, etc. The load_formula_from_api method would have to be updated to expect more missing hash keys.

Seems like a great idea 👍🏻

Maybe we should store the package data as a hash instead of an array. Internally, we turn the array of package info hashes into a hash of package name to package info every time we load it into memory. I doubt that is slow, but it would require less fiddling.

Also a good idea 👍🏻

MikeMcQuaid avatar Jan 01 '24 17:01 MikeMcQuaid

If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right?

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

That's what I hadn't considered. Good point!


To add a new slimmer JSON representation we'd need to...

  • [x] Update the code used to load casks and formulae from JSON to allow for missing hash keys.
    • https://github.com/Homebrew/brew/pull/16420
    • Note: We should keep the current tests for loading from the public JSON representation for backwards compatibility.
  • [ ] Update formulae.brew.sh API generation to ignore these new files.
  • [ ] Create the new JSON file in the brew generate-*-api commands.
  • [ ] Update formulae.brew.sh API generation to use these new files.

apainintheneck avatar Jan 01 '24 18:01 apainintheneck

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

It doesn't have to. The update cadences for oj (and bootsnap if we bundle that) are more frequent than Ruby's, but very rarely are those updates critical. They can easily wait for the next release. If it's a security release then the release notes will say so and we can react, but there haven't been any in either as far back as I can see in their changelogs. So realistically: it does not need to increase how often we update, just as Ruby upstream doesn't ship a new Ruby every time one of their own bundled gems has a new release, unless it's a security update. And the idea here was to share the same model Ruby upstream has with bundled gems.

We already don't automatically ship new Portable Rubies when a dependency updates, and arguably things like OpenSSL are more important than any gem update (though we only use OpenSSL on a very basic level so most vulnerabilities have largely not been a concern).

(On a related note to this: locally I have put together autobumping for the Portable Ruby repository so expect to see a PR for that this month. That will mean the whole repository is 100% automated except for updates between minor versions e.g. 3.1 -> 3.2, and hitting the release button)

Bo98 avatar Jan 01 '24 18:01 Bo98

  • [ ] Update the code used to load casks and formulae from JSON to allow for missing hash keys.

    • Note: We should keep the current tests for loading from the public JSON representation for backwards compatibility.
  • [ ] Update formulae.brew.sh API generation to ignore these new files.

  • [ ] Create the new JSON file in the brew generate-*-api commands.

  • [ ] Update formulae.brew.sh API generation to use these new files.

Makes sense to me 👍🏻

It doesn't have to. The update cadences for oj (and bootsnap if we bundle that) are more frequent than Ruby's, but very rarely are those updates critical. They can easily wait for the next release. If it's a security release then the release notes will say so and we can react, but there haven't been any in either as far back as I can see in their changelogs. So realistically: it does not need to increase how often we update, just as Ruby upstream doesn't ship a new Ruby every time one of their own bundled gems has a new release, unless it's a security update. And the idea here was to share the same model Ruby upstream has with bundled gems.

Ok, that's fair 👍🏻

(On a related note to this: locally I have put together autobumping for the Portable Ruby repository so expect to see a PR for that this month. That will mean the whole repository is 100% automated except for updates between minor versions e.g. 3.1 -> 3.2, and hitting the release button)

🎉 looking forward to it.

Would be nice for this to be a trivial, automated process.

MikeMcQuaid avatar Jan 01 '24 20:01 MikeMcQuaid

I don't have much to add here but:

  1. I suspect most of the gains will likely come from being able to selectively parse specific formulae/casks from the API JSON (rather than the speed of the JSON parser itself).
  2. I'm still excited by work here in any direction -- feel free to tag me for 👀 on any PRs.

carlocab avatar Jan 03 '24 14:01 carlocab

From quick tests now, while oj is consistently faster it's not as significant as I hoped (~0.05s). I think most of the overhead is potentially the number of objects created, which didn't really change between the two parsers. It'll be interesting to see the numbers for the slimmed JSON suggestion as that could reduce the object count (for the keys, since the values will be a shared nil object).

Vendoring bootsnap in Portable Ruby might still be nice though. Not a deal breaker (~0.12s), but we already put in the effort to integrate it into brew, so we might as well lean on it more rather than have it be something only a couple of maintainers plus a limited amount of our CI use. Maybe need to figure out smarter cache cleaning though.

Bo98 avatar Jan 03 '24 17:01 Bo98

To demonstrate the object thing. A brew info --json=v2 hello run looks something like this:

120943 objects prior to formula JSON load
673778 objects after formula JSON load
699929 objects prior to cask JSON load
857648 objects after cask JSON load
979820 objects at end of brew run

(exact values may vary from run to run)

So JSON data accounts for about 72.5% of all Ruby objects created in a brew run, even when we use <1% of it.

Forcing a GC (several times to make sure) frees up about 8800 slots after a formula JSON load and 14500 after a cask JSON load.

Disabling the GC on its own (for demonstration purposes) saves ~0.1s, though creating the objects themselves probably has overhead too. Again not night and day but these things quickly add up and pair everything together (even oj) and we're looking at <0.5s from 0.75s.
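For anyone wanting to reproduce numbers like these, one rough approach is sampling live object slots around the load (a sketch; not necessarily how the figures above were gathered):

require "json"

before = GC.stat(:heap_live_slots) # live object slots before the load
formulae = JSON.parse(File.read("#{`brew --cache`.strip}/api/formula.jws.json"))
after = GC.stat(:heap_live_slots)  # and after
puts "objects added by formula JSON load: #{after - before}"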

There's not much to take from this comment - just following up on earlier theories. It will be interesting to check back and see which areas improved from trying the various ideas (e.g. JSON slimming) to give an indication of the types of things that are helping more than others. For example, I suspect the whitespace stripping is probably not significant based on the data so far, but we should do it anyway as it's good practice.

Bo98 avatar Jan 03 '24 17:01 Bo98

Why not split the JSON up into one file per formula or cask? (This may be something to consider further down the line, though, depending on how invasive a refactor that might be right now.)

RandomDSdevel avatar Jan 03 '24 18:01 RandomDSdevel

It's a bit of a trade-off. It would certainly be slower to download, but if you mean instead a single download that is split afterwards then maybe, though the extra write IO cost may be notable.

Bo98 avatar Jan 03 '24 19:01 Bo98

but if you mean instead a single download that is split afterwards then maybe

(Nods.) One obvious option would be a tarball.

RandomDSdevel avatar Jan 03 '24 20:01 RandomDSdevel

Yeah, that should definitely be an option. I took a quick stab at seeing how much smaller the formula JSON would get by just removing blank values and unused keys and it made a big difference. This, of course, hasn't been tested yet though so it's possible that some of these are still needed.

IGNORABLE_KEYS = %w[
  oldname
  installed
  linked_keg
  pinned
  outdated
  keg_only
  disabled
  deprecated
].freeze

original = Homebrew::API::Formula.all_formulae
File.write "original.json", JSON.generate(original)

slim = original.transform_values do |hash|
  hash = hash
    .reject { |key, _| IGNORABLE_KEYS.include?(key) }
    .reject { |_, value| value.blank? }

  # Guard against formulae without stable bottle files.
  if (files = hash.dig("bottle", "stable", "files"))
    hash["bottle"]["stable"]["files"] = files.to_h do |tag, file_hash|
      [tag, file_hash.slice("cellar", "sha256")] # Ignore unused "url"
    end
  end

  hash
end
File.write "slim.json", JSON.generate(slim)

Resulting file sizes:

24M original.json
13M slim.json

I was surprised at how much space was taken up by the bottle urls which aren't used at all. That was around 6 megabytes by itself.

{
  "stable": {
    "rebuild": 0,
    "root_url": "https://ghcr.io/v2/homebrew/core",
    "files": {
      "arm64_sonoma": {
        "cellar": "/opt/homebrew/Cellar",
        "url": "https://ghcr.io/v2/homebrew/core/a2ps/blobs/sha256:f6fec638555f04cdcb741cb9d930f95842f603e8d37c6ca5312b25c7a9e585cd",
        "sha256": "f6fec638555f04cdcb741cb9d930f95842f603e8d37c6ca5312b25c7a9e585cd"
      },

Edit: We could also potentially save another half a megabyte by only including the tap info at the top-level of the JSON instead of inside each individual formula definition.

Edit 2: I found that recursively removing blank elements from the formula hashes didn't result in a significantly smaller file.

apainintheneck avatar Jan 04 '24 07:01 apainintheneck

Assuming 7 bottles per formula that's 7 * ~7000 ≈ 50k string objects (probably more from removing the key too) so yeah it might be noticeable on the Ruby side too

Bo98 avatar Jan 04 '24 07:01 Bo98

Edit 2: I found that recursively removing blank elements from the formula hashes didn't result in a significantly smaller file.

It may not slim the file but it will likely reduce the number of required allocations.

MikeMcQuaid avatar Jan 04 '24 15:01 MikeMcQuaid

From quick tests now, while oj is consistently faster it's not as significant as I hoped (~0.05s). I think most of the overhead is potentially the number of objects created, which didn't really change between the two parsers. It'll be interesting to see the numbers for the slimmed JSON suggestion as that could reduce the object count (for the keys, since the values will be a shared nil object).

I was afraid of that.

Vendoring bootsnap in Portable Ruby might still be nice though. Not a deal breaker (~0.12s), but we already put in the effort to integrate it into brew, so we might as well lean on it more rather than have it be something only a couple of maintainers plus a limited amount of our CI use. Maybe need to figure out smarter cache cleaning though.

I think that sounds like a good idea.

apainintheneck avatar Jan 05 '24 05:01 apainintheneck

Again not night and day but these things quickly add up and pair everything together (even oj) and we're looking at <0.5s from 0.75s.

Looks like that's about the absolute limit. Just for fun for a couple hours I experimented to see what a best case scenario could look like: https://github.com/Bo98/lazyjson-ruby (intentionally doesn't include Hash#except/delete etc)

What we have now: ~988,000 objects, ~0.72s
Proof-of-concept: ~160,000 objects, ~0.53s
(Measured on a very basic brew deps hello - so nothing using variations etc. which would need adjusting to work on the PoC.)

Also note that brew usage without any JSON parsing is ~145k objects, so you can subtract roughly that from the above two numbers (and perhaps even more since I didn't force a GC sweep with the above numbers).

So if we can get anything close to those numbers from what we're doing here we will have done well.

Bo98 avatar Jan 09 '24 02:01 Bo98

Very impressive! I don't think we'll get close to that since I'm not aware of any Ruby gems that offer lazy loading for JSON (though to be honest I haven't looked that hard). Of course, we could use your repo but I doubt that's in the cards either.

apainintheneck avatar Jan 09 '24 05:01 apainintheneck

Yeah, there's very likely none as it's sort of a "fake" binding given it doesn't return native Ruby Hashes and Arrays but instead custom ones that don't expose children to Ruby. The idea was to make the GC only see one object (+ any values you extract from it) rather than hundreds of thousands. The custom classes mean it isn't a drop-in replacement for all use cases of JSON parsing in the wild (but code adapted to work with it would remain compatible with existing JSON parsing). The lazy part refers to the Ruby exposure rather than the parsing itself, which is not lazy (but really fast given it uses SIMD instructions like NEON, AVX and SSE). Lazy parsing is possible but only really for streaming and not random access, and is close to being a micro-optimisation when we're in the 2-3% range.

Exploring all this was a nice break from the norm too as I hadn't really tried working with Ruby extensions much before.

Bo98 avatar Jan 09 '24 05:01 Bo98

Based on the discussion in https://github.com/Homebrew/brew/pull/16460 it looks like we want to combine different changes to the JSON API files. The idea being if we're going to break compatibility we might as well just start from scratch.

There are now a few goals here beyond reducing JSON size.

  1. Reduce the size of the JSON file so runtime performance improves.
  2. Combine related JSON files (e.g. tap migrations) to reduce the number of file downloads.
  3. Have the JSON files match the internal data formats.
  4. Combine multiple files so that we only have to download one file per platform.

1. Reduce JSON File Size

I think that this is pretty well sorted out by https://github.com/Homebrew/brew/pull/16460. The cask version needs to be added as well though.

The key is removing fields that aren't actually used when loading packages from the JSON representation, without changing the information that gets transferred. Which is to say, we were adding extra fields which could either be generated or weren't used at all.

This is pretty straightforward after updating the API loading code in Formulary and Cask::CaskLoader to handle empty hash fields.

2. Combine JSON Files

I agree with the comments on the PR and in this issue that we should combine related files into the same one. This means that we can combine the tap migrations with the formula or cask file it corresponds to.

  • formulas

    • formula.jws.json
    • formula_tap_migrations.jws.json
  • casks

    • cask.jws.json
    • cask_tap_migrations.jws.json

It doesn't make sense to combine them in my opinion because we often don't want to load casks and formulas at the same time. This is especially true for Linux users.

  • https://github.com/Homebrew/brew/pull/16460#issuecomment-1895710896

3. Match the Internal Data Formats

Currently we rebuild the hashes loaded from the JSON file after loading them because they're not in the format we need them to be in. Since we're changing this file anyway, we could just have it match 1 to 1 with what we want it to look like. This will reduce the amount of extra work we need to do at runtime.

  • https://github.com/Homebrew/brew/pull/16460#issuecomment-1896006399

Before

The payload currently is just an array of hashes where each hash is a representation of a formula or cask.

  • (Array<Hash>)

After

{
  "formulae" => {
    "name" => {...}, # formula definition
    ...
  },
  "aliases" => {
    "alias_name" => "name",
    ...
  },
  "renames" => {
    "old_name" => "name",
    ...
  }
}
{
  "casks" => {
    "token" => {...}, # cask definition
    ...
  },
  "renames" => {
    "old_name" => "name",
    ...
  }
}

4. Combine multiple files per download

This probably only really makes sense on macOS but we could tarball/gzip the cask and formula files and ship them together. Not only will this reduce network requests but it might also reduce overall file size.

  • Linux
    • Formula JSON
  • macOS
    • Tarball
      • Formula JSON
      • Cask JSON
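A sketch of producing such a bundle on the generation side (file names are illustrative; Gem::Package::TarWriter ships with RubyGems):

require "rubygems/package"
require "zlib"

File.open("macos-api.tar.gz", "wb") do |file|
  Zlib::GzipWriter.wrap(file) do |gz|
    Gem::Package::TarWriter.new(gz) do |tar|
      %w[formula.json cask.json].each do |name|
        contents = File.binread(name)
        tar.add_file_simple(name, 0o644, contents.bytesize) { |io| io.write(contents) }
      end
    end
  end
end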

apainintheneck avatar Jan 19 '24 05:01 apainintheneck

Since these changes are much more involved than what was originally proposed, it means that we'll need parallel download and parsing logic for the old and new formats to be added to the api module, along with extra workarounds in the Formulary and Cask::CaskLoader modules.

The JSON format itself will need to be versioned as well which should be easy enough to add.

{
  "payload" => ...,
  "signature" => ...,
  "version" => 2,
}

Then, we can parse it based on that version number.
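Conceptually, the dispatch could be as simple as the following (the loader method names are hypothetical placeholders):

parsed = JSON.parse(downloaded_json)
case parsed["version"]
when nil, 1 then load_v1_payload(parsed) # older files, assumed to lack a version field
when 2      then load_v2_payload(parsed) # new slimmed format
else raise "Unsupported API JSON format version: #{parsed["version"]}"
end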

This will be awfully difficult to test beforehand so maybe we should roll it out behind an environment variable until we're confident in it. Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Since we expect users to move to a newer version of Homebrew pretty regularly, we'd probably be able to remove the first version of the JSON API code as a follow-up after a month or two.

apainintheneck avatar Jan 19 '24 05:01 apainintheneck

The main difference with this new approach is that it means rewriting a lot of the API code which isn't necessary to just minify the JSON hashes.

apainintheneck avatar Jan 19 '24 05:01 apainintheneck

Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Yeah a clean break would make this very doable and I agree what you describe here would be the approach I would take - generate both old and new on formulae.brew.sh.

Regardless of what we do here, this will be the first time we properly do something like this, so it's very uncertain how many issues it may cause.

We did make a minor compatibility break once before. #15168 wasn't in the original API JSON (shipped in 4.0.12 instead) and I accidentally broke compatibility with pre-4.0.12 JSONs in 4.1.0: #15676. I believe this was the cause of reports like https://github.com/orgs/Homebrew/discussions/4701 and https://github.com/orgs/Homebrew/discussions/4750. Though using an old JSON for so long is also a concern - seems to be some permission issue going on somewhere.

The reverse (new JSON incompatible with old brew) has never been tested and the API minifying approach alone would be the first example of doing that given it needs the other PRs that ignore nil-values on the loader/reader side. So the minifying approach isn't necessarily risk-free - it's a different, unexplored type of risk. Frozen CI images (common in Circle CI etc) without the API pre-cached may be good example of such concern, along with multi-user systems. It may work better, it may work worse - I don't really know.

I've leaned more towards the clean break approach because we can add any necessary compatibility code we need, while with the minifying approach it's impossible to handle compatibility should it turn out to be necessary. It is, however, of course more effort.

Bo98 avatar Jan 19 '24 05:01 Bo98

Regardless of what we do here, this will be the first time we properly do something like this, so it's very uncertain how many issues it may cause.

Would we need to wait for the next major release here? It's an internal API change but it is breaking. At the very least, it should probably wait for a minor release.

The reverse (new JSON incompatible with old brew) has never been tested and the API minifying approach alone would be the first example of doing that given it needs the other PRs that ignore nil-values on the loader/reader side. So the minifying approach isn't necessarily risk-free - it's a different, unexplored type of risk. Frozen CI images (common in Circle CI etc) without the API pre-cached may be good example of such concern, along with multi-user systems. It may work better, it may work worse - I don't really know.

Yeah, I could very well be underestimating the complexity of the minimized approach. Frozen CI images never even crossed my mind, for example.

I've leaned more towards the clean break approach because we can add any necessary compatibility code we need, while with the minifying approach it's impossible to handle compatibility should it turn out to be necessary. It is, however, of course more effort.

This is why I was advocating for the other approach. I'm just lazy and this looks like a whole lot more work. I also think that any additional performance benefits will be marginal at best.

apainintheneck avatar Jan 19 '24 08:01 apainintheneck

I think that this is pretty well sorted out by #16460. The cask version needs to be added as well though.

Agreed 👍🏻

This means that we can combine the tap migrations with the formula or cask file it corresponds to.

Yeh, would be good to migrate these two (and leave "room" in the format to potentially add more tap-specific data if needed in future).

It doesn't make sense to combine them in my opinion because we often don't want to load casks and formulas at the same time. This is especially true for Linux users.

Yup, that's fair.

This probably only really makes sense on macOS but we could tarball/gzip the cask and formula files and ship them together. Not only will this reduce network requests but it might also reduce overall file size.

Yeh, worth seeing if this makes things any smaller/faster. If it doesn't meaningfully: can leave it.

Since these changes are much more involved than what was originally proposed, it means that we'll need parallel download and parsing logic for the old and new formats to be added to the api module, along with extra workarounds in the Formulary and Cask::CaskLoader modules.

Yes, this is non-ideal but hopefully this logic can be separated enough that it's just a matter of deleting a few ifs and method definitions. Could add the usual # odeprecated pattern here.

This will be awfully difficult to test beforehand so maybe we should roll it out behind an environment variable until we're confident in it. Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Good idea, agreed 👍🏻

Would we need to wait for the next major release here? It's an internal API change but it is breaking. At the very least, it should probably wait for a minor release.

Could do this before we remove the generation of the old format.

I'm just lazy and this looks like a whole lot more work.

Lazy is a virtue! This sort of scope creep can be bad, I appreciate you pushing back on it ❤️

The JSON format itself will need to be versioned as well which should be easy enough to add.

This will be v3, FYI.

I think this also makes this a good candidate to be a v3-internal rather than v3 proper; it lets us test out a new format for our own internal use before we can consider moving the formulae.brew.sh APIs to use this format with all data output eventually, too.

MikeMcQuaid avatar Jan 19 '24 08:01 MikeMcQuaid