Lazy-load embedded provider schemas to improve performance
## Background
As documented in https://github.com/hashicorp/terraform-ls/blob/main/docs/benchmarks.md#memory-usage we currently load one big JSON blob with all ~200 providers into memory, which accounts for the majority of the ~300MB of memory usage.
https://github.com/hashicorp/terraform-ls/blob/81b49a174bd8f34a78ab959f7ab88325177b1daf/internal/schemas/gen/gen.go#L163-L180
Aside from memory usage, it also impacts the initialize response time - i.e. the time it takes for the server to become ready for other requests. Loading the schemas currently accounts for 90+% of the ~2 seconds.
https://github.com/hashicorp/terraform-ls/blob/4308af82997298fc051f7f9efaf4866aab60178e/internal/langserver/handlers/service.go#L510-L513
Related:
- https://github.com/hashicorp/terraform-ls/issues/986 and similar issues people reported about memory usage before
- https://github.com/hashicorp/terraform-ls/issues/517
- https://github.com/hashicorp/terraform-ls/issues/506
- https://github.com/hashicorp/terraform-ls/issues/645
- https://github.com/hashicorp/terraform-ls/issues/828
- https://github.com/hashicorp/terraform-ls/issues/193
- https://github.com/hashicorp/terraform-ls/issues/150
## Proposal
There's no exact proposal yet, but these are early ideas we have discussed within the team:
- breaking down the single JSON blob into one file per provider
- lazily loading the schema into memdb, i.e. loading it only if/when we parse the relevant provider constraint
- distributing the JSON files alongside the LS binary, as opposed to embedding them into it
Each area comes with some different trade-offs which may need to be considered alongside the issues referenced above.
Based on various sources and experimentation, it appears that:
- any files embedded via `embed` are only allocated into memory when actually used/needed - the memory allocated as a result does get garbage collected
In my experiment, I loaded a ~70MiB JSON file of schemas into memdb and then deleted it. This resulted in an allocation of 198MiB, which dropped back to zero after garbage collection. The garbage collector ran automatically ~2 minutes after deletion, but I was also able to force it via `runtime.GC()`.
Crucially, the memory usage/profile remains the same regardless of whether the file is loaded from disk via `os` or from the `embed` FS. Therefore we shouldn't need to distribute separate JSON files alongside the binary, which means we don't have to worry about distribution problems in various packaging systems, such as Homebrew or APT/DNF on Linux.
Furthermore, I can confirm that breaking down the JSON would have a positive impact on peak memory usage. When I tried to store just one small part of the ~70MiB JSON into memdb, the whole JSON had to be parsed and allocated first. The memory was later garbage collected, but loading just the necessary data from a smaller JSON file means less pressure on the GC, lower peak memory usage, and likely also less CPU time spent repeatedly parsing the JSON data.