swift-openapi-generator icon indicating copy to clipboard operation
swift-openapi-generator copied to clipboard

Consider spliting Components into separated files

Open bartoszirzyk opened this issue 1 year ago • 11 comments

Hi, I've got an openapi.yaml file which has 64k lines. Generator produces Types.swift with 135k lines. It's a big file, hard to navigate and scroll through. It'd be nice to have each Schemas element in separated file and maybe other generated types. Probably it's not what we want as a default behaviour but having a feature flag to enable such thing would be appreciated.

bartoszirzyk avatar Sep 08 '23 08:09 bartoszirzyk

Hi @bartoszirzyk - the current behavior is mainly driven by the fact that the recommended way to use the generator is through the build plugin, and the build plugin requires the name of the output files to be known before the OpenAPI document is parsed (due to how SwiftPM plugins work). So unfortunately I don't really see this happening until SwiftPM relaxes that requirement.

Out of curiosity, when do you find you need to look at the generated code, as opposed to looking at the source OpenAPI document and then using autocompletion? The generated code is considered an implementation detail that adopters should have to look at much.

czechboy0 avatar Sep 08 '23 08:09 czechboy0

FWIW I think this is a reasonable ask. Experimentally, it seems that the compiler is having a much harder time with one large file (cf. more, smaller files). An example on my machine shows pretty substantial improvement:

% wc -l Sources/GithubTypesOneFile/*
457133 Sources/GithubTypesOneFile/Types.swift

% swift package clean && swift package update && swift build --skip-update --target GithubTypesOneFile
Build complete! (647.71s)
% wc -l Sources/GithubTypesMajorSplit/*
 10492 Sources/GithubTypesMajorSplit/APIProtocol.swift
312924 Sources/GithubTypesMajorSplit/Components.swift
133735 Sources/GithubTypesMajorSplit/Operations.swift
    15 Sources/GithubTypesMajorSplit/Servers.swift
457166 total

% swift package clean && swift package update && swift build --skip-update --target GithubTypesMajorSplit
Build complete! (412.02s)
% wc -l Sources/GithubTypesComponentsSplit/*
 10492 Sources/GithubTypesComponentsSplit/APIProtocol.swift
    13 Sources/GithubTypesComponentsSplit/Components.swift
    30 Sources/GithubTypesComponentsSplit/Headers.swift
133735 Sources/GithubTypesComponentsSplit/Operations.swift
   657 Sources/GithubTypesComponentsSplit/Parameters.swift
    15 Sources/GithubTypesComponentsSplit/RequestBodies.swift
   491 Sources/GithubTypesComponentsSplit/Responses.swift
311782 Sources/GithubTypesComponentsSplit/Schemas.swift
    15 Sources/GithubTypesComponentsSplit/Servers.swift
457230 total

% swift package clean && swift package update && swift build --skip-update --target GithubTypesComponentsSplit
Build complete! (422.18)
% wc -l Sources/GithubTypesSchemasSplit/*
 10492 Sources/GithubTypesSchemasSplit/APIProtocol.swift
    13 Sources/GithubTypesSchemasSplit/Components.swift
    30 Sources/GithubTypesSchemasSplit/Headers.swift
133735 Sources/GithubTypesSchemasSplit/Operations.swift
   657 Sources/GithubTypesSchemasSplit/Parameters.swift
    15 Sources/GithubTypesSchemasSplit/RequestBodies.swift
   491 Sources/GithubTypesSchemasSplit/Responses.swift
100277 Sources/GithubTypesSchemasSplit/Schemas.1.swift
100545 Sources/GithubTypesSchemasSplit/Schemas.2.swift
110984 Sources/GithubTypesSchemasSplit/Schemas.3.swift
    15 Sources/GithubTypesSchemasSplit/Schemas.swift
    15 Sources/GithubTypesSchemasSplit/Servers.swift
457269 total

% swift package clean && swift package update && swift build --skip-update --target GithubTypesSchemasSplit
Build complete! (140.95s)
% wc -l Sources/GithubTypes50kSplit/*
 10492 Sources/GithubTypes50kSplit/APIProtocol.swift
    13 Sources/GithubTypes50kSplit/Components.swift
    30 Sources/GithubTypes50kSplit/Headers.swift
 50029 Sources/GithubTypes50kSplit/Operations.1.swift
 50135 Sources/GithubTypes50kSplit/Operations.2.swift
 33596 Sources/GithubTypes50kSplit/Operations.3.swift
    13 Sources/GithubTypes50kSplit/Operations.swift
   657 Sources/GithubTypes50kSplit/Parameters.swift
    15 Sources/GithubTypes50kSplit/RequestBodies.swift
   491 Sources/GithubTypes50kSplit/Responses.swift
 50162 Sources/GithubTypes50kSplit/Schemas.1.1.swift
 50128 Sources/GithubTypes50kSplit/Schemas.1.2.swift
 50079 Sources/GithubTypes50kSplit/Schemas.2.1.swift
 50479 Sources/GithubTypes50kSplit/Schemas.2.2.swift
 55450 Sources/GithubTypes50kSplit/Schemas.3.1.swift
 55546 Sources/GithubTypes50kSplit/Schemas.3.2.swift
    15 Sources/GithubTypes50kSplit/Schemas.swift
    15 Sources/GithubTypes50kSplit/Servers.swift
457345 total

% swift package clean && swift package update && swift build --skip-update --target GithubTypes50kSplit
Build complete! (116.46s)

I think we have other features we can explore too (e.g. only generating the bits that users want to use), which will have much bigger impact, but I think we should keep this on the backlog.

simonjbeaumont avatar Sep 21 '23 15:09 simonjbeaumont

Renaming as we should really try hard not to make this a config option.

czechboy0 avatar Sep 21 '23 18:09 czechboy0

I'd love to see this — I setup the GitHub OpenAPI YAML spec in a standalone project, and the generated files took over 23 minutes to compile on an 16 core M3 Max with 64Gb RAM (2 of the performance cores were completely pegged, while the others did nothing).

I ended up needing to abandon the use of the Swift OpenAPI Generator packages for my project due to this issue.

Splitting the files up should result in dramatic compilation speedups under current Swift releases.

tonyarnold avatar Dec 02 '23 11:12 tonyarnold

@tonyarnold this is certainly something we can revisit post-1.0 as I believe our stance on the number of files, their names, and contents is not considered API.

In the meantime, have you tried adding a filter to the config file. You can read more about it here: https://swiftpackageindex.com/apple/swift-openapi-generator/0.3.5/documentation/swift-openapi-generator/soar-0008.

It even includes some examples using the GitHub API.

simonjbeaumont avatar Dec 02 '23 12:12 simonjbeaumont

Any update regarding this?

When generating a client for the Stripe API, the Types.swift file exceeds 24MB. This cause huge troubles on at least two levels that I can identify:

  • compilation time (as mentioned above). It makes me wonder why bother using Mx Ultra Macs if you need 20 minutes to build a single file
  • memory usage: compiling this single file required more than 9GB of memory.

Beside behind a bit ridiculous 😅 it gets costly as well since build machines need to be provisioned with high amounts of RAM and run for longer than they would need otherwise. Imagine generating two different API client targets in the same SPM ?

Instead of having Types.swift, would it be acceptable to have this kind of generated output?

  • APIProtocol.swift
  • Servers.swift
  • Components.<EachTypeName>.swift (or a Components folder with the files inside)
  • Operations.<EachOperationName>.swift (or an Operations folder with the files inside)

I'd be glad to work on this if the given solution is accepted by the project maintainers.

abidon avatar Sep 06 '24 15:09 abidon

Instead of having Types.swift, would it be acceptable to have this kind of generated output?

  • APIProtocol.swift
  • Servers.swift
  • Components.<EachTypeName>.swift (or a Components folder with the files inside)
  • Operations.<EachOperationName>.swift (or an Operations folder with the files inside)

Sadly no, because the Swift package plugin contract mandates that you declare exactly what files are being generated ahead of time as output of the plugin.

This is why the above experiments used numbered based approach.

I'd be glad to work on this if the given solution is accepted by the project maintainers.

That's awesome to hear and I think it would be very welcome news to our adopters to have a solution. I think we need to think exactly what options we have here.

I did some breakdown above in splitting the files across different dimensions, but ones that are deterministic. It had some favourable results for that API. I wonder if you want to run that against the Stripe API and see if it yields acceptable results there too.

Note also, that this has been less of an issue for many adopters since we introduced filtering as a pre-generation step. But I accept that, if you really need to use the whole API, this is not an option :)

simonjbeaumont avatar Sep 06 '24 15:09 simonjbeaumont

Agreed with Si, but I'd really encourage you to consider filtering. Generating the whole API is rarely the right solution, only really when you're either implementing the service (server generation), or if you're truly using 100% of the operations (very rare). In both cases though, you can use filtering eg by tag to break the operations into individual modules to keep things reasonably sized.

czechboy0 avatar Sep 06 '24 16:09 czechboy0

In both cases though, you can use filtering eg by tag to break the operations into individual modules to keep things reasonably sized.

That would be nice if only the Stripe OpenAPI spec had tags 😅 but I know there are other ways to filter as well (by operation, by path, ...)

Sadly no, because the Swift package plugin contract mandates that you declare exactly what files are being generated ahead of time as output of the plugin.

Ok, that's a critical piece of information I was unaware of! 💡

I saw some lib the other day (can't remember which) that was probably code generated as well, and that used an alphabetical approach, such as Types_A.swift, Types_B.swift, ... that could be one way to go with the limitation while remaining deterministic. There are still some cases that wouldn't be best covered if there are some namespacing in the types. Or for operation names as well as they often start with a limited set of verbs (such as Create, Get). Anyway it would still speed up compilation, even if some files such as Operations_C.swift or Operations_G.swift are bigger than Operations_Z.swift (that would probably be empty most of the time!).

I will think about some alternative next week after documenting myself about compiler plugins limitations and diving into the plugin source code.

abidon avatar Sep 07 '24 10:09 abidon

Yes, splitting files across a deterministic number of files could work, and accept that for small amounts of generated code, many of these will be empty. We already have that today: we generate all three of Types.swift, Client.swift, and Server.swift, regardless of the generation mode and generate empty files if that mode hasn't been requested.

The challenge will be working out how and when to split. Maybe we just partition the generation space over a dimension, e.g. operations, and evenly distribute over our set of files. That won't result in an even distribution of generated code because some operations will result in substantially more code than others. But it might be a good start and we'd be free to tweak the heuristic without breaking any API, e.g. by weighting the operations by the number of parameters.

As part of this exercise we should try and be data-driven as to the ideal size of file and, therefore, the number of files we need. For example, it would be great to plot compile time with file size and see how linear it is and/or where it starts to degrade beyond linear.

simonjbeaumont avatar Sep 07 '24 16:09 simonjbeaumont

Hi @simonjbeaumont, it's me again 😅

After diving into the code of the project, I have a question regarding one of your previous statements:

[...] the Swift package plugin contract mandates that you declare exactly what files are being generated ahead of time as output of the plugin.

From my understanding, you were referring to this part of the code, where we need to compute the value of outputFiles which is later passed to the build command will proceed to the generation of the Swift files:

https://github.com/apple/swift-openapi-generator/blob/7c36ba905f6c0c9631d86b50b2b58e241bcb1837/Plugins/OpenAPIGenerator/plugin.swift#L24-L40

While it would probably require a bigger refactoring, I don't understand how the first idea I submitted (which is generating Components.<EachComponentType>.swift and Operations.<EachOperationType>.swift) wouldn't be possible? Except maybe if opening files is part of some sandbox-restricted operation?

Yes, it would generate a huge amount of Swift files for OpenAPI documents such as the one from Stripe or GitHub, but it would also keep the distribution even.

Yes, we would increase the number of times we pay the initialization of an instance of the Swift compiler, is it that high tho?


I come back to this idea because – while I understand filtering is a working solution for some projects – it doesn't work for us sadly. Right now our compilation time is above 5 minutes and we haven't yet finished the Stripe integration in our product, which means we will probably add more operations to the filtered list in the upcoming days.

As a workaround, I was tempted to create multiple SPM targets for each "group of requests" that make sense together (let's say one target for each tag if only Stripe's OpenAPI had specified tags), but it's inconvenient on other levels:

  • Using the generated Clients from each targets will require namespacing (a.k.a. StripeCustomer.Client) to prevent ambiguity (since, from what I saw, the generated client typename isn't configurable)
  • Generating some "shared" components multiple times (such as Components.Schema.card, Components.Schema.customer) which would cause ambiguity as well (same names but different modules, distinct types that are not interchangeable)
  • The bootstrapping effort

What are your thoughts on the matter ?

abidon avatar Sep 11 '24 13:09 abidon

Generating some "shared" components multiple times (such as Components.Schema.card, Components.Schema.customer) which would cause ambiguity as well (same names but different modules, distinct types that are not interchangeable)

That's a good point, and could be helped if we implemented #25.

As a workaround, I was tempted to create multiple SPM targets for each "group of requests" that make sense together (let's say one target for each tag if only Stripe's OpenAPI had specified tags)

That's what I'd recommend in the meantime. Large projects like GitHub and Kubernetes often split APIs into groups by tags like this.

czechboy0 avatar Oct 04 '24 07:10 czechboy0

@abidon First off, sorry I missed your other reply.

While it would probably require a bigger refactoring, I don't understand how the first idea I submitted (which is generating Components.<EachComponentType>.swift and Operations.<EachOperationType>.swift) wouldn't be possible? Except maybe if opening files is part of some sandbox-restricted operation?

To help me understand, are you proposing we do a quick parse of the document and dynamically determine the output files based on the content? Even if it is possible, my current understanding is that the build plugin, at this point of generating the build graph for SwiftPM/Xcode, is supposed to be very cheap and not do a bunch of work.

I'd be glad to work on this if the given solution is accepted by the project maintainers.

@czechboy0 might have opinions but I imagine we'd be OK with some form of deterministic splitting so long as we can agree on what the split is in some data-driven way?

simonjbeaumont avatar Oct 04 '24 13:10 simonjbeaumont

we'd be OK with some form of deterministic splitting so long as we can agree on what the split is in some data-driven way?

Yes, absolutely. Just keep in mind the limitation mentioned by Si: "the build plugin, at this point of generating the build graph for SwiftPM/Xcode, is supposed to be very cheap and not do a bunch of work" - even further, the build plugin is meant to construct just the build command, without actually reading the input files. That allows the build system to not have to rerun the plugin code itself again, it'll only rerun the build command (the invocation of our CLI) that takes the two YAML files as input, and produces the outputs.

That's why we can't really vary the number/names of outputs based on the file inputs.

czechboy0 avatar Oct 04 '24 13:10 czechboy0

we'd be OK with some form of deterministic splitting so long as we can agree on what the split is

Well given the original point made by @czechboy0

The generated code is considered an implementation detail that adopters should have to look at much.

Does it, or should it, matter that much how it's split. Isn't the more important thing that there's n number of files available for the generator to use?

the build plugin is meant to construct just the build command, without actually reading the input files.

Could it go really dumb with the approach and read the file size of the Open API document and make assumptions that the bigger the document the more files required? It's a bit arbitrary, but at least reading the file size is fast.

The generator would need to have the ability to work with whatever it is given. So it very much would need to be a best effort to split across the files its been supplied. Whether it's by tag, if available, or based on the experiment that @simonjbeaumont did (https://github.com/apple/swift-openapi-generator/issues/253#issuecomment-1729861562) it looks like schemas is probably the better split.

That allows the build system to not have to rerun the plugin code itself again, it'll only rerun the build command

I expected it would be similar to the build phases, where if the input doesn't change then it doesn't run anything again, but if the input changes it runs script. Because surely if the plugin definitions are truely expected to be inexpensive then not running the plugin again is an optimisation doesn't provide any value and just needlessly limits what the plugin can do/optimise itself between changes in inputs.

theoriginalbit avatar Oct 04 '24 22:10 theoriginalbit

I don't believe we can use the size either, because the principle is that the input file can't affect the number/names of the output files in any way. Otherwise we might introduce an inconsistency to the build process, as if you changed the size of the OpenAPI document, it wouldn't rerun the plugin script, only the underlying CLI. But if you restarted Xcode, for example, then it'd take effect and produce a different number of files.

This is why we always need to produce all 3 files (types, client, server), and we keep some of them empty when that mode isn't requested. But we can't vary the output files based on either the OpenAPI document, or the config file.

czechboy0 avatar Oct 05 '24 06:10 czechboy0

If we really wanted this ability to have dynamic behaviour in the plugin, we'd need to use a prebuild command instead of a build command, but I think that will mean it's run on every build, which is probably not what we want (e.g. it will run when the OpenAPI document hasn't been touched):

Build tool plugins

Build tool plugins are invoked before a package is built in order to construct command invocations to run as part of the build. There are two kinds of commands that a build tool plugin can return:

  • prebuild commands — are run before the build starts and can generate an arbitrary number of output files with names that can't be predicted before running the command
  • build commands — are incorporated into the build system's dependency graph and will run at the appropriate time during the build based on the existence and timestamps of their predefined inputs and outputs

Build commands are preferred over prebuild commands when the paths of all of the inputs and outputs are known before the command runs, since they allow the build system to more efficiently decide when they should be run. ... Prebuild commands should be used only when the names of the outputs are not known until the tool is run — this is the case if the contents of the input files (as opposed to just their names) determines the number and names of the output files. Prebuild commands have to run before every build, and should therefore do their own caching to do as little work as possible to avoid slowing down incremental builds. ...

— source: https://github.com/swiftlang/swift-package-manager/blob/main/Documentation/Plugins.md

The guidance above suggests it's possible, but we'd need to do our own caching. Maybe that's worth investigating. We can always check the hash of the document and config file.

simonjbeaumont avatar Oct 07 '24 07:10 simonjbeaumont

Apart from performance considerations, why wouldn’t it be what we want?

Frizlab avatar Oct 07 '24 07:10 Frizlab

Apart from performance considerations, why wouldn’t it be what we want?

I expect that is the main consideration here. The whole point of this issue was because, for large OpenAPI documents, the generation and compilation can be a problem.

If we think we can do the prebuild command with very cheap cache checking then it might be fine. Although it still won't be quite as efficient as the build system doing its timestamp based checks because, even if that's all we did, it's gonna have to fork our process to do that.

simonjbeaumont avatar Oct 07 '24 08:10 simonjbeaumont

If we investigate this, we need to make sure that the build system doesn't needlessly rerun commands that depend on our command's output.

To be frank, it might be worthwhile to open a thread on the SwiftPM repo itself and discuss there. This is something that I imagine many build plugins would want, so we can pool resources.

czechboy0 avatar Oct 07 '24 08:10 czechboy0

If we investigate this, we need to make sure that the build system doesn't needlessly rerun commands that depend on our command's output.

I think that if our command doesn't touch (read: update the timestamp) any of the outputs, it won't. But I agree it's worth an investigation.

simonjbeaumont avatar Oct 07 '24 08:10 simonjbeaumont

What'd be a shame is that even if nothing changed, any project that includes Swift OpenAPI Generator would suddenly lose truly cheap "no-op" builds, and would gain at least one extra process launch in the critical path. One for each OpenAPI document in their full dependency graph. It might not be significant, but this is where I'm coming from and why I'd like to exhaust all other paths before we do this.

czechboy0 avatar Oct 07 '24 08:10 czechboy0

any project that includes Swift OpenAPI Generator would suddenly lose truly cheap "no-op" builds, and would gain at least one extra process launch in the critical path. ... I'd like to exhaust all other paths before we do this.

We should be realistic about how cheap this truly cheap "no-op" build is. I imagine that it's probably doing a bunch of things. I'd also like for us to measure what the impact of forking a process that just checks the timestamps of two files is and writing that out to a build file somewhere and exiting early if that resulted in any change. I'd imagine that both build systems are doing a lot of things in a "no-op" build. If this additional check is negligible it could be justifiable.

My position is that what we have now is not scaling well for very large documents and we have a number of strategies to play with here to make that more acceptable for these adopters and I don't think we're ready to rule anything out just yet.

One thing to consider about dynamically named outputs is it might actually help with incremental builds. For example, imagine a solution where the files were split by tag (I don't think this is a good idea for other reasons, but bear with me)... a change to an OpenAPI document might actually not need to touch the output files for all the tag-based outputs that are not related to the change. Compare this to now where any change, no matter how small or targeted, results in the recompilation of all types because they share a file.

But, as a pragmatic next step, the least controversial solution is the hard-coded split across any dimensions we can pick.

simonjbeaumont avatar Oct 07 '24 09:10 simonjbeaumont

Yup agreed all around. We can continue to investigate a hand-rolled solution here, but should really also start a thread on the SwiftPM repo. Again, this is not a unique problem to us, we're just ahead of the curve and experiencing the limitations.

czechboy0 avatar Oct 07 '24 10:10 czechboy0