Path towards improved testing (especially end to end testing)

Open • rylev opened this issue on Nov 05 '23 • 3 comments

The following is my attempt to understand the testing needs of Spin and sketch a possible path towards improved end to end testing for Spin.

What needs testing

Before we can improve testing we need to know what actually is under test:

Host Runtime

  • Loading and linking of any WebAssembly binaries built with any version of the Spin SDK or the fermyon:[email protected] world.
  • Host implementations for all Spin interfaces (both the unversioned Spin world used by Spin 1.x SDKs and the fermyon:[email protected] world).
    • Including any changes to host implementations that might happen due to flags being set through the Spin CLI.
  • The correct component export is invoked with the correct inputs and the expected outputs are observed.
  • The Spin runtime produces a "locked application" that can be understood by any Spin-compliant runtime.
  • All versions of the manifest are successfully parsed and result in the runtime being configured correctly.
  • The Spin CLI's error messages (i.e., when issues occur, the correct output is displayed to the user).

Guest Components

  • The SDKs for all supported languages produce WebAssembly binaries that can be linked to and loaded by the Spin runtime.

Spin-related tooling

Additionally, there are other Spin-related tools, beyond building and running Spin apps, that also need to be tested.

  • Spin doctor
  • Spin plugins
  • Spin watch

The status quo

Currently, Spin testing generally merges many of the above concerns into a single test, meaning that when a test fails it's often hard to know why. For example, an end-to-end test not only checks that an SDK (usually the Rust SDK, but sometimes Go) successfully builds a WebAssembly binary, but also runs that binary against the current Spin runtime and checks for outputs. Which outputs are checked is relatively ad hoc, so it is very easy to introduce new functionality in the Spin runtime without testing it.

On the guest side, we currently don't test the Python or JavaScript SDKs at all as part of CI, and testing against the Go SDK is spotty at best.

Testing system requirements

Any testing system we create needs to meet the following requirements:

  • Aim for testing as close to 100% as possible of the functionality listed at the start of this document.
  • Aim to test as much of the functionality as possible in isolation, only integrating when there is a chance that the integration itself might break.
  • Allow for sharing of tests for common functionality.
    • We might want to test some functionality across many different implementations (e.g., two SDKs for different languages that both target the same Spin runtime, or two different Spin-compliant runtimes). We should make as much of the testing framework as possible easily shareable across implementations, only hard-coding the Spin runtime located in this repo when the functionality under test is explicitly specific to this runtime (e.g., error messages).
  • Make tests run as fast as possible.
  • Make adding tests as easy as possible.
  • Make it easy to see where there are gaps in test coverage.
  • Ensure that functionality tested by separate tests overlaps as little as possible (i.e., if a piece of expected behavior changes, we should minimize the number of tests that need to change).
  • Make test failures show clear and understandable reasons for failure to ensure quick fixes.
  • Make the testing harness as simple and easy to change as possible.

Possible steps forward

I unfortunately don't have a grand unified vision of exactly how testing should work in the future, but hopefully this document can guide us as we continue to evolve the system. That being said, there are some items that seem like good steps forward for replacing the e2e tests with something more sustainable:

  • Replace e2e tests with a test suite that works against pre-built WebAssembly components.
    • These components would live in their own repo (perhaps pulled in by the Spin repo as a submodule) which could be used by other Spin runtimes for their testing purposes.
    • Having pre-built components means that running the tests should be much faster, as the components do not need to be built.
    • This also breaks the testing dependency between the SDK and the runtime, so that the runtime can be tested independently of the SDK.
    • These pre-built components could be written in many languages or all in the same language as the language used to create them is not really a concern to the runtimes being tested. In fact, we may wish to create these components without using an SDK.
  • Create a "test runtime" for testing SDKs
    • Given that the Spin runtime can be tested independently of the SDKs, we can use a "test runtime" that expects the same shape of component as the Spin runtime but only tests that inputs into the runtime and results back from the runtime are translated appropriately. For example, when testing key-value#set, we wouldn't test that the value is actually set in a store, just that the value passed to the set function from the SDK under test correctly makes it to the test runtime and that the SDK sees the correct return value (a minimal sketch of this idea follows after this list).
    • This means that we don't ensure the full e2e story for SDKs and the runtime (which has proven hard and error-prone to do in practice); we just ensure that the SDK conforms to the correct component contract, and we use the runtime tests described above to ensure that the runtime can execute any component that conforms to that contract. This moves testing from an M:N problem (M SDKs against N runtimes) to two independent, smaller problems: a shared set of test components run against each runtime, and one test runtime run against each SDK.
    • This would also make testing SDKs much more systematic, as the test runtime would test all imports, meaning that implementers of new functionality only need to update the test runtime and all SDKs would then receive test coverage. Each SDK would then either implement the new functionality to make the test pass or explicitly opt out (perhaps because no one is currently available to implement the functionality in some language's SDK). This allows for a much easier overview of the state of each SDK.
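
To make the "test runtime" idea a bit more concrete, here is a minimal sketch in plain Rust of the kind of recording host it implies. Everything here is invented for illustration (RecordingKeyValue and its set signature are not Spin's actual host bindings); the point is only that the host records what the SDK passed in and returns a canned result, so the test asserts on the contract rather than on real storage.

```rust
// Hypothetical recording host for a "test runtime". Instead of a real
// key-value store it remembers every call and returns a fixed result, so a
// test only checks that the SDK's arguments and return-value handling cross
// the component boundary intact. This is illustrative, not Spin's actual
// host implementation.
#[derive(Default)]
pub struct RecordingKeyValue {
    // Every (key, value) pair the guest passed to `set`, in call order.
    pub calls: Vec<(String, Vec<u8>)>,
}

impl RecordingKeyValue {
    // Mirrors the shape of a key-value `set` operation: record the inputs,
    // hand back a canned success.
    pub fn set(&mut self, key: String, value: Vec<u8>) -> Result<(), String> {
        self.calls.push((key, value));
        Ok(())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn set_arguments_reach_the_host_unchanged() {
        let mut host = RecordingKeyValue::default();
        // In the real setup this call would come from a guest component built
        // with the SDK under test; calling it directly just shows the assertion.
        host.set("greeting".into(), b"hello".to_vec()).unwrap();
        assert_eq!(host.calls, vec![("greeting".to_string(), b"hello".to_vec())]);
    }
}
```

A real test runtime would expose something like this behind the Spin world's host interfaces and drive it from a guest component built with the SDK under test, but the assertion keeps the same shape: inputs in, canned outputs back, no real store involved.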

rylev commented Nov 05 '23 13:11

@rylev This is great! I particularly appreciate your emphasis on granularity: that we get a clear message "you broke SDK 1.3" rather than "test_http_go_works failed".

One tiny thought: it may be implicit in the "any version of the SDK" clause, but it might be worth mentioning OCI images produced by earlier versions of Spin.

itowlson commented Nov 05 '23 22:11

Big +1 to this effort, Ryan. This was long overdue. Separating test concerns is definitely going to help here.

rajatjindal commented Nov 06 '23 03:11

After many PRs aimed at improving runtime testing (e.g., https://github.com/fermyon/spin/pull/2150), we've come a long way with regards to how we test the Spin runtime.

With this in mind I'd like to once again take a step back and do some thinking about how we rationalize testing in Spin.

Status Quo

We currently have 4 different types of testing:

  • Unit
  • Runtime
  • Integration
  • End to end (aka e2e)

Unit testing

Unit testing is fairly straightforward. This is the type of testing that is typically scoped to a local crate to test internal (sometimes private) functionality. These tests run using the regular Rust testing mechanism. We should continue with the status quo here.
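
As a reference point only, here is a minimal sketch of the kind of test meant here; normalize_route is an invented helper, not real Spin code. The test sits in the same file as the (private) function and runs under the ordinary cargo test.

```rust
// An invented, crate-local helper: not real Spin code, just the shape of the
// thing a unit test targets.
fn normalize_route(route: &str) -> String {
    // Treat "/foo/" and "/foo" as the same route.
    route.strip_suffix('/').unwrap_or(route).to_string()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn trailing_slash_is_ignored() {
        assert_eq!(normalize_route("/foo/"), "/foo");
    }
}
```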

Runtime Testing

Runtime testing is the newest kind. It is aimed at testing the Spin runtime. In this regard it's like a classic integration test, but its aim is to test only the runtime. There are many aspects of Spin that the runtime tests simply ignore (i.e., anything that is not runtime behavior). For example:

  • spin build
  • Malformed spin.toml files
  • Any additional Spin tooling (e.g., spin doctor)
  • The Spin CLI
  • Loading of components
  • Triggers

Simply put, the runtime tests assume a Spin app has been loaded and the app has been triggered somehow. They then ensure that the component runs properly against the Spin runtime.

Integration and e2e tests

These have been grouped together because it's not currently clear why we have both types. They both seem to be testing Spin end to end, with some overlap. For example, both seem to test the following (which are not appropriate as unit or runtime tests):

  • Plugins
  • Common user error scenarios like: what happens when no wasm component is supplied?
  • The HTTP trigger
  • spin build

Why we have two mechanisms for testing these things is not entirely clear.

Looking ahead

So, given the status quo, it seems the challenge ahead is to either unify integration and e2e testing or come up with a clear rule for when something belongs in one or the other.

Additionally, such high-level integration tests are hard to organize, as the testing surface is huge; usually you end up with what seems like a random assortment of tests that don't do a great job of giving us confidence that Spin has good test coverage. We must come up with a better way of organizing our tests so that we can hope to have oversight of our test coverage.

Next steps:

Given all of this, here is my plan for the near term:

  • Try to extract any pieces of e2e or integration testing that could be better covered by a runtime test
  • Make it even clearer how runtime tests differ from e2e/integration tests by removing the need for runtime tests to be run against an actual Spin executable
  • Meet with others to see how they rationalize the differences between e2e and integration testing
  • See if any of the infrastructure used for runtime tests could be reused in e2e/integration testing (e.g., can runtime test services also be used for e2e tests?)

rylev commented Dec 21 '23 14:12