FVM - Automated Testing, Fuzzing and Optimisation
RFP Proposal: FVM - Automated Testing, Fuzzing and Optimisation
Name of Project: FVM - Automated Testing (Fuzzing for FVM)
Link to RFP: Please link to the RFP that you are submitting a proposal for.
RFP Category: devtools-libraries for FVM
Proposer: @jacarte
Do you agree to open source all work you do on behalf of this RFP and dual-license under MIT, GPL, and APACHE2 licenses?: Yes
Project Description
Summary and outline
keywords: fuzzing, super-diversification, optimization, identification of compiler timebombs, attack vectors, wasmtime.
This project aims to automatically test and benchmark the use of wasmtime and, consequently, the Cranelift compiler. Specifically, we expect to find combinations of Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to misbehave by:
- Producing unacceptably slow code (with respect to the gas model).
- Consuming an exorbitant amount of memory or time (on compile) with respect to the size of the Wasm bytecode.
- Producing unacceptably large machine code (with respect to the Wasm bytecode).
- Producing invalid native code (that can crash).
This proposal aims to extend the current testing strategies of wasmtime with FVM-related checking. We plan to do so by including exhaustive and focused fuzzing techniques in the current testing of the FVM.
Fuzzing
Fuzzing is an automated testing technique to find crashes. The general idea is that, given the input specification, the fuzzer generates random data to pass to the program and crash it. In our case, the inputs for fuzzing the usage of wasmtime in the FVM are the Wasm binary and the configuration/instantiation of the wasmtime engine.
To address fuzzing, one usually follows one of two strategies: 1) generating inputs for the program under test as fast as possible, or 2) generating "smart", structured inputs. The first explores the whole input space as quickly as possible; the second generates structured data targeting specific parts of that space. We will use the second strategy.
To generate the inputs for FVM fuzzing:
- the input Wasm binary: We will use Wasm binaries that we know are valid end-to-end, i.e., binaries that we know parse correctly and run in the FVM. To fuzz the FVM engine and how wasmtime is used, we will generate several Wasm programs out of the valid ones, in two modes: semantically equivalent and non-semantically equivalent. A semantically equivalent variant is expected to produce the same result as the original when executed in the FVM; if it does not, it should be reported as a binary with undesired behavior, for example when the machine code generated by Cranelift is larger than expected. Notice that the goal concerns the size of the machine code itself: a small Wasm program that compiles to larger machine code than the original should not be accepted. On the other hand, if one of the generated binaries behaves "better", we can report it as an "improvement"; this methodology could therefore discover missed optimizations. To generate the semantically equivalent programs, we will use the tool wasm-mutate. One member of this proposal's team worked on this tool for Fastly. It can generate a large number of semantically equivalent programs out of a single binary, making it feasible for our purposes. The non-semantically equivalent variants will be used for traditional fuzzing campaigns against the underlying compiler, to find unexpected behaviors in it.
~~- the configuration/instantiation of wasmtime: The parameters of the instantiation of wasmtime will be randomly generated.~~
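As a sketch of the determinism we need from seed-driven variant generation, the snippet below derives a reproducible "mutation plan" from a fuzzer-provided seed. The PRNG, `MUTATION_COUNT`, and the plan itself are our own illustration; wasm-mutate's real API differs.

```rust
// Number of available mutation kinds (placeholder value for illustration).
const MUTATION_COUNT: u64 = 4;

/// One step of splitmix64: a tiny, self-contained deterministic PRNG.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Derive `n` mutation indices from a seed. The same seed always yields
/// the same plan, so a fuzz case can be reproduced exactly.
fn mutation_plan(seed: u64, n: usize) -> Vec<u64> {
    let mut state = seed;
    (0..n).map(|_| splitmix64(&mut state) % MUTATION_COUNT).collect()
}
```

Feeding the plan back into a mutator such as wasm-mutate would then regenerate the identical variant from the identical (seed module, seed) pair.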
Correctness Oracles explained
We use the term "crash" more generically than its traditional definition. Milestones 1, 2, 3 and 4 instantiate our definition of "crash" as oracles. We plan to create a generic architecture in which different oracles (4 in this proposal) can be defined to stress the usage of wasmtime.
Potential users of the architecture should be able to easily define oracles as fuzz targets. Besides, each oracle should be easy to integrate with libFuzzer. Each fuzz target should be deterministic, i.e., each run with the same parameters and the same random seed should produce the same result/crash.
An oracle simply returns true|false, depending on the "crashing" criterion; for example, whether the size of the generated machine code for platform X is larger than Y. (The return type might change in a future implementation; true|false is the simplest information we can get from an oracle.) Each oracle should be created as a fuzz target.
Oracles 1, 2, 3 and 4 can be categorized as static or dynamic. For example, oracles 1 and 2 do not need to execute the actual Wasm binary, so they can check correctness faster than oracles 3 and 4. We therefore order the oracles to implement by complexity, starting with those that involve only static analysis of the code.
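The oracle architecture described above could be sketched as follows. The trait name, the `FuzzCase` struct, and the example size oracle are hypothetical illustrations, not an existing wasmtime or FVM API.

```rust
/// One fuzz case: the input module and the code the compiler emitted for it.
struct FuzzCase {
    wasm: Vec<u8>,         // input Wasm module
    machine_code: Vec<u8>, // code produced by the compiler under test
}

/// Generic oracle interface: each "crash" criterion is one implementation,
/// and each implementation would be wired up as its own fuzz target.
trait Oracle {
    /// Returns true when this case violates the oracle's criterion.
    fn violated(&self, case: &FuzzCase) -> bool;
}

/// Static oracle in the style of M2: flags machine code that is more than
/// `ratio` times larger than the Wasm input.
struct CodeSizeOracle {
    ratio: usize,
}

impl Oracle for CodeSizeOracle {
    fn violated(&self, case: &FuzzCase) -> bool {
        case.machine_code.len() > case.wasm.len() * self.ratio
    }
}
```

Because each oracle only returns a boolean, new criteria can be added without touching the fuzzing driver.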
Requirements for this implementation
- Access to a dedicated repository in the filecoin git organization.
- Possibility to add the proposed implementation as continuous integration on the repository.
- Being assigned a contact on the FVM team who follows the progress and is available for review/design questions within 24 hours, to ensure proper communication.
- Access/Permission to use the oss-fuzz project, since fuzzing campaigns should run 24/7.
M1 Milestone - Produce unacceptably slow code (with respect to the gas model)
- Define oracles for slow code
- Use of static analyses as proxies for it, for example, cyclomatic complexity of the code or size of the code.
- If the Wasm binary takes inputs: use an approach like Kelinci to fuzz the execution of the binary. To our knowledge, this has not been done before. Approaches like SWAM+AFL provide a minimal MVP of the desired tooling.
- This technique will allow us to find inputs that make the Wasm binary execute slower than usual.
- Implement a sandbox mechanism for the cases where the Wasm binary is actually executed.
- Use "superoptimization" and "reduction" to:
- "Improve" Wasm code, preventing unneeded spending of gas.
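The static-proxy idea above (using cyclomatic complexity or code size as a stand-in for slowness) can be sketched by counting Wasm control-flow opcodes. The opcode values are the real Wasm ones, but treating their raw count as a complexity proxy, and the threshold logic, are our simplification: a real implementation must decode the binary properly, since immediates can contain bytes that look like opcodes.

```rust
/// Count control-flow opcodes in a (pre-decoded) opcode stream as a crude
/// stand-in for cyclomatic complexity.
fn branchiness(opcodes: &[u8]) -> usize {
    opcodes
        .iter()
        .filter(|&&op| {
            matches!(
                op,
                0x03 | // loop
                0x04 | // if
                0x0C | // br
                0x0D | // br_if
                0x0E | // br_table
                0x11   // call_indirect
            )
        })
        .count()
}

/// Oracle-style check: flag bodies whose branch density (branches per 100
/// opcodes) exceeds a calibrated bound.
fn suspiciously_branchy(opcodes: &[u8], max_per_hundred: usize) -> bool {
    !opcodes.is_empty()
        && branchiness(opcodes) * 100 > opcodes.len() * max_per_hundred
}
```

The bound itself would come from the statistics gathered over well-behaved modules, as described in the milestones.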
Deliverable
- 24/7 fuzzing service with the ability to include more oracles.
- Primitive automatic report of oracles violations:
- Email to the team and issue tracker, to begin with. The report should contain the Wasm binary and the configuration used to instantiate/compile with wasmtime.
M2 Milestone - Produce unacceptably large machine code (with respect to the Wasm bytecode)
- Define oracles for unacceptably large machine code
- Take statistical information about well-running Wasm binaries in the FVM.
- This will establish the first boundaries of what acceptable behavior for a Wasm binary execution is.
- Provide solutions if a large machine code is detected. Solutions will be provided as a semantically equivalent Wasm binary that produces smaller machine code.
- For this purpose, wasm-mutate can be used as a reduction tool: given a Wasm binary, it reports a smaller, semantically equivalent binary.
- Define and implement the feedback mechanism.
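The feedback loop above can be sketched as follows: among semantically equivalent variants of a module (e.g., produced by wasm-mutate), keep the one whose compiled machine code is smallest. `compile_size` is a stub standing in for invoking wasmtime/Cranelift and measuring the emitted code, so the sketch stays self-contained.

```rust
/// Stub: a real implementation would compile `wasm` with wasmtime/Cranelift
/// and return the size of the emitted machine code.
fn compile_size(wasm: &[u8]) -> usize {
    wasm.len() * 2 // placeholder relation for illustration only
}

/// Return the variant producing the smallest machine code, assuming all
/// variants are semantically equivalent to the original module.
fn smallest_variant(variants: Vec<Vec<u8>>) -> Option<Vec<u8>> {
    variants.into_iter().min_by_key(|v| compile_size(v))
}
```

The selected variant is exactly the "solution" mentioned above: a semantically equivalent binary that sidesteps the code-size blowup.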
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M3 Milestone - Consume an exorbitant amount of memory or time (on compile) with respect to the size of the Wasm bytecode.
- Check traditional parameters
- Check CPU profiling
- Check memory consumption
- Check execution time (code size is not necessarily representative of this)
- Instrument specific components of wasmtime to analyze their behavior:
- A CVE was triggered by excessive use of the spill/reload component in wasmtime. Components like this should be instrumented to check for potentially bad behaviors.
- Identify and instrument potential components inside wasmtime.
- Use a fork of wasmtime to instrument the components as the target in this oracle testing.
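A dynamic oracle for this milestone can be sketched as timing a compilation step and flagging it when the time exceeds a budget proportional to the input size. The budget would be calibrated from statistics over well-behaved modules; the closure stands in for a real wasmtime compile call.

```rust
use std::time::{Duration, Instant};

/// Run `compile` and return true when it takes longer than
/// `budget_per_byte * wasm_len`, i.e., the compile time is out of
/// proportion with the size of the input Wasm bytecode.
fn compile_time_violation<F: FnOnce()>(
    wasm_len: usize,
    budget_per_byte: Duration,
    compile: F,
) -> bool {
    let start = Instant::now();
    compile();
    start.elapsed() > budget_per_byte * wasm_len as u32
}
```

Memory consumption would be checked analogously, sampling the allocator or process statistics around the same call.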
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M4 Milestone - Produce invalid native code (that can crash).
- This oracle will simply check whether the generated machine code crashes.
- Improve feedback mechanism.
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M5 Milestone - Continuous Fixing and Maintenance
From the information provided from the continuous running of the oracles, we will:
- Fix and report any bugs found in the Cranelift compiler.
- Place structural limits on what we consider to be "valid" Wasm bytecode (e.g., limit deeply recursive structures).
- Finalize our gas model to take pathological Wasm inputs into account.
Roadmap and implementation timeline:
The proposed timeline below would allow for the development of the discussed features with the integration of a review and community discussion phase by the end of May 2023.
Phase | Dates |
---|---|
Dev M1 | October 20 - November 20 (4 weeks) |
Refactoring | November 20 - November 27 (1 week) |
Dev M2 | November 27 - December 20 (3 weeks) |
Dev M3 | Jan 5 - Feb 5 (4 weeks) |
Refactoring | Feb 5 - Feb 14 (1 week) |
Dev M4 | Feb 14 - March 20 (5 weeks) |
Dev M5 | March 20 - May 5 (2 weeks) |
Refactoring | May 5 - May 12 (1 week) |
Feedback will be done through Github issues on the development repository.
Total: 20h × $75 × 4w + 20h × $75 × 1w + 20h × $75 × 3w + 20h × $75 × 4w + 20h × $75 × 1w + 20h × $75 × 5w + 20h × $75 × 2w + 20h × $75 × 1w = 20h × $75 × 21w = $31,500
We plan 420 total hours for the implementation of this proposal. The total budget is calculated assuming a minimum of 20 hours per week. Notice that these 20 hours could be increased; in that case, the project could finish before the planned date, and this would not be reflected in an increase of the initial budget.
Note: The budget could be paid in FIL after approval.
Maintenance and Upgrade Plans
The main idea is to include more oracles over time. During the integration of more oracles, the system should be updated, if needed.
Team
- Javier Cabrera-Arteaga (@jacarte) [email protected], www.jacarte.me
Experience
- Javier has experience in automated testing and in working with wasmtime and Cranelift; as part of Fastly, he developed the wasm-mutate tool. Thus, the team has first-hand experience with how wasmtime/Cranelift work.
I will review this with priority in the next 2-3 days and share it with the FVM core devs as the main users of the tool.
Thanks for the proposal! This is almost exactly what we need.
Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to crash by:
ultra-nit: not all of the cases below cause wasmtime to crash.
Fuzzing
We will use Wasm binaries that we know are valid end-to-end.
It would probably be good to generate potentially invalid wasm modules to fuzz the wasm module parser.
A semantically equivalent program is expected to execute the same result in the FVM when it is executed. If not, then it should be reported as an undesired behavioral binary. In our cases, for example, if the generated machine code by cranelift is larger than expected.
I'm not sure if this is sufficient. The goal is to make sure that "small" Wasm programs can't balloon into larger compiled programs. There may be semantically equivalent classes of "small" Wasm programs that all balloon to larger sizes.
On the other hand, if one of the generated binaries behaves "better", we potentially report it as an "improvement".
Are you looking for missed optimizations?
the configuration/instantiation of wasmtime
That isn't really in scope. We care about the specific parameters we've picked (although maybe extending to simd would make sense).
Milestones
- I missed one case: Fuzzing for inputs that crash cranelift itself.
- Can we swap M3 for M1 or M2? The outputs of M3 to inform the gas model for a release in February. The rest of the milestones are critical for the release after that, but we have some time until then.
Thanks @Stebalien. Iterating over now :)
Thanks a lot for the feedback. I made some changes and left a mark here in this comment.
Thanks for the proposal! This is almost exactly what we need.
Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to crash by:
~~> ultra-nit: not all of the cases below cause wasmtime to crash.~~
Fuzzing
We will use Wasm binaries that we know are valid end-to-end.
~~> It would probably be good to generate potentially invalid wasm modules to fuzz the wasm module parser.~~ wasm-mutate allows both modes. We can create non-semantically equivalent modules.
A semantically equivalent program is expected to execute the same result in the FVM when it is executed. If not, then it should be reported as an undesired behavioral binary. In our cases, for example, if the generated machine code by cranelift is larger than expected.
~~> I'm not sure if this is sufficient. The goal is to make sure that "small" Wasm programs can't balloon into larger compiled programs. There may be semantically equivalent classes of "small" Wasm programs that all balloon to larger sizes.~~ Added a comment, explicitly saying that we should prevent this case.
On the other hand, if one of the generated binaries behaves "better", we potentially report it as an "improvement".
~~> Are you looking for missed optimizations?~~ Added an explicit comment on this, yes, we might potentially discover missed optimizations.
the configuration/instantiation of wasmtime
~~> That isn't really in scope. We care about the specific parameters we've picked (although maybe extending to simd would make sense).~~ Noted, removed from the initial description.
Milestones
> * I missed one case: Fuzzing for inputs that crash cranelift itself. I left it more in the general description. Should we include it as a separate goal?
~~> * Can we swap M3 for M1 or M2? The outputs of M3 to inform the gas model for a release in February. The rest of the milestones are critical for the release after that, but we have some time until then.~~ Swapped places :)
Fuzzing
Are you looking for missed optimizations?
Added an explicit comment on this, yes, we might potentially discover missed optimizations.
Got it. Note, that's not a huge priority. But still something worth keeping an eye out for.
Milestones
I left it more in the general description. Should we include it as a separete goal ?
I'm happy with this just being an overall goal.
Swapped places :)
Thanks!
LGTM!
@Stebalien thanks for your time and review. I think we are happy to go towards approval. CC: @raulk if you want to take a look and share comments.
@ErinOCon @realChainLife this is ok to approve now.
@realChainLife - hey, just wanted to let you know that the FVM team approves M1 - Produce unacceptably slow code (with respect to the gas model) from this scope and that we can pay for it ✅ @Jacarte - we really appreciate all the work that you did on it. Keep them coming 👍
cc: @Stebalien @raulk @maciejwitowski for visibility on this approval
Thank all for the support :)