FVM - Automated Testing, Fuzzing and Optimisation
RFP Proposal: FVM - Automated Testing, Fuzzing and Optimisation
Name of Project: FVM - Automated Testing (Fuzzing for FVM)
Link to RFP: Please link to the RFP that you are submitting a proposal for.
RFP Category: devtools-libraries for FVM
Proposer: @jacarte
Do you agree to open source all work you do on behalf of this RFP and dual-license under MIT, GPL, and APACHE2 licenses?: Yes
Project Description
Summary and outline
keywords: fuzzing, super-diversification, optimization, identification of compiler timebombs, attack vectors, wasmtime.
This project aims to automatically test and benchmark the use of wasmtime and, consequently, the Cranelift compiler. Specifically, we expect to find combinations of Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to misbehave by:
- Producing unacceptably slow code (with respect to the gas model).
- Consuming an exorbitant amount of memory or time (on compile) with respect to the size of the Wasm bytecode.
- Producing unacceptably large machine code (with respect to the Wasm bytecode).
- Producing invalid native code (that can crash).
This proposal aims to extend the current testing strategies of wasmtime with FVM-related checking. We plan to do so by including exhaustive and focused fuzzing techniques in the current testing of the FVM.
Fuzzing
Fuzzing is an automated testing technique to find crashes. The general idea is that, given the input specification, the fuzzer generates random data to pass to the program and crash it. In our case, the inputs for fuzzing the usage of wasmtime in the FVM are the Wasm binary and the configuration/instantiation of the wasmtime engine.
To address fuzzing, one usually follows one of two strategies: 1) generating inputs for the program under test as fast as possible, or 2) generating "smart", structured inputs. The first explores the whole input space as quickly as possible; the second generates structured data targeting specific parts of that space. We will use the second strategy.
To generate the inputs for FVM fuzzing:
- the input Wasm binary: We will use Wasm binaries that we know are valid end-to-end, i.e., binaries that we know parse correctly and run in the FVM. To fuzz the FVM engine and how wasmtime is used, we will generate several Wasm programs out of the valid ones, in two modes: semantically equivalent and non-semantically equivalent. A semantically equivalent variant is expected to produce the same result as the original when executed in the FVM; if it does not, it should be reported as a binary with undesired behavior, for example when the machine code generated by Cranelift is larger than expected. Notice that the goal concerns the size of the machine code itself: a small Wasm program that compiles to larger machine code than the original should not be accepted. On the other hand, if one of the generated binaries behaves "better", we can report it as an "improvement"; this methodology could therefore discover missed optimizations. To generate the semantically equivalent programs, we will use the tool wasm-mutate. One member of this proposal's team worked on this tool for Fastly. It can generate a large number of semantically equivalent programs out of a single binary, making it feasible for our purposes. The non-semantically equivalent variants will be used for traditional fuzzing campaigns against the underlying compiler, to find unexpected behaviors in it.
~~- the configuration/instantiation of wasmtime: The parameters of the instantiation of wasmtime will be randomly generated.~~
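As a sketch of the determinism we need from seed-driven variant generation, the snippet below derives a reproducible "mutation plan" from a fuzzer-provided seed. The PRNG, `MUTATION_COUNT`, and the plan itself are our own illustration; wasm-mutate's real API differs.

```rust
// Number of available mutation kinds (placeholder value for illustration).
const MUTATION_COUNT: u64 = 4;

/// One step of splitmix64: a tiny, self-contained deterministic PRNG.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Derive `n` mutation indices from a seed. The same seed always yields
/// the same plan, so a fuzz case can be reproduced exactly.
fn mutation_plan(seed: u64, n: usize) -> Vec<u64> {
    let mut state = seed;
    (0..n).map(|_| splitmix64(&mut state) % MUTATION_COUNT).collect()
}
```

Feeding the plan back into a mutator such as wasm-mutate would then regenerate the identical variant from the identical (seed module, seed) pair.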
Correctness Oracles explained
We use the term "crash" more generically than its traditional definition. Milestones 1, 2, 3 and 4 instantiate our definition of "crash" as oracles. We plan to create a generic architecture in which different oracles (4 in this proposal) can be defined to stress the usage of wasmtime.
Potential users of the architecture should be able to easily define oracles as fuzz targets. Besides, each oracle should be easy to integrate with libFuzzer. Each fuzz target should be deterministic, i.e., each run with the same parameters and the same random seed should produce the same result/crash.
An oracle simply returns true|false, depending on the "crashing" criterion; for example, whether the size of the generated machine code for platform X is larger than Y. (The return type might change in a future implementation; true|false is the simplest information we can get from an oracle.) Each oracle should be created as a fuzz target.
Oracles 1, 2, 3 and 4 can be categorized as static or dynamic. For example, oracles 1 and 2 do not need to execute the actual Wasm binary, so they can check correctness faster than oracles 3 and 4. We therefore order the oracles to implement by complexity, starting with those that involve only static analysis of the code.
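The oracle architecture described above could be sketched as follows. The trait name, the `FuzzCase` struct, and the example size oracle are hypothetical illustrations, not an existing wasmtime or FVM API.

```rust
/// One fuzz case: the input module and the code the compiler emitted for it.
struct FuzzCase {
    wasm: Vec<u8>,         // input Wasm module
    machine_code: Vec<u8>, // code produced by the compiler under test
}

/// Generic oracle interface: each "crash" criterion is one implementation,
/// and each implementation would be wired up as its own fuzz target.
trait Oracle {
    /// Returns true when this case violates the oracle's criterion.
    fn violated(&self, case: &FuzzCase) -> bool;
}

/// Static oracle in the style of M2: flags machine code that is more than
/// `ratio` times larger than the Wasm input.
struct CodeSizeOracle {
    ratio: usize,
}

impl Oracle for CodeSizeOracle {
    fn violated(&self, case: &FuzzCase) -> bool {
        case.machine_code.len() > case.wasm.len() * self.ratio
    }
}
```

Because each oracle only returns a boolean, new criteria can be added without touching the fuzzing driver.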
Requirements for this implementation
- Access to a dedicated repository in the filecoin git organization.
- Possibility to add the proposed implementation as continuous integration on the repository.
- Being assigned a contact on the FVM team who follows the progress and is available for review/design questions within 24 hours, to ensure proper communication.
- Access/Permission to use the oss-fuzz project, since fuzzing campaigns should run 24/7.
M1 Milestone - Produce unacceptably slow code (with respect to the gas model)
- Define oracles for slow code
- Use of static analyses as proxies for it, for example, cyclomatic complexity of the code or size of the code.
- If the Wasm binary takes inputs: use an approach like Kelinci to fuzz the execution of the binary. To our knowledge, this has not been done before. Approaches like SWAM+AFL provide a minimal MVP of the desired tooling.
- This technique will allow us to find inputs that make the Wasm binary execute slower than usual.
- Implement a sandbox mechanism for the cases where the Wasm binary is actually executed.
- Use "superoptimization" and "reduction" to:
- "Improve" Wasm code, preventing unneeded spending of gas.
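The static-proxy idea above (using cyclomatic complexity or code size as a stand-in for slowness) can be sketched by counting Wasm control-flow opcodes. The opcode values are the real Wasm ones, but treating their raw count as a complexity proxy, and the threshold logic, are our simplification: a real implementation must decode the binary properly, since immediates can contain bytes that look like opcodes.

```rust
/// Count control-flow opcodes in a (pre-decoded) opcode stream as a crude
/// stand-in for cyclomatic complexity.
fn branchiness(opcodes: &[u8]) -> usize {
    opcodes
        .iter()
        .filter(|&&op| {
            matches!(
                op,
                0x03 | // loop
                0x04 | // if
                0x0C | // br
                0x0D | // br_if
                0x0E | // br_table
                0x11   // call_indirect
            )
        })
        .count()
}

/// Oracle-style check: flag bodies whose branch density (branches per 100
/// opcodes) exceeds a calibrated bound.
fn suspiciously_branchy(opcodes: &[u8], max_per_hundred: usize) -> bool {
    !opcodes.is_empty()
        && branchiness(opcodes) * 100 > opcodes.len() * max_per_hundred
}
```

The bound itself would come from the statistics gathered over well-behaved modules, as described in the milestones.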
Deliverable
- 24/7 fuzzing service with the ability to include more oracles.
- Primitive automatic report of oracles violations:
- Email to the team and issue tracker, to begin with. The report should contain the Wasm binary and the configuration used to instantiate/compile with wasmtime.
M2 Milestone - Produce unacceptably large machine code (with respect to the Wasm bytecode)
- Define oracles for unacceptably large machine code
- Take statistical information about well-running Wasm binaries in the FVM.
- This will establish the first boundaries of what acceptable behavior for a Wasm binary execution is.
- Provide solutions if a large machine code is detected. Solutions will be provided as a semantically equivalent Wasm binary that produces smaller machine code.
- For this purpose, wasm-mutate can be used as a reduction tool: given a Wasm binary, it reports a smaller, semantically equivalent binary.
- Define and implement the feedback mechanism.
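The feedback loop above can be sketched as follows: among semantically equivalent variants of a module (e.g., produced by wasm-mutate), keep the one whose compiled machine code is smallest. `compile_size` is a stub standing in for invoking wasmtime/Cranelift and measuring the emitted code, so the sketch stays self-contained.

```rust
/// Stub: a real implementation would compile `wasm` with wasmtime/Cranelift
/// and return the size of the emitted machine code.
fn compile_size(wasm: &[u8]) -> usize {
    wasm.len() * 2 // placeholder relation for illustration only
}

/// Return the variant producing the smallest machine code, assuming all
/// variants are semantically equivalent to the original module.
fn smallest_variant(variants: Vec<Vec<u8>>) -> Option<Vec<u8>> {
    variants.into_iter().min_by_key(|v| compile_size(v))
}
```

The selected variant is exactly the "solution" mentioned above: a semantically equivalent binary that sidesteps the code-size blowup.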
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M3 Milestone - Consume an exorbitant amount of memory or time (on compile) with respect to the size of the Wasm bytecode.
- Check traditional parameters
- Check CPU profiling
- Check memory consumption
- Check execution time (code size is not necessarily representative of this)
- Instrument specific components of wasmtime to analyze their behavior:
- A CVE was triggered by excessive use of the spill/reload component in wasmtime. Components like this should be instrumented to check for potentially bad behaviors.
- Identify and instrument potential components inside wasmtime.
- Use a fork of wasmtime to instrument the components as the target in this oracle testing.
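A dynamic oracle for this milestone can be sketched as timing a compilation step and flagging it when the time exceeds a budget proportional to the input size. The budget would be calibrated from statistics over well-behaved modules; the closure stands in for a real wasmtime compile call.

```rust
use std::time::{Duration, Instant};

/// Run `compile` and return true when it takes longer than
/// `budget_per_byte * wasm_len`, i.e., the compile time is out of
/// proportion with the size of the input Wasm bytecode.
fn compile_time_violation<F: FnOnce()>(
    wasm_len: usize,
    budget_per_byte: Duration,
    compile: F,
) -> bool {
    let start = Instant::now();
    compile();
    start.elapsed() > budget_per_byte * wasm_len as u32
}
```

Memory consumption would be checked analogously, sampling the allocator or process statistics around the same call.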
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M4 Milestone - Produce invalid native code (that can crash).
- This oracle will simply check whether the generated machine code crashes.
- Improve feedback mechanism.
Deliverable
- Inclusion of the strategy in the 24-7 fuzzing tool mentioned in M1.
M5 Milestone - Continuous Fixing and Maintenance
From the information provided from the continuous running of the oracles, we will:
- Fix and report any bugs found in the Cranelift compiler.
- Place structural limits on what we consider to be "valid" Wasm bytecode (e.g., limit deeply recursive structures).
- Finalize our gas model to take pathological Wasm inputs into account.
Roadmap and implementation timeline:
The proposed timeline below would allow for the development of the discussed features with the integration of a review and community discussion phase by the end of May 2023.
Phase | Dates |
---|---|
Dev M1 | October 20 - November 20 (4 weeks) |
Refactoring | November 20 - November 27 (1 week) |
Dev M2 | November 27 - December 20 (3 weeks) |
Dev M3 | Jan 5 - Feb 5 (4 weeks) |
Refactoring | Feb 5 - Feb 14 (1 week) |
Dev M4 | Feb 14 - March 20 (5 weeks) |
Dev M5 | March 20 - May 5 (2 weeks) |
Refactoring | May 5 - May 12 (1 week) |
Feedback will be done through Github issues on the development repository.
Total: 20h × $75 × 4w + 20h × $75 × 1w + 20h × $75 × 3w + 20h × $75 × 4w + 20h × $75 × 1w + 20h × $75 × 5w + 20h × $75 × 2w + 20h × $75 × 1w = 20h × $75 × 21w = $31,500
We plan 420 total hours for the implementation of this proposal. The total budget is calculated assuming a minimum of 20 hours per week. Notice that these 20 hours could be increased; in that case, the project could finish before the planned date, and this would not be reflected in an increase of the initial budget.
Note: The budget could be paid in FIL after approval.
Maintenance and Upgrade Plans
The main idea is to include more oracles over time. During the integration of more oracles, the system should be updated, if needed.
Team
- Javier Cabrera-Arteaga (@jacarte) [email protected], www.jacarte.me
Experience
- Javier has experience in automated testing and in working with wasmtime and Cranelift; as part of Fastly, he developed the wasm-mutate tool. Thus, the team has first-hand experience with how wasmtime/Cranelift work.
I will review this with priority in the next 2-3 days and share it with the FVM core devs as the main users of the tool.
Thanks for the proposal! This is almost exactly what we need.
Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to crash by:
ultra-nit: not all of the cases below cause wasmtime to crash.
Fuzzing
We will use Wasm binaries that we know are valid end-to-end.
It would probably be good to generate potentially invalid wasm modules to fuzz the wasm module parser.
A semantically equivalent program is expected to execute the same result in the FVM when it is executed. If not, then it should be reported as an undesired behavioral binary. In our cases, for example, if the generated machine code by cranelift is larger than expected.
I'm not sure if this is sufficient. The goal is to make sure that "small" Wasm programs can't balloon into larger compiled programs. There may be semantically equivalent classes of "small" Wasm programs that all balloon to larger sizes.
On the other hand, if one of the generated binaries behaves "better", we potentially report it as an "improvement".
Are you looking for missed optimizations?
the configuration/instantiation of wasmtime
That isn't really in scope. We care about the specific parameters we've picked (although maybe extending to simd would make sense).
Milestones
- I missed one case: Fuzzing for inputs that crash cranelift itself.
- Can we swap M3 for M1 or M2? The outputs of M3 to inform the gas model for a release in February. The rest of the milestones are critical for the release after that, but we have some time until then.
Thanks @Stebalien. Iterating over now :)
Thanks a lot for the feedback. I made some changes and left a mark here in this comment.
Thanks for the proposal! This is almost exactly what we need.
Wasm inputs and wasmtime usage parameters that cause wasmtime/Cranelift to crash by:
~~> ultra-nit: not all of the cases below cause wasmtime to crash.~~
Fuzzing
We will use Wasm binaries that we know are valid end-to-end.
~~> It would probably be good to generate potentially invalid wasm modules to fuzz the wasm module parser.~~ wasm-mutate allows both modes. We can create non-semantically equivalent modules.
A semantically equivalent program is expected to execute the same result in the FVM when it is executed. If not, then it should be reported as an undesired behavioral binary. In our cases, for example, if the generated machine code by cranelift is larger than expected.
~~> I'm not sure if this is sufficient. The goal is to make sure that "small" Wasm programs can't balloon into larger compiled programs. There may be semantically equivalent classes of "small" Wasm programs that all balloon to larger sizes.~~ Added a comment, explicitly saying that we should prevent this case.
On the other hand, if one of the generated binaries behaves "better", we potentially report it as an "improvement".
~~> Are you looking for missed optimizations?~~ Added an explicit comment on this, yes, we might potentially discover missed optimizations.
the configuration/instantiation of wasmtime
~~> That isn't really in scope. We care about the specific parameters we've picked (although maybe extending to simd would make sense).~~ Noted, removed from the initial description.
Milestones
> * I missed one case: Fuzzing for inputs that crash cranelift itself. I left it more in the general description. Should we include it as a separate goal?
~~> * Can we swap M3 for M1 or M2? The outputs of M3 to inform the gas model for a release in February. The rest of the milestones are critical for the release after that, but we have some time until then.~~ Swapped places :)
Fuzzing
Are you looking for missed optimizations?
Added an explicit comment on this, yes, we might potentially discover missed optimizations.
Got it. Note, that's not a huge priority. But still something worth keeping an eye out for.
Milestones
I left it more in the general description. Should we include it as a separete goal ?
I'm happy with this just being an overall goal.
Swapped places :)
Thanks!
LGTM!
@Stebalien thanks for your time and review. I think we are happy to go towards approval. CC: @raulk if you want to take a look and share comments.
@ErinOCon @realChainLife this is ok to approve now.
@realChainLife - hey, just wanted to let you know that the FVM team approves M1 - Produce unacceptably slow code (with respect to the gas model) from this scope and that we can pay for it ✅ @Jacarte - we really appreciate all the work that you did on it. Keep them coming 👍
cc: @Stebalien @raulk @maciejwitowski for visibility on this approval
Thank all for the support :)