
[FEATURE] Bring back 'replay' feature

Open g3rda opened this issue 5 months ago • 9 comments

After fuzzing is performed with Schemathesis, a replay feature would be useful to quickly rerun each/some failed test case from the cassette file. This could enable gathering additional information about each failure, such as system logs, network traffic, CPU usage, and other observables.

Currently, the output includes a "reproduce with" command, but it is not well-suited for automation.

Somewhat overlaps with #2999 feature request.

Example use case explained in: RoboCon 2024 - Fuzzing for vulnerabilities in REST APIs

g3rda avatar Aug 08 '25 11:08 g3rda

@g3rda, thank you for opening this!

I would also take a look at how other fuzzers do it. For example, libFuzzer can just run <binary> <crash-file> and it will execute just that single input. Hypothesis outputs an encoded string in a similar fashion. So, usually it only includes one specific crash - would such an approach be helpful?

For example, if Schemathesis creates .schemathesis/crashes and stores artifacts there, then

# Re-run all the crashes
schemathesis replay .schemathesis/crashes
# Re-run a specific crash
schemathesis replay .schemathesis/crashes/filename123

Such files can include multiple calls (for stateful failures) stored in some internal format.

What do you think?

P.S. Originally, I thought that replaying a VCR cassette is something that is not specific to Schemathesis, as the format is well established. I am also not sure if it is sufficient for complete reproduction of test cases.

Stranger6667 avatar Aug 08 '25 11:08 Stranger6667

@Stranger6667 that sounds good! A crashes/ folder would be especially helpful for stateful fuzzing with multiple requests per failure.

Two things to consider:

  1. Crash uniqueness in crashes/. In Schemathesis, the cassette includes all test cases (failures and successes), while the report typically shows a single crash per endpoint. Would the crashes/ folder contain every failing case or only one representative per endpoint? Some requests could look similar, but trigger different errors, so I believe it is better to include all.
  2. Reproducibility outside of Schemathesis (#2999). I'm not personally blocked by this, but it would be valuable to rerun test cases without the need to have Schemathesis installed. But even if Schemathesis doesn’t implement it directly, storing each crash as a self-contained file (one file per crash) would make it straightforward for an external utility to execute them.

I like the previous Schemathesis approach of rerunning a request by its ID from the cassette. That leaves cassette analysis to the user (e.g., filter test cases with status FAILURE), but it's simple enough. At the same time, introducing a crashes/ directory would be a clear structural upgrade that aligns the workflow with other fuzzers.
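
For illustration, here is a minimal sketch of that cassette-analysis step, assuming the VCR-style YAML cassette Schemathesis writes (a top-level http_interactions list whose entries carry id and status fields - names may differ between versions):

import yaml

with open("cassette.yaml") as fd:
    cassette = yaml.safe_load(fd)

# Collect the IDs of interactions recorded as failures
failed_ids = [
    interaction["id"]
    for interaction in cassette.get("http_interactions", [])
    if interaction.get("status") == "FAILURE"
]
# These IDs could then be fed into a replay command or an external runner
print(failed_ids)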

g3rda avatar Aug 09 '25 16:08 g3rda

@g3rda Awesome!

Indeed, as Schemathesis does not have full visibility to distinguish different server errors, it may merge them into a single failure report. There are a few more details on this in #1289 (almost a 4-year-old issue!), but I think the trade-off is the following:

  • Be more complete and store all failed steps in a single crash report. It means a lower signal/noise ratio as it will include failures that happened during shrinking & retrying a failure.
  • Rely more on shrinking and failure deduplication to produce a more concise crash report, at the risk of missing a failure that landed in the same bucket as the reported one but has a different root cause.

I am leaning toward the second option for a few reasons:

  • Failures that happen together are more likely to be reproduced on subsequent runs: once one of them is fixed, Schemathesis replays previously failed examples from the database. This goes in line with Hypothesis' main design trade-off of faster iteration vs. trying to find more bugs (given that finding the next failure appears to have an exponential cost, the choice is justified)
  • Improvements in shrinking & deduplication will make the situation better automatically. While it is harder for black-box fuzzers to do so, I think having a special hook would largely solve it. Such a hook can check Sentry/logs before deciding whether this 5xx has the same root cause as the previously seen ones. HTTP requests from Schemathesis have IDs, so it should be easy to match them. See the example below.
  • Storing all the failures makes it structurally closer to a cassette, which is already implemented. It feels like crash reports should not be positioned as a filtered version of a full cassette. More often than not, crash reports contain a single failed "unit". I like to think about it as a minimal set of actions to reproduce the failure - a single call for unit tests & multiple calls for stateful ones.
import schemathesis

@schemathesis.hook
def deduplicate_server_error(ctx, case, response):
    # "sentry" and "logs" are placeholders for the error tracker / log storage
    # in use; filter them by the ID of the request (maybe tracing will really
    # help here)
    request_id = case.id
    issue = sentry.find_issue(f"request_id:{request_id}")
    related_logs = logs.read(request_id=request_id)
    # Schemathesis will use the returned value as a key for deduplication
    return hash((issue.kind, related_logs.exception_type))

I was also thinking that we can provide an interface to crashes that are stored in the Hypothesis DB. It stores all the choices made during a test and is basically what we need. However, it is bound to the Hypothesis strategies used in the test, and they will be affected by some schema changes (i.e. it is how Hypothesis interprets this buffer). For this reason, I completely agree that such a format should not be tied to the internal representation, and I also see a lot of value in having external tools that can replay it.

Sorry for the long post, I need to think more about it, especially from the point of view of the workflows that users have. And thank you for sharing your thoughts!

Stranger6667 avatar Aug 09 '25 21:08 Stranger6667

I like the idea of checking Sentry/logs, it adds real system feedback into fuzzing. I feel such a hook could improve the accuracy of reported bugs a lot. The trade-off is setup complexity and, I assume, app/middleware changes.

About the first point, we run Schemathesis in CI in a fresh environment every time, with no shared DB. If a failure isn’t fully captured on the first run, it’s gone. Because of that, I mentioned I prefer reporting everything so nothing is missed. One option could be to add an additional flag that decides if everything gets dumped in the crashes/ or not.

Of course, if the hook does a good job shrinking and deduplicating the crashes, it can be enough and more beneficial than just dumping everything for the user to handle.

g3rda avatar Aug 10 '25 10:08 g3rda

I like the idea of checking Sentry/logs, it adds real system feedback into fuzzing. I feel such a hook could improve the accuracy of reported bugs a lot. The trade-off is setup complexity and, I assume, app/middleware changes.

In general, such an addition will move Schemathesis to the grey-box fuzzers class, and I think it will give many more opportunities to improve data generation too! I'll think about the design for this hook in the next release feature planning - maybe some well-established formats can be included in Schemathesis itself (e.g., communicating with Sentry or other error tracking software)

About the first point, we run Schemathesis in CI in a fresh environment every time, with no shared DB. If a failure isn’t fully captured on the first run, it’s gone. Because of that, I mentioned I prefer reporting everything so nothing is missed. One option could be to add an additional flag that decides if everything gets dumped in the crashes/ or not.

Thanks for sharing more context about your workflow! A config option sounds reasonable to me. I am also wondering whether it would be beneficial for debugging to structure the crash info as a failure + related context (a rough sketch follows the list):

  • minimized failure. Use case - fast debugging, relying on Schemathesis' test case minimization capabilities, i.e., avoiding large and unrelated request parts.
  • "larger" versions of this failure that happened during shrinking. Use case - verify that there were no other failures that could have been missed.
  • API responses.
  • logs/sentry errors, collected via that hook.
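
Purely as an illustration, such a structured crash file could look roughly like this (none of the field names below are a committed format, and the context part assumes the deduplication hook is configured):

crash_report = {
    "minimized": [
        # the shrunk reproduction - a single call for unit tests,
        # multiple calls for stateful ones
        {"method": "POST", "path": "/users", "body": {"name": ""}},
    ],
    "shrinking_history": [
        # larger variants observed while minimizing, kept so that
        # failures seen along the way are not lost
        {"method": "POST", "path": "/users", "body": {"name": "", "role": "..."}},
    ],
    "responses": [
        {"status_code": 500, "body": "Internal Server Error"},
    ],
    "context": {
        # collected via the deduplication hook, if configured
        "sentry_issue": "PROJECT-123",
        "log_excerpt": "...",
    },
}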

Then some of this data can be used for replaying + in a visual report (I want to build a proper visualiser for the data Schemathesis produces).

From the CLI point of view, replaying could be "focused" vs. "full" (names are subject to bikeshedding):

# Runs only the minimized version
schemathesis replay /path
# Runs the whole sequence of API calls
schemathesis replay --full /path

It would also be nice to make the replayer pluggable (there is already a way to extend the CLI) and to implement all the report serialization / deserialization logic separately. One of the goals would be to distribute just a single binary, to avoid installing Schemathesis only to replay / display the report.

These are more like random thoughts, but I'd like to connect them with real workflows such as yours, so if you have any feedback on those ideas or things that are missing in Schemathesis, I'd be happy to learn more about them! :)

Stranger6667 avatar Aug 10 '25 13:08 Stranger6667

# Runs only the minimized version
schemathesis replay /path
# Runs the whole sequence of API calls
schemathesis replay --full /path

By “the whole sequence of API calls” do you mean replaying the full pre-shrinking sequence of requests? If so, it becomes difficult to collect per-request logs, since there’s no way to run a single request in isolation -- which was the original goal of this feature request. (Here, “request” == “test case”: in stateful fuzzing, it’s a sequence of requests; in stateless fuzzing, it’s a single request.)

Similarly, if all pre-shrinking requests are always re-run together with the final request, some issues may be masked because they execute back-to-back.

Would it make sense to add more granularity to the command? For example: schemathesis replay /path --full --index 0. Does that feel too complex?
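
For concreteness, the loop I have in mind looks roughly like this (the replay flags just mirror the examples above and are not an existing interface; the crash path, request count, and log command are placeholders):

import subprocess

CRASH_PATH = "/path/to/crash"   # placeholder crash file
NUMBER_OF_REQUESTS = 3          # known after parsing the crash file

for index in range(NUMBER_OF_REQUESTS):
    # Replay a single request in isolation (hypothetical CLI shape)
    subprocess.run(
        ["schemathesis", "replay", CRASH_PATH, "--full", f"--index={index}"],
        check=False,
    )
    # Collect per-request observables, e.g. the last lines of the app log
    logs = subprocess.run(
        ["journalctl", "-u", "my-api.service", "-n", "50"],
        capture_output=True,
        text=True,
    )
    print(f"request #{index}:\n{logs.stdout}")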

g3rda avatar Aug 13 '25 09:08 g3rda

By “the whole sequence of API calls” do you mean replaying the full pre-shrinking sequence of requests?

Yes, as a way to see everything that happened.

If so, it becomes difficult to collect per-request logs, since there’s no way to run a single request in isolation -- which was the original goal of this feature request. (Here, “request” == “test case”: in stateful fuzzing, it’s a sequence of requests; in stateless fuzzing, it’s a single request.)

You are right! Perhaps I've shared that part of the API design without putting it in the big picture and missed a few things. But it is definitely not set in stone, and I believe there will be a few more iterations, and the final result will cover the use cases you mentioned.

Would it make sense to add more granularity to the command? For example: schemathesis replay /path --full --index 0. Does that feel too complex?

Absolutely! What do you think about adding filters like this?

# Runs the whole sequence of API calls
schemathesis replay /path

# Select a subset of requests
schemathesis replay /path --include-by=<ID / MINIMIZED_ONLY / SEQUENTIAL_INDEX>

Maybe there could be mutually exclusive filters:

  • --include-minimized-only
  • --include-by-id=<ID>
  • --include-by-sequential-id=<SEQUENTIAL_INDEX>

I'll research how other tooling does it and also play with the naming.

By the way - what is the workflow with sequential IDs? The CLI outputs short codes, and I thought that they could be the primary way to identify a request.

Stranger6667 avatar Aug 14 '25 23:08 Stranger6667

In any event, I plan to finish a few things for the 4.1 release and will then work on this issue with more focus (sometime next week).

Stranger6667 avatar Aug 14 '25 23:08 Stranger6667

I think the mutually exclusive filters you suggested would cover my use case really well - they’re intuitive, easy to work with, and extend previous replay functionality without taking anything away.

--include-by-id provides the main functionality I need, and it could likely be enough for my workflow. Using sequential IDs could simplify it a bit by allowing direct replay from the crashes/ folder, without needing to parse the cassette/crash file first to extract the IDs that need to be rerun. That said, even with sequential IDs, we'd still need to know how many test cases there are, so some parsing will be needed.

Now I am also thinking that --include-by-id could be used for replaying individual requests, while --include-by-sequential-id could be used for the whole testcase (multiple requests in case of stateful fuzzing). My assumption is that stateful test cases consist of multiple requests (e.g., R1, R2, R3), and these are replayed in full sequence each time, not just repeating the final request (i.e., R1, R2, R3, R3, R3). So in this case, it could be useful to replay the sequence in full. I’m not entirely sure if this thinking is accurate, feel free to correct me if not.

g3rda avatar Aug 25 '25 09:08 g3rda