
invitation to use gosimnet: network fault-injection simulation for Go

Open glycerine opened this issue 6 months ago • 11 comments

What would you like to be added?

With the new Go testing/synctest facility[1] (experimental in go1.24.3, to be released in go1.25), new vistas for strenuous and more reproducible testing of distributed Go systems become possible.

With synctest in mind, I have written a network simulator for Go which can inject network partitions, asymmetric network card faults, and network faults that behave probabilistically rather than simply dropping all messages. It can be found here:

https://github.com/glycerine/gosimnet

Package docs: https://pkg.go.dev/github.com/glycerine/gosimnet
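To make the fault model concrete, here is a self-contained toy sketch in the same spirit. None of the names below are gosimnet's actual API (see the package docs above for the real interface); the sketch just illustrates symmetric partitions, an asymmetric one-way cut, and probabilistic message drops driven by a seeded RNG so a failing run can be replayed.

```go
package faultsketch

import (
	"errors"
	"math/rand"
)

// faultyNet is a hypothetical in-memory "network" between named hosts.
// It is illustrative only, not gosimnet's implementation.
type faultyNet struct {
	cut      map[[2]string]bool  // directed links that drop everything
	dropProb float64             // probability of dropping any message
	inbox    map[string][]string // delivered messages, keyed by host
	rng      *rand.Rand          // seeded for reproducible fault schedules
}

func newFaultyNet(seed int64, dropProb float64) *faultyNet {
	return &faultyNet{
		cut:      make(map[[2]string]bool),
		dropProb: dropProb,
		inbox:    make(map[string][]string),
		rng:      rand.New(rand.NewSource(seed)),
	}
}

// Partition cuts traffic in both directions between two hosts.
func (n *faultyNet) Partition(a, b string) {
	n.cut[[2]string{a, b}] = true
	n.cut[[2]string{b, a}] = true
}

// CutOneWay models an asymmetric NIC fault: a can no longer reach b,
// but b can still reach a.
func (n *faultyNet) CutOneWay(a, b string) {
	n.cut[[2]string{a, b}] = true
}

// Send delivers msg from src to dst unless an injected fault swallows it.
func (n *faultyNet) Send(src, dst, msg string) error {
	if n.cut[[2]string{src, dst}] {
		return errors.New("no route: hosts are partitioned")
	}
	if n.rng.Float64() < n.dropProb {
		return nil // silently dropped, like a lossy link
	}
	n.inbox[dst] = append(n.inbox[dst], msg)
	return nil
}
```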

I saw talks in the recent past[2][3] where etcd devs seemed to wish for better testing capabilities for etcd. Therefore I invite etcd developers and any other interested parties to make use of gosimnet in testing. Gosimnet is useful for testing any distributed system written in Go.

[1] https://github.com/golang/go/blob/934d5f2cf703c6aad9f0ce6a73a3922d1af83049/src/testing/synctest/synctest.go#L5

[2] "Lessons Learned From Etcd the Data Inconsistency Issues - Marek Siarkowicz & Benjamin Wang" https://youtu.be/_nSJ_vozyqk?t=1387

[3] "On the Hunt for Etcd Data Inconsistencies - Marek Siarkowicz, Google" https://youtu.be/IIMs0EjQZHg?t=187

Why is this needed?

See the talks in [2] and [3] above.

Marek Siarkowicz suggested there that etcd currently lacks the capability to do extensive fault-injection simulations.

glycerine avatar Jun 13 '25 04:06 glycerine

I would add that I have been using the rr (record and replay) debugger

https://github.com/rr-debugger/rr

to record tests that find bugs and then replay them exactly.

The combination of rr and gosimnet is incredibly powerful for finding and fixing thread-interleaving bugs, especially when using rr record -h to invoke chaos mode.

glycerine avatar Jun 13 '25 04:06 glycerine

Those seem like great tools; however, in the meantime etcd testing has been integrated with the Antithesis platform. More details in https://github.com/etcd-io/etcd/issues/19299. It provides a solution for both problems you mentioned: deterministic simulation testing to reproduce bugs, and fault injection to simulate network failures. cc @marcus-hodgson-antithesis

The engagement with Antithesis is sponsored by CNCF for 3 months, but I don't know whether it will be renewed in the future. If it isn't and we are no longer able to use the Antithesis platform, we will definitely look into the tools you listed.

serathius avatar Jun 13 '25 07:06 serathius

Hi Marek, I'm so jealous you have access to Antithesis!

My recent work has largely been about how to do similar things, but without access to their expensive proprietary tech.

...which started because I am implementing Raft... and now Protocol Aware Recovery (because otherwise Raft can globally lose data from a single disk fault...) and trying to figure out how to prove various optimizations correct, and to test the actual implementation. Ivy seems like a very nice tool in the former direction, just by the by...

Anyway, I'm very much inspired by Will Wilson and Joran Greef's talks...

What I've found, though, is that without Antithesis, rr + chaos is the next best thing, and it can be complementary to DST-like approaches.

The Go runtime is always going to give you a non-deterministic run, but with rr you can still record every run, and simply discard those that found nothing.

rr's -h chaos mode in particular is designed to provoke thread interleavings that I doubt Antithesis will find very quickly, if at all.[1]

You can read more about it here:

https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

https://robert.ocallahan.org/2016/02/deeper-into-chaos.html

[1] As evidence, did Antithesis find this issue that rr chaos mode immediately found in the Go runtime? https://github.com/golang/go/issues/74019 (you have to be using -race)

glycerine avatar Jun 13 '25 07:06 glycerine

Heh, seems like everyone is implementing Raft these days :P Sad to say, but while everyone is implementing Raft from scratch, it's very hard to maintain, even for the etcd project. Unfortunately we no longer have active maintainers working on https://github.com/etcd-io/raft. Maybe it will interest you.

serathius avatar Jun 13 '25 07:06 serathius

Raft is certainly a rite of passage.

I gave up using etcd myself a decade ago, after fixing a few bugs in it, and then encountering this attitude that readily sacrificed safety for speed:

https://github.com/etcd-io/etcd/issues/741#issuecomment-46387760

Moreover, I'm not a fan of unproven distributed protocols, since they invariably contain bugs.

And as far as I can tell, etcd's Raft added a Pre-vote state which has no underlying proof.

So I consider that an unrepairable flaw; better to start from scratch as I'm doing.

Don't get me wrong, Pre-vote is an essential thing in Raft, but it should be applied using a Lamport-style filtration argument, as an independent layer on top of the proven protocol. When the underlying algorithm is safe against timing, you can apply optimizations which only affect timing without fear. If you add a whole state though, you must redo the proof, or you are building on garbage. I don't see where they redid the proof... which took James Wilcox's Verdi project at UW 50K lines of Coq...

Moreover, the PAR paper makes it clear that Raft without the new leader checking the rest of the cluster for storage faults is just fundamentally broken at the algorithm level, because Raft has no storage fault model. I would argue that until you can prevent Figure 3.7 in the Raft dissertation from happening, it applies just as much to overwriting a majority of non-faulty entries with a corrupted one.

it's very hard to maintain...

Once you have a proven algorithm, I don't see why it should be hard to maintain such projects now that we have synctest and gosimnet. I wrote a nice bunch of tests for partition tolerance, leader election, etc... (heh, cannot resist a good pun). They run very fast because of synctest's fake clock.
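For anyone who has not tried synctest yet, here is a minimal sketch of the fake clock at work, assuming go1.24 with GOEXPERIMENT=synctest and the experimental synctest.Run API (go1.25 exposes a similar synctest.Test entry point). A simulated half-second election timer fires without any real time passing:

```go
package faultsketch

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestElectionTimerOnFakeClock(t *testing.T) {
	synctest.Run(func() {
		fired := false
		timer := time.AfterFunc(500*time.Millisecond, func() { fired = true })
		defer timer.Stop()

		// Inside the bubble, the clock jumps forward as soon as every
		// goroutine is durably blocked, so no real half-second elapses.
		time.Sleep(time.Second)
		synctest.Wait() // wait for the timer's goroutine to finish

		if !fired {
			t.Fatal("expected the election timer to fire on the fake clock")
		}
	})
}
```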

glycerine avatar Jun 13 '25 08:06 glycerine

I gave up using etcd myself a decade ago, after fixing a few bugs in it, and then encountering this attitude that readily sacrificed safety for speed:

https://github.com/etcd-io/etcd/issues/741#issuecomment-46387760

Sad to hear it. That's an old issue; since then the etcd project has committed to safety and continuously tests linearizability via robustness testing and Antithesis testing.

And as far as I can tell, etcd's Raft added a Pre-vote state which has no underlying proof.

So I consider that an unrepairable flaw; better to start from scratch as I'm doing.

Don't get me wrong, Pre-vote is an essential thing in Raft, but it should be applied using a Lamport-style filtration argument, as an independent layer on top of the proven protocol. When the underlying algorithm is safe against timing, you can apply optimizations which only affect timing without fear.

I'm not an expert on this, but thanks to @joshuazh-x's work we have developed a TLA+ model for etcd raft. Quickly reading through it, it doesn't include Pre-vote. Maybe that's something we should improve. Happy to get your feedback on what we could do better.

Moreover, the PAR paper makes it clear that Raft without the new leader checking the rest of the cluster for storage faults is just fundamentally broken at the algorithm level, because Raft has no storage fault model.

Right, this is something that has been on my mind since writing https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md. I would like etcd to be able to immediately detect and recover from data corruption. We have been discussing one option in https://github.com/etcd-io/etcd/issues/13839. I was not aware of the PAR paper; definitely something I need to read. Happy to get your thoughts on this area too.

I don't see why it should be hard to maintain such projects now that we have synctest and gosimnet. I wrote a nice bunch of tests for partition tolerance, leader election, etc... (heh, cannot resist a good pun).

That's the reason I wrote the robustness tests: to make sure we can easily review and catch human errors during review, making etcd more accessible to more contributors and reducing maintenance cost. However, this leads to the oracle's dilemma: who will maintain the tests? They might be even harder to maintain. While I can do it for now, I don't know if I can do that forever, and growing people to help me has turned out to be a real challenge.

What I'm alluding to is the fact that for an open source project like etcd to survive, it needs to have a plan for how it will handle maintainers moving on to other things.

serathius avatar Jun 13 '25 09:06 serathius

I was not aware of the PAR paper; definitely something I need to read.

https://www.usenix.org/conference/fast18/presentation/alagappan

A discussion with the authors, hosted by TigerBeetle CEO Joran Greef, since he's a big fan: https://www.youtube.com/watch?v=1rArr0yb3zg&t=2060s

I'm not an expert on this, but thanks to @joshuazh-x's work we have developed a TLA+ model for etcd raft.

So there is model checking, which usually can only run for a too-small-to-be-realistic number of rounds, and then there is proof; the TLA+ language is used in both. Ongaro wrote a TLA+ spec for Raft, but it proved impossible to model check, and it was very far from being provable. The machine-checked proof came from the Verdi team many years later, mentioned above, and crucially did not include the membership change protocols. The recent Recraft paper is very interesting here though, as it is based on proof.

I also mentioned Ivy, which I think has great potential for proof here (https://kenmcmil.github.io/ivy/), but I'm still getting familiar with it so I can't comment more. However, the fact that they can prove Vertical Paxos correct in 2-5 seconds gives me some hope. See https://www.youtube.com/watch?v=_suyrSMJeCo (and https://www.youtube.com/watch?v=CE1mcjqea0A)

glycerine avatar Jun 13 '25 09:06 glycerine

@glycerine Antithesis can definitely find concurrency & race condition bugs! We use thread pausing on instrumented code (documented here if you are curious, and we have a bunch of other faults documented here)

@marcus-hodgson-antithesis

Nice! I'm very bullish on Antithesis, but I wonder if it overlaps with rr's chaos mode at all.

I think of them as doing very different Markov Chain Monte Carlo explorations of the infinite execution state space. Chaos mode is very, very different from a normal execution.

The docs section on thread pausing you pointed out sounds much simpler than the chaos mode described here: https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html https://robert.ocallahan.org/2016/02/deeper-into-chaos.html

If you are up for a mini-challenge, to see whether Antithesis can detect the same or a similar issue in the Go runtime (under go test -race) that rr record -h (chaos mode) did, here is a Dockerfile to reproduce it, and the crash that rr sees is in the log file. A full replayable trace is in the repo as well.

https://github.com/glycerine/can-antithesis-find-issue74019

edit: If you need to justify the machine time to your boss, it is hard to imagine a stronger public benefit than finding/reproducing bugs in the Go runtime, especially when etcd itself depends on that same runtime. Plus, on the marketing side, it would help us see whether Antithesis really could replace uses of rr, or whether the two tools will remain complementary.

It's probably wishful thinking to imagine this will really happen, despite my exceedingly rational arguments above... :) but since it is off limits to us mere mortals, we'd all love to see Antithesis proudly strut its stuff!

glycerine avatar Jun 13 '25 23:06 glycerine

I also mentioned Ivy, which I think has great potential for proof here: https://kenmcmil.github.io/ivy/

If a bit of social proof would help to spark interest in Ivy, I note that Apple hardware engineers are using Ivy to verify memory subsystems. This has prompted follow-on research by Ken McMillan (from last year, 2024): https://link.springer.com/chapter/10.1007/978-3-031-65627-9_13

To quote the abstract and motivation from the second paragraph:

Abstract

While the problem of mechanized proof of liveness of reactive programs has been studied for decades, there is currently no method of proving liveness that is conceptually simple to apply in practice to realistic problems, can be scaled to large problems without modular decomposition, and does not fail unpredictably due to the use of fragile heuristics. We introduce a method of liveness proof by relational rankings, implement it, and show that it meets these criteria in a realistic industrial case study involving a model of the memory subsystem in a CPU.

2nd paragraph:

The inspiration to study this problem comes from an effort to prove liveness of models of memory systems that have been developed by hardware engineers at Apple, Inc. The engineers use a tool and language called Ivy [18] to prove safety properties of memory subsystem models. These properties guarantee the consistency of memory operations from the point of view of the processor cores. Liveness of these models is considered important, in part to ensure the liveness of the underlying hardware implementation, but also to guarantee that the consistency proofs are not vacuous, owing to oversights in the models that might result in deadlock. Unfortunately, proving liveness using an existing approach implemented in Ivy was found by the engineers to be excessively difficult.

The rest of the paper is about how they solved that problem.

Update: I was able to repeat the verification in Ivy of the provided example of Apple's (generic; MIT licensed) memory sub-system specification on a single core of my 5+ year old hardware in 135 minutes (the paper claims 115 minutes on a surely more modern laptop). This is the spec, and that run proved liveness, not just safety requirements, of this unbounded/infinite state system. (This is a proof, vastly stronger than any finite state model check.)

https://github.com/kenmcmil/ivy/blob/master/doc/examples/apple/ord_live.ivy

By inspection, I suspect that this hardware system is far more elaborate than Raft will require, giving me more hope that a Raft + some optimizations implementation can be proven both safe and live in a reasonable amount of time.

Also, Ivy is implemented in a lot of Python, which probably accounts for the use of only a single core. With a Go version and a multicore system, we could probably verify much faster. There were 16 isolates verified in the Apple spec, and if I'm understanding it correctly, each is an independent (parallelizable) task. Additionally, Z3 has likely gotten faster (now at v4.15.2) since the version that Ivy hardcodes to prevent API breaks (v4.7.1, from 2018).

glycerine avatar Jun 15 '25 19:06 glycerine

Nice:

Ivy is able to verify a basic Raft spec (courtesy of James Wilcox, the first author of the Verdi project that did the first formal proof of Raft in Coq)...

... in just 71 seconds on my laptop. Woot!

Spec: https://github.com/glycerine/ivy-jea/blob/master/jea/raft.ivy

This repo has my recent fixes to Ivy to get the GUI (a Tk app over X11) part working again:

https://github.com/glycerine/ivy-jea

I've attached a Docker image and setup/run-script to the repo to make it easier to get started:

https://github.com/glycerine/ivy-jea/pkgs/container/ub24 https://github.com/glycerine/ivy-jea/blob/master/dr.ub.sh.txt

(Though with the Docker image, you'll need to apply this fix from today too: https://github.com/kenmcmil/ivy/issues/92 ; https://github.com/glycerine/ivy-jea/commit/73cda1a6a140690a9fd7962ac9df246e66c0e16a ; I would just pull from the above ivy-jea GitHub repo from inside the container, from /home/jaten/pyivy/ivy, to get the latest.)

glycerine avatar Jun 17 '25 19:06 glycerine

@aphyr @jepsen-io: here are the links to Ivy I mentioned. The ivy-jea fork and Ubuntu image have my fixes to make Ivy work again.

glycerine avatar Jun 19 '25 10:06 glycerine

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 19 '25 00:08 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 19 '25 00:10 github-actions[bot]