Draft chaos engineering definition/whitepaper

Open caniszczyk opened this issue 6 years ago • 28 comments

caniszczyk avatar Apr 26 '18 20:04 caniszczyk

Keen to help with that !

seeker89 avatar May 10 '18 16:05 seeker89

Happy to support the effort too.

Lawouach avatar May 13 '18 17:05 Lawouach

Me too :)

3rdman avatar May 18 '18 01:05 3rdman

Ping

mattforni avatar May 18 '18 20:05 mattforni

The best bet currently is to contribute to the proposal here, which sketches an outline of what can become a whitepaper/landscape:

https://docs.google.com/document/d/1BeeJZIyReCFNLJQrZjwA4KMlUJelxFFEv3IwED16lHE/edit?ts=5ace0eab#heading=h.k8f5ndt8affu

Here are my ideas for a draft outline. I would love feedback since I'm still new to this space:

  • What is chaos engineering?
  • A history of chaos engineering
  • Chaos Engineering Use Cases
  • Planning Experiments
  • Chaos Engineering in Cloud Native Systems
  • Chaos Culture: Planning Chaos/GameDays
  • Conclusion

caniszczyk avatar May 18 '18 22:05 caniszczyk

ping

ramin avatar May 21 '18 20:05 ramin

@caniszczyk That document is likely getting hard to navigate and make sense of. I'm happy to move it to this repo so we can start using GH issues instead.

GH is not a document-collaboration tool, but if we clearly mark each section in the proposal, we could simply refer to each section from GH issues for discussions.

Lawouach avatar May 21 '18 20:05 Lawouach

+1 to moving to GitHub

-- Mikolaj Pawlikowski

seeker89 avatar May 21 '18 20:05 seeker89

Regarding the outline @caniszczyk, it's a good starting point. I might add a section regarding chaos engineering in relation to other disciplines/practices: security, CI/CD... Basically, where does CE fit in the toolchain? But maybe this is covered by "CE in Cloud Native Systems"?

Lawouach avatar May 21 '18 20:05 Lawouach

I agree with @Lawouach and @seeker89, the Google doc got crowded fast :)

We could just do a bit of Markdown on individual sections and then generate something, e.g. a PDF, when needed.

3rdman avatar May 21 '18 22:05 3rdman

At everyone's suggestion, I moved what we had in the gdoc here:

https://github.com/chaoseng/wg-chaoseng/blob/master/WHITEPAPER.md

It needs a lot of work but now we can start iterating via pull requests.

cc: @chaoseng/maintainers

caniszczyk avatar May 22 '18 14:05 caniszczyk

@caniszczyk +1

joaoasrosa avatar May 22 '18 15:05 joaoasrosa

Hey all,

Here is a strawman structure for the whitepaper. Hopefully it will help the discussion :)


Chaos Engineering Whitepaper v0.1

What is Chaos Engineering?

Short History

Principles

Objective: Harness and Improve System Resilience

Benefits for Cloud Native Systems

Relation to Existing Software and Operational Practices

Use Cases

Practicing Chaos Engineering

Chaos Engineering Flow

Define a Baseline

State the Hypothesis to Confirm/Refute

Determine a Perturbation to Perform

Chaos Engineering Perturbations

Degrade Network Conditions

Vary Computing Resources

Stress to the Limits

Simulate Data Loss

Change ACLs Permissions

Provoke a Security Breach

Chaos Engineering Automation

Continuous Chaos Engineering

Chaos Engineering Reporting

Report Findings

Lawouach avatar Jun 22 '18 14:06 Lawouach

Hi @Lawouach

Thank you for taking the time to organize things a bit. Where does the landscape fit in this structure? Can it be put in another document?

veggiemonk avatar Jun 25 '18 09:06 veggiemonk

Hey @veggiemonk. Thanks. It looks like nothing when I look at it now, but finding the right phrasing took me half a day the other day. Formalizing is hard :D

It depends on how we organize the whitepaper. Either we list a few examples in each section (for instance, under "Degrade Network Conditions" we could mention Gremlin, Pumba, Muxy...) so that there is locality between the topic and potential vendors.

Or we continue with a long list of vendors at the bottom of the paper.

Lawouach avatar Jun 25 '18 09:06 Lawouach

Hi @Lawouach, I totally understand that's hard work! 🙏

For now, the landscape doesn't need to be too formal because the list isn't actually that long. As a suggestion, let's keep it at the end.

What do you think?

I don't know if the whitepaper is the right place for this, but what about renaming the section "Chaos Engineering Flow" to "How to start Chaos Engineering"? As a first step, we could add "Set up monitoring". As a second step, we could add "Warn users/developers about it".

It seems pretty basic, but without that it can be hard or dangerous to do CE. Maybe it is too simple for this paper.

What are your views on that?

veggiemonk avatar Jun 25 '18 10:06 veggiemonk

Interesting, I like the guidelines approach indeed.

There is certainly room for a section around the theory, as per the principles. But a "how to get started" one would be very welcome indeed!

Lawouach avatar Jun 25 '18 10:06 Lawouach

How to get started + Links to product landscape and getting started points there would be awesome

russmiles avatar Jun 25 '18 10:06 russmiles

Ok let's see what kind of resources we can gather in there.

veggiemonk avatar Jun 25 '18 12:06 veggiemonk

A section of case studies and papers from the field was also something we discussed in the last meeting. Maybe as a final section on 'Further Reading'?

@Lawouach thank you so much for getting this started!

What do people think about starting a branch with @Lawouach's structure as a README? We could open PRs against it with sections filled in, treat a merged PR as approval, go deeper on specific content for each section, and link to each PR in this issue.

ramin avatar Jun 25 '18 15:06 ramin

I think I will refine it, taking into account the comments that were made. Give me a moment :)

Lawouach avatar Jun 25 '18 15:06 Lawouach


Chaos Engineering Whitepaper v0.1

What is Chaos Engineering?

Short History

Principles

Discuss the steady state, experiment, etc. Just to set the "theory"?

Why Practice Chaos Engineering?

Harness and Improve System Resilience

If Chaos Engineering isn't the goal per se, what is? Resiliency? Reliability?

Benefits for Cloud Native Systems

Software and Operational Practices In Production

A clear indication that whereas testing and CI/CD are mostly upstream practices, Chaos Engineering is very much downstream and acts against a live system. Would that make sense?

Use Cases

The current use cases are a good starting point, but should we detail them, with a depth similar to what we can find in the serverless whitepaper?

Practicing Chaos Engineering

Getting Started With Chaos Engineering

Is my system ready to endure Chaos Engineering?

Should we hint at what minimal level you need to be before getting started? I mean, what if your system is barely resilient as it is?

Do I need to get started in production?

While we may want this, starting in prod may not fit "getting started" scenarios.

Communicate with the Organization

This is where we need to continue the discussion and figure out how far we want/can go with the patterns.

Should we talk about gamedays, for instance? Observability?

The following phases may or may not be useful. I think it would be valuable if we could describe what it means to deal with chaos in those various cases, but is it the right place?

Chaos Engineering Perturbations

Degrade Network Conditions

Vary Computing Resources

Stress to the Limits

Simulate Data Loss

Change ACLs Permissions

Provoke a Security Breach

Assume application fails to restart

Chaos Engineering Automation

Continuous Chaos Engineering

Chaos Engineering Reporting

Report Findings

Landscape


Lawouach avatar Jun 25 '18 15:06 Lawouach

That looks good! Thanks @Lawouach for the hard work!

I think a PR is in order for us to move forward.

veggiemonk avatar Jun 26 '18 06:06 veggiemonk

@chaoseng/maintainers (CC @caniszczyk) so just out of curiosity what is the plan on iterating on this document now? I had a few minutes this afternoon and wanted to add some of my thoughts here, but it's a bit difficult to know where to start.

I'm happy to just take some time, make some edits and submit a PR for consideration, but didn't want to ruffle any feathers or step on any toes. Would it be beneficial to assign topics to individuals to comment on? Just thinking out loud here.

mattforni avatar Jun 27 '18 20:06 mattforni

Hey @mattforni, I'd say it's totally fine to offer PRs to the document?

On my side, I used this issue because it felt quicker for getting started, but I wonder whether that would scale to a whole document :D

Lawouach avatar Jun 28 '18 08:06 Lawouach

PRs are the way to move forward! ⏩

veggiemonk avatar Jun 28 '18 13:06 veggiemonk

PRs please :)

-- Cheers,

Chris Aniszczyk http://aniszczyk.org +1 512 961 6719

caniszczyk avatar Jun 28 '18 16:06 caniszczyk

Started writing down my train of thought: https://github.com/chaoseng/wg-chaoseng/pull/41

Lawouach avatar Jul 06 '18 11:07 Lawouach