toxiproxy icon indicating copy to clipboard operation
toxiproxy copied to clipboard

how to specify multiple attributes? incorrect handling of errors

Open Dieterbe opened this issue 7 years ago • 6 comments

problem 1) it's unclear how to specify multiple attributes when adding a toxic. the example in the readme only uses one attribute (https://github.com/Shopify/toxiproxy#cli-example) and the cli help docs don't specify how to specify multiple attributes (see help msg below). as a user I'm guessing either specify -a key=val,key2=val2 or -a key=val -a key2=val2

cli toxic add                                                                                                                                                          
NAME:
   toxiproxy-cli toxic add - add a new toxic

USAGE:
   toxiproxy-cli toxic add [command options] <proxyName>

OPTIONS:
   --toxicName value, -n value    name of the toxic
   --type value, -t value         type of toxic
   --toxicity value, --tox value  toxicity of toxic
   --attribute value, -a value    toxic attribute in key=value format
   --upstream, -u                 add toxic to upstream
   --downstream, -d               add toxic to downstream

Proxy name is required as the first argument.

problem 2) either approach i guessed above seems to work sort of, but things aren't exactly working.

approach 1 - comma separated items - seems to be the right way, see this run in the latest docker container...

$ toxiproxy-cli create cassandra -l 0.0.0.0:9042 -u cassandra:9042
Created new proxy cassandra
$ toxiproxy-cli list
root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# toxiproxy-cli list
Listen          Upstream        Name    Enabled Toxics
======================================================================
[::]:9042       cassandra:9042  cassandra       true    None

Hint: inspect toxics with `toxiproxy-cli inspect <proxyName>`
$ toxiproxy-cli toxic add -t latency -a latency=1000,jitter=500 cassandra                                                                 
Added downstream latency toxic 'latency_downstream' on proxy 'cassandra'
$ toxiproxy-cli list                                                     
Listen          Upstream        Name    Enabled Toxics
======================================================================
[::]:9042       cassandra:9042  cassandra       true    1

Hint: inspect toxics with `toxiproxy-cli inspect <proxyName>`
$ toxiproxy-cli inspect cassandra
Name: cassandra Listen: [::]:9042       Upstream: cassandra:9042
======================================================================
Upstream toxics:
Proxy has no Upstream toxics enabled.

Downstream toxics:
latency_downstream: type=latency stream=downstream toxicity=1.00 attributes=[ jitter=500 latency=1000 ]

Hint: add a toxic with `toxiproxy-cli toxic add`

... so the inspect output looks good. however, I'm having issues getting it to work. it seems to be deplaying requests some of the time but also long periods of undelayed, non-toxic proxying.

so I tried the other way but now latency is set to 0 for some reason, so maybe it's the comma separated approach after all? edit: in testing the toxic does seem to work effectively

root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# toxiproxy-cli list
Listen          Upstream        Name    Enabled Toxics
======================================================================
[::]:9042       cassandra:9042  cassandra       true    None

Hint: inspect toxics with `toxiproxy-cli inspect <proxyName>`
root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# toxiproxy-cli toxic add -t latency -a latency=1000 -a jitter=500 cassandra                                                              
Added downstream latency toxic 'latency_downstream' on proxy 'cassandra'
root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# toxiproxy-cli list                                                        
Listen          Upstream        Name    Enabled Toxics
======================================================================
[::]:9042       cassandra:9042  cassandra       true    1

Hint: inspect toxics with `toxiproxy-cli inspect <proxyName>`
root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# toxiproxy-cli inspect cassandra
Name: cassandra Listen: [::]:9042       Upstream: cassandra:9042
======================================================================
Upstream toxics:
Proxy has no Upstream toxics enabled.

Downstream toxics:
latency_downstream: type=latency stream=downstream toxicity=1.00 attributes=[ jitter=500 latency=0 ]

Hint: add a toxic with `toxiproxy-cli toxic add`
root@toxiproxy:/app/src/github.com/Shopify/toxiproxy# 

When I try this outside of the docker container, using the latest code in git, the comma separated approach doesn't work at all and I have to switch to multiple -a args

$ ./cli toxic add -t latency -a latency=1000,jitter=500 web
Failed to add toxic: AddToxic: HTTP 400: bad request body: json: cannot unmarshal string into Go value of type int64

$ ./cli toxic add -t latency -a latency=1000 -a jitter=500 web
Added downstream latency toxic 'latency_downstream' on proxy 'web'

the first one submits json like:

{"name":"","type":"latency","stream":"downstream","toxicity":1,"attributes":{"latency":"1000,jitter=500"}}

which looks invalid. the second approach submits:

{"attributes":{"latency":1000,"jitter":500},"name":"latency_downstream","type":"latency","stream":"downstream","toxicity":1}

Dieterbe avatar Oct 20 '16 11:10 Dieterbe

Hmm seems like the code recently moved from csv to multiple -a's. see 9a825b777e9bf44ecf1ce80ea05c6d13b07c755a and 87b6b616a0643f9c71b68a67996ff55da666b1d7

though 629724a77bd7013f3e5b99a853473eb4d310c6fb was just merged which introduces a csv based example again.

Dieterbe avatar Oct 20 '16 11:10 Dieterbe

Your findings are pretty much correct: Toxiproxy 2.0.0's cli uses comma separated attributes, and only the last -a flag is used due to how the cli parser works. This was changed for the upcoming 2.1.0 cli, which fixes the multiple -a issue, but removes the comma separated format due to issues with string values.

I'm assuming the reason the first json blob you looked at seems invalid is because of the blank name. This is actually fine since the server will use the default name of <type>_<stream>, but this was also changed for 2.1.0 to have the cli handle this default as well.

As for your jitter=500 latency=1000 test, the expected behaviour would be random latency between 500ms and 1500ms. I'm not sure how much latency you were actually getting based on your description, but there might be something else going on if you were getting any less than 500ms latency on anything.

Sorry for all the confusion though, the order release / doc changes are being merged in don't line up as well as the probably should.

xthexder avatar Oct 20 '16 16:10 xthexder

I'm assuming the reason the first json blob you looked at seems invalid is because of the blank name.

it looked invalid to me because the latency attribute has a value of 1000,jitter=500 which the http api can't decode because it expects an integer. this should result in a cleaner error for the user. suggestions:

  • perhaps the cli should validate that values that should be numeric validate to numbers, and/or maybe a warning that csv is no longer supported, before even hitting the server.
  • or if all validation is supposed to be server side, then the error could be clearer as well (it should mention at least which field failed)

As for your jitter=500 latency=1000 test, the expected behaviour would be random latency between 500ms and 1500ms. I'm not sure how much latency you were actually getting based on your description, but there might be something else going on if you were getting any less than 500ms latency on anything.

yes, it seemed like a bunch of my requests were responding in a few ms, this for minutes in a row with many a few hundred req/s, as if no latency nor jitter was being introduced at all. the chart looked like a flat line at 3ms or so. but this was using the wrong csv approach so maybe the toxic was not activated at all and the latency that I did see might have been the consequence of my benchmark workload.

As for the problem with toxiproxy-cli toxic add -t latency -a latency=1000 -a jitter=500 cassandra resulting in latency=0, I'm guessing the docker image is not up to date. when i build the tools locally i have to use the multiple -a approach, but in docker that doesn't seem to work. this matches your description of the issue where it only applies the last attribute (the jitter in this case)

FWIW in our projects we use CircleCI which has a free plan for open source projects, and we use it to automatically push docker images to dockerhub whenever a build succeeds (see https://github.com/raintank/metrictank/blob/master/circle.yml for example) this approach is working well for us.

Dieterbe avatar Oct 21 '16 08:10 Dieterbe

update regarding the "it seems to be delaying requests some of the time but also long periods of undelayed, non-toxic proxying." problem:

  1. using #140 i built a new docker container that includes the latest stuff from master.
  2. i did a cycle where i run toxiproxy-cli toxic add -t latency --tox 0.5 -a latency=3000 cassandra and then toxiproxy-cli toxic delete -n latency_downstream cassandra a few times back and forth (with a couple minutes in between), while a tool is running that queries at a rate of 100Hz (see blue line)

check out the this chart where I did this 3 times toxiproxy-detox

it shows the avg and max duration of a cassandra request (capped at 2s because my application times out queries and aborts them after 2s, that's why we're not seeing 3s there), measured each second. here are my observations: A) the toxic works more or less as expected for about 30s. the observed max values hit 2s although it doesn't look like a 50% rate since some seconds pass with a max of 3ms and an avg of 0ms. anyway this decrepancy might be due to statsd/graphite related artifacts or perhaps the way my app measures. B) afterwards, the toxicity seems to just disappear, always after 30s. when I then run the delete there's no difference whatsover in the numbers. if I then add the toxic again, it seems to work again (for 30s)

at one point i thought perhaps the cassandra client does something weird where it disconnects the original connection and then connects directly to my cassandra server which would explain why the toxic effect disappears, but everytime i delete and readd the toxic the toxicity reappears so that disproves that idea.

Dieterbe avatar Oct 21 '16 13:10 Dieterbe

PS: sorry for throwing a bunch of possibly unrelated stuff into the same ticket, but it's not yet clear to me what are still actual issues and how to cleanly separate them from one another. I hope we can solve and clarify much of this before the need to open new tickets, and i hope to contribute solutions as well.

Dieterbe avatar Oct 21 '16 14:10 Dieterbe

To be clear about versioning, currently git/master is "unreleased" and is subject to change. It's effectively development for Toxiproxy 2.1.0 right now. Docker and all release builds are Toxiproxy 2.0.0. The readme should also be documentation for 2.0.0 only, but it looks like I made a couple mistakes when merging things (go through and fix that soon).

Toxicity is per-connection, so what I think is happening is probably that your cassandra client is using a connection pool with persistant connections, so after 30 seconds of timeouts, the bad connections are probably being removed from the pool, leaving only non-toxic connections.

And lastly about the cli problems in master, I missed the bad attributes the first time I looked. I'll have to make some changes so that the error is better. The cli is designed to be independant of what toxics are defined, so that if a new toxic is added, the cli and api will still work with older versions. Technically "1000,jitter=500" might be a valid attribute for some other toxic type, so the error should be handled on the server.

I'd recommend using version 2.0.0 until 2.1.0 is fully released, since it will have the least issues.

xthexder avatar Oct 21 '16 20:10 xthexder