Style guide and cookbook
Per some Gitter discussion, a guide to when to use certain features, when more than one could do the job, would be helpful.
Onyx don't: submit only one segment to a job and let it multiply by many orders of magnitude within the job. This approach completely gives up fault tolerance and max-pending's natural backpressure.
Onyx do: track retry counts via metrics. If retries climb too high, lower max-pending on your inputs or increase the pending timeout.
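As a sketch, these knobs live on the input task's catalog entry. The task name here is hypothetical and the core.async plugin is used purely as a placeholder; the :onyx/* keys are the standard pre-0.10 settings:

```clojure
;; Hypothetical input catalog entry showing the backpressure knobs.
{:onyx/name :read-input
 :onyx/plugin :onyx.plugin.core-async/input ; placeholder plugin
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/max-pending 5000       ; lower this if retries climb
 :onyx/pending-timeout 60000  ; ms before an unacked segment is retried
 :onyx/batch-size 50
 :onyx/batch-timeout 1000}    ; ms
```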
Q: why am I seeing too many retries (otherwise known as: why are some things coming out of order?!)? A: max-pending may be too high. Make sure you're batching adequately (writers may be falling behind), or the per-batch overhead of small batches may be too large.
Q: why is my complete latency so high? A: a few possibilities:
- you have too many intermediate tasks, or they each generate too many intermediate segments
- your throughput isn't high enough
- your batch timeout may be too high (if you're not high volume, you may be hitting the batch timeout before emitting)
- you may want to reduce the pending-timeout if your retries are turning around too slowly (but be careful you don't start a retry storm)
Q: why do I X? A: first, get metrics set up.
Q: how do I filter out segments? A: use flow conditions, or an onyx/fn that returns an empty vector.
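A sketch of both approaches; the task names, the `:age` key, and the threshold are all hypothetical, while the :flow/* keys and the four-argument predicate arity follow the flow conditions documentation:

```clojure
;; Option 1: a flow condition that only routes segments matching a predicate.
(defn adult? [event old-segment new-segment all-new]
  (>= (:age new-segment) 18))

{:flow/from :process-people
 :flow/to [:write-output]
 :flow/predicate ::adult?}

;; Option 2: an :onyx/fn that returns the segment to keep it,
;; or an empty vector to drop it.
(defn keep-adults [segment]
  (if (>= (:age segment) 18)
    segment
    []))
```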
Q: how do I add extra behaviour to my tasks outside of onyx/fn? A: have a look at the lifecycle docs (link).
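For instance, a lifecycle can open a resource before a task starts and close it after the task stops. In this sketch `connect-to-db`, `disconnect`, and the task name are hypothetical; the :lifecycle/* keys are the documented hook names:

```clojure
;; before-task-start returns a map that is merged into the event map.
(defn inject-conn [event lifecycle]
  {:my/conn (connect-to-db)}) ; connect-to-db is hypothetical

(defn close-conn [event lifecycle]
  (disconnect (:my/conn event)) ; disconnect is hypothetical
  {})

(def db-calls
  {:lifecycle/before-task-start inject-conn
   :lifecycle/after-task-stop close-conn})

;; Lifecycle entry wiring the calls map to a task:
{:lifecycle/task :write-output
 :lifecycle/calls ::db-calls}
```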
Q: how should I benchmark on a single machine? A: definitely turn off messaging short-circuiting (link), but only do this for benchmarking.
Q: some performance question A: https://github.com/onyx-platform/onyx/blob/0.7.x/doc/user-guide/performance-tuning.md
(possibly should revise with some of these answers)
Q: how do I ensure that after I kill a job and start a new one, it picks back up where it left off? A: look into the checkpointing features of your plugin.
Related Q: how do I do rolling deploys? A: insert best practices here.
Onyx do: prefer 3-4 smaller nodes over one bigger node; it's better for fault tolerance.
Onyx do: see if you can rationalise how many tasks you have, especially if they merely feed into each other. You should generally have one virtual peer per core, so too many tasks may mean you need to oversubscribe your cores. Extra tasks also add latency and serialisation overhead, and can cause extra retries.
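When two function tasks merely feed into each other, a sketch of collapsing them is simply composing the functions and publishing one catalog entry instead of two. All names here are hypothetical:

```clojure
;; Two steps that used to be separate tasks:
(defn parse [segment]
  (update segment :payload read-string))

(defn enrich [segment]
  (assoc segment :seen-at (System/currentTimeMillis)))

;; Compose them into a single :onyx/fn:
(def parse-and-enrich (comp enrich parse))

;; One catalog entry instead of two, saving a virtual peer,
;; a serialisation hop, and extra latency:
{:onyx/name :parse-and-enrich
 :onyx/fn ::parse-and-enrich
 :onyx/type :function
 :onyx/batch-size 50}
```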
Something about confusion about virtual peers starting up and how they're allocated.
Retries (metrics): look at batch latency for your tasks. If any task's batch latency is a significant proportion of your pending-timeout, then something is wrong. Optimise that task or increase the pending-timeout.
Increasing batch size helps increase throughput for plugins, and for lifecycle users that operate on the whole batch. Increasing batch size with slow function calls on segments can hurt you, though, because it reduces your chance of acking a segment before the pending-timeout expires.
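A sketch of the trade-off as two catalog entries; the task names and values are hypothetical, and the right numbers depend on measuring batch latency against your pending-timeout:

```clojure
;; A high-throughput writer benefits from larger batches,
;; amortising per-batch overhead:
{:onyx/name :write-output
 :onyx/type :output
 :onyx/batch-size 500
 :onyx/batch-timeout 50} ; ms; keep low so small trickles still flush

;; A task with a slow per-segment function should use small batches,
;; so each segment acks well before the input's pending-timeout:
{:onyx/name :call-slow-api
 :onyx/fn ::call-slow-api
 :onyx/type :function
 :onyx/batch-size 5
 :onyx/batch-timeout 1000}
```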