Add sizing guideline for ballast files

Open florence-crl opened this issue 5 years ago • 4 comments

Florence Morris (florence-crl) commented:

From @knz: there's a formula that's easier to understand than to explain. The idea is to combine two things.

  1. How fast their data grows over time. To know this, they should use metrics/monitoring and plot their storage growth over days/weeks/months. They also need to understand their storage spikes (e.g., bulk I/O events and the disk space those events need).
  2. How fast they are able to react to a "low storage" condition, e.g., by adding nodes or more disk space. Some businesses can react within 1 day; others need 2 weeks to work on it.

Once they know these two things, they need to choose a ballast size that covers the disk space growth (1) expected during their reaction period (2).

Examples:

  • They generate 1GB per week and need 2 weeks of turnaround to grow their disk space: they need a 2GB ballast.
  • They generate only 100MB per week, but they perform a bulk I/O event every day that needs 2GB, and they can only react to a disk shortage within 2 days: they probably need a 2-3GB ballast.
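The arithmetic behind these two examples can be sketched roughly as follows. This is only an illustration of the rule of thumb above, not an official formula: the function name `ballast_size_gb` and its parameters are hypothetical, and it assumes the bulk I/O space is transient, so only the largest single spike is added on top of steady growth.

```python
def ballast_size_gb(growth_gb_per_week, reaction_days, peak_spike_gb=0.0):
    """Rough ballast estimate: steady growth over the reaction period,
    plus headroom for the largest transient spike (an assumption)."""
    if growth_gb_per_week < 0 or reaction_days < 0 or peak_spike_gb < 0:
        raise ValueError("inputs must be non-negative")
    # Convert the reaction period to weeks to match the growth-rate unit.
    steady_growth_gb = growth_gb_per_week * (reaction_days / 7.0)
    return steady_growth_gb + peak_spike_gb


# First example: 1GB per week, 2 weeks to react.
print(ballast_size_gb(1.0, 14))
# Second example: 100MB per week, a 2GB daily bulk I/O event, 2 days to react.
print(ballast_size_gb(0.1, 2, peak_spike_gb=2.0))
```

The second call comes out slightly above 2GB, which is why rounding up to 2-3GB leaves some margin.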

One layer of complexity is that the intermediate state of the growth can appear larger than the long-term state, because of RocksDB compactions. For example, if they create a lot of data quickly, disk usage will be higher than the amount of data they have written through SQL until RocksDB compacts it.

Another layer is MVCC: if they delete data, the data is still around until it is garbage-collected (controlled by the zone config; the default is 25 hours). So if their workload is delete-heavy, they need to take that into account.

Both things can be safely ignored if their disk usage evolves slowly (which is common) and they can monitor it at a high level (e.g., our capacity metric in the UI, or their own export using Prometheus).
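For a delete-heavy workload where the MVCC effect cannot be ignored, a rough way to size the extra headroom is to multiply the delete rate by the GC window. The sketch below assumes a steady delete rate and uses the 25-hour default mentioned above; the function name `lingering_deleted_gb` is hypothetical, and the result is only an approximation because garbage collection does not run at the exact moment the window expires.

```python
def lingering_deleted_gb(deleted_gb_per_day, gc_ttl_hours=25.0):
    """Approximate disk space still held by deleted data that has not
    yet aged past the GC window (assumes a steady delete rate)."""
    if deleted_gb_per_day < 0 or gc_ttl_hours < 0:
        raise ValueError("inputs must be non-negative")
    return deleted_gb_per_day * gc_ttl_hours / 24.0


# Deleting 24GB per day with a 25-hour window keeps about 25GB around.
print(lingering_deleted_gb(24.0))
```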

From @jseldess: An addition is that we need to strongly recommend that they put alerts in place to notify them of “low storage” conditions so they can set their process in motion. For example, they could alert on Prometheus metrics when a node is running low on disk space. Ideally, a customer shouldn’t get to the point where they need to use a ballast file.
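As an illustration of the kind of "low storage" check such an alert would encode, here is a minimal sketch using only the Python standard library. In practice this would more likely be a Prometheus alerting rule; the 15% threshold and the function names are assumptions made for this example, not documented recommendations.

```python
import shutil


def is_low_on_storage(total_bytes, free_bytes, min_free_fraction=0.15):
    """Return True when the free fraction drops below the threshold.
    The 15% default is a hypothetical value chosen for illustration."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (free_bytes / total_bytes) < min_free_fraction


def check_store(path, min_free_fraction=0.15):
    """Check the filesystem that holds `path` (e.g., a store directory)."""
    usage = shutil.disk_usage(path)
    return is_low_on_storage(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # Check the filesystem of the current directory as a stand-in.
    print(check_store("."))
```

The threshold should be chosen so that the alert fires with at least one full reaction period of growth still available on disk.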

cc: @Annebirzin @piyush-singh since this ties into observability and alerting

Jira Issue: DOC-453

florence-crl, Mar 03 '20 18:03

Zendesk ticket #4842 has been linked to this issue.

RoachietheSupportRoach, Mar 03 '20 18:03

These needs also came up at the recent Education Offsite.

jseldess, Mar 03 '20 18:03

Now that we have automatic ballast files on node startup, do we still need detailed guidance here? @mwang1026, thoughts? Users can still set the ballast-size, so maybe we do?

jseldess, Nov 19 '21 23:11

I don't think so? We have a default size that we can document (I believe it's something like 1GB or 1% of disk) (But we should check before documenting that exactly :D )

mwang1026, Nov 30 '21 16:11