Revisit reserved cache and memory size

Open muhamadazmy opened this issue 3 years ago • 2 comments

Currently zos reserves 100G of SSD storage. We need to revise this value because it's too much.

We should also revise the amount of memory reserved for the system.

muhamadazmy avatar Aug 14 '22 18:08 muhamadazmy

I had a quick conversation with Kristof about the reserved system resources. Currently we always have 100GB of SSD "reserved" for the zos cache. In addition, 10% of node memory, with a minimum of 2GB, is reserved for the system.
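To make the current rule concrete, here is a minimal sketch of the two reservations described above. The names (`reservedMemory`, `reservedCache`, and so on) are hypothetical and not taken from the zos code base, and the sketch assumes "GB" means GiB.

```go
package main

import "fmt"

const (
	gib = uint64(1) << 30 // assumption: "GB" in the thread is read as GiB

	// Values quoted in the comment above. The identifiers are hypothetical.
	reservedCache     = 100 * gib // fixed SSD reservation for the zos cache
	minReservedMemory = 2 * gib   // lower bound for the memory reservation
)

// reservedMemory returns the memory held back for the system:
// 10% of the node's total memory, but never less than 2 GiB.
func reservedMemory(total uint64) uint64 {
	r := total / 10
	if r < minReservedMemory {
		return minReservedMemory
	}
	return r
}

func main() {
	for _, total := range []uint64{8 * gib, 64 * gib, 256 * gib} {
		fmt.Printf("total=%d GiB reserved=%d bytes\n", total/gib, reservedMemory(total))
	}
	fmt.Printf("cache reservation=%d GiB\n", reservedCache/gib)
}
```

On small nodes the 2 GiB floor dominates, so the effective reservation there is well above 10%.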

He thinks that's too much (especially the storage) and that we need to revise those values. For storage, the 100G is just a subvolume quota, but it is taken into account when calculating how much free storage is available for workloads.

The problem is that the "reserved" storage amount is not reported by the node. Instead, it is currently known only to the gridproxy, which always just assumes this amount is used by the system.

muhamadazmy avatar Aug 16 '22 08:08 muhamadazmy

@xmonader I suggest the following to keep changes to the grid to a minimum:

  • Right now, the node reports its FULL SSD capacity as SRU, while it internally reserves 100G. Only the proxy knows about this value right now (terraform?) and makes sure to take it into account during capacity planning here. This makes it hard to give each node a different value without changing the node object on the grid to also report a "reserved for system" capacity.
  • Instead, what if the node reports only usable capacity (so total - reserved) for both storage and memory? This way each node can have a different reserved value (and change it dynamically if needed), and the grid proxy doesn't need to know about any node-internal reservation.
  • This will also change minting, of course, which is going to be a problem.
  • Another solution is for the node object to have both "total capacity" (as right now) and a new field, "system reserved". Minting can then use the total, and capacity planning can use both (plus active contracts) to filter out nodes.
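
The last option can be sketched roughly as follows. This is only an illustration of the idea, not the actual grid node object: the `Node` and `Capacity` types, their field names, and the helper functions are all hypothetical. Minting reads the total, while capacity planning subtracts the system reservation and the active contracts.

```go
package main

import "fmt"

// Capacity holds SSD (SRU) and memory (MRU) amounts in bytes.
// Hypothetical type; the real node object has more fields.
type Capacity struct {
	SRU uint64
	MRU uint64
}

// sub returns a - b per field, clamped at zero to avoid unsigned underflow.
func sub(a, b Capacity) Capacity {
	clamp := func(x, y uint64) uint64 {
		if y > x {
			return 0
		}
		return x - y
	}
	return Capacity{SRU: clamp(a.SRU, b.SRU), MRU: clamp(a.MRU, b.MRU)}
}

// Node is a hypothetical node object carrying both values.
type Node struct {
	Total          Capacity // what the node reports today
	SystemReserved Capacity // proposed new field, set by the node itself
	Contracted     Capacity // sum of active contracts
}

// MintingCapacity is what minting would keep using: the full capacity.
func (n Node) MintingCapacity() Capacity { return n.Total }

// Free is what capacity planning would use: total - reserved - contracts.
func (n Node) Free() Capacity {
	return sub(sub(n.Total, n.SystemReserved), n.Contracted)
}

// Fits reports whether a requested workload fits in the free capacity.
func (n Node) Fits(req Capacity) bool {
	f := n.Free()
	return req.SRU <= f.SRU && req.MRU <= f.MRU
}

func main() {
	const gib = uint64(1) << 30
	n := Node{
		Total:          Capacity{SRU: 1000 * gib, MRU: 64 * gib},
		SystemReserved: Capacity{SRU: 100 * gib, MRU: 6 * gib},
		Contracted:     Capacity{SRU: 400 * gib, MRU: 20 * gib},
	}
	fmt.Println("fits:", n.Fits(Capacity{SRU: 200 * gib, MRU: 8 * gib}))
}
```

With this shape the proxy no longer needs a hard-coded 100G: it filters nodes on `Free()`, and each node is free to report a different reservation.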

muhamadazmy avatar Aug 16 '22 09:08 muhamadazmy

I think we need to implement this together with #1830, because it will change what the node reports as full capacity. The idea is that we start with a small reservation (say 10G), then monitor usage of the storage disk and increase the size if needed.
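
A rough sketch of the "start small and grow" idea is below. Only the 10G starting value comes from this comment. The 80% growth threshold, the 10 GiB step, and the 100 GiB cap are assumptions picked for illustration, and `nextReservation` is a hypothetical helper, not an existing zos function.

```go
package main

import "fmt"

const (
	gib = uint64(1) << 30

	initialReservation = 10 * gib  // starting value suggested above
	growStep           = 10 * gib  // assumption: grow in 10 GiB steps
	maxReservation     = 100 * gib // assumption: never exceed today's 100G
	growThresholdPct   = 80        // assumption: grow once 80% is in use
)

// nextReservation returns the reservation to apply given the current
// reservation and the observed cache usage, both in bytes. It only ever
// grows the reservation; shrinking is left out of this sketch.
func nextReservation(current, used uint64) uint64 {
	for used*100 >= current*growThresholdPct && current < maxReservation {
		current += growStep
	}
	if current > maxReservation {
		current = maxReservation
	}
	return current
}

func main() {
	r := initialReservation
	// Simulated usage samples, as a monitor loop might observe them.
	for _, used := range []uint64{3 * gib, 9 * gib, 25 * gib} {
		r = nextReservation(r, used)
		fmt.Printf("used=%d GiB reservation=%d GiB\n", used/gib, r/gib)
	}
}
```

Whatever consumes the reservation (capacity reporting, provisioning) would have to re-read it after each adjustment instead of treating it as a constant.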

muhamadazmy avatar Nov 15 '22 08:11 muhamadazmy

Provisiond needs to be aware that the reserved cache size is dynamic and can change at runtime.

muhamadazmy avatar Mar 20 '23 15:03 muhamadazmy