
[Enhancement]: Over partitioning of chunks to make future node balancing optimal

Open nikkhils opened this issue 2 years ago • 5 comments

What type of enhancement is this?

User experience

What subsystems and features will be improved?

Access node, Multi-node, Node management, Partitioning

What does the enhancement do?

To provide multi-node elasticity, data nodes will be added to an existing multi-node cluster. We expect to rebalance chunks so that a newly added data node can be used effectively.

We need to investigate whether over-partitioning the chunks on existing data nodes would allow us to rebalance effectively onto newly added data nodes.

We could also consider a maximum number of data nodes that we want to support, if that helps with doing the over-partitioning effectively.

Implementation challenges

We expect changes in TimescaleDB, so this will be dependent on a release.

Related issues:

  • https://github.com/timescale/timescaledb/issues/3287
  • https://github.com/timescale/timescaledb/issues/1899

nikkhils avatar Feb 24 '22 12:02 nikkhils

Here's a write up of current state of partitioning with a discussion about necessary changes to support re-balancing: https://docs.google.com/document/d/1kWemwjy9J15lOrmj0mjPZsmhrJ3YO6NzmOxz5qAFAL8/edit?usp=sharing

erimatnor avatar Apr 14 '22 12:04 erimatnor

Transferred from duplicate/overlapping #4254:

Space partitioning is commonly used in multi-node setups to assign specific space partitions to data nodes so that partitionwise aggregation can be used. However, when the number of space partitions changes (e.g., when data nodes are added), the space dimension is re-divided among all data nodes, breaking the previous partitioning scheme and preventing partitionwise aggregation.
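For illustration (a toy sketch, not TimescaleDB's actual hash-partitioning function): re-dividing the space dimension from 2 to 3 partitions moves most keys to a different partition, which is why any node assignment tied to the old division breaks.

```python
# Stand-in for a hash-based space partitioning function (illustrative only).
def space_partition(key, num_partitions):
    return key % num_partitions

# Count how many of 100 keys land in a different partition after the
# space dimension is re-divided from 2 partitions to 3.
keys = range(100)
changed = sum(1 for k in keys if space_partition(k, 2) != space_partition(k, 3))
print(f"{changed}/100 keys change partition")  # 66/100
```

Because the hash space is re-divided wholesale, roughly two thirds of the keys move, invalidating partitionwise aggregation plans that assumed the old layout.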

One way to deal with this is to over-partition initially (i.e., creating more space partitions than there are data nodes) so that an entire space partition (or several) can be given to new data nodes without having to "repartition". However, the current chunk assignment in TimescaleDB is not sticky along the space dimension, so new chunks might not end up on the node that is responsible for the space partition the chunk belongs to.
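A minimal sketch of the over-partitioning idea (illustrative only; the `rebalance` helper and node names are hypothetical, not TimescaleDB code): because there are more space partitions than data nodes, adding a node only transfers ownership of whole partitions, and the partition boundaries themselves never change.

```python
from collections import Counter

def rebalance(assignment, new_node, n_moves):
    """Move n_moves whole partitions from the busiest nodes to new_node.
    Partition boundaries stay fixed; only ownership changes."""
    assignment = dict(assignment)
    for _ in range(n_moves):
        load = Counter(assignment.values())
        busiest = load.most_common(1)[0][0]
        # pick the lowest-numbered partition currently on the busiest node
        part = next(p for p in sorted(assignment) if assignment[p] == busiest)
        assignment[part] = new_node
    return assignment

# 8 space partitions over 2 data nodes; add dn3 and hand it 2 partitions
before = {p: ("dn1" if p < 4 else "dn2") for p in range(8)}
after = rebalance(before, "dn3", 2)
moved = [p for p in before if before[p] != after[p]]
print(moved)  # exactly 2 partitions change ownership; the other 6 stay put
```

The point of the sketch is that only the moved partitions need any data transfer; the hash-space division (8 partitions) is untouched, so partitionwise aggregation keeps working.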

To fix this issue, the chunk-data node assignment needs to be updated to make it "sticky". In other words, if a space partition is moved to a new data node, then new chunks in that partition should also be created on the right data node. While it might be possible to do this computationally without keeping state, a stateful approach might be preferable: such an approach is easy to introspect, and users might even be able to manually reassign responsibility for space partitions to specific data nodes.

Implementation challenges

A stateful approach likely requires a new metadata table that looks something like this:

| partition id | range start | range end | data nodes |
|--------------|-------------|-----------|------------|
| 1            | -INF        | 100       | dn1, dn2   |
| 2            | 100         | 200       | dn2, dn3   |
| 3            | 200         | INF       | dn3, dn1   |
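For illustration, a toy lookup against such a mapping table (the `PARTITION_MAP` structure and `nodes_for_chunk` function are hypothetical sketches, not TimescaleDB APIs): a new chunk's space value is matched against the stored ranges to find its responsible data nodes, which is what makes placement sticky.

```python
import math

# Rows mirroring the metadata table above:
# (range_start, range_end, [data nodes, primary first])
PARTITION_MAP = [
    (-math.inf, 100, ["dn1", "dn2"]),
    (100, 200, ["dn2", "dn3"]),
    (200, math.inf, ["dn3", "dn1"]),
]

def nodes_for_chunk(space_value):
    """Return the data nodes responsible for the space partition that
    contains space_value (range_start inclusive, range_end exclusive)."""
    for start, end, nodes in PARTITION_MAP:
        if start <= space_value < end:
            return nodes
    raise ValueError("no partition covers this value")

print(nodes_for_chunk(150))  # ['dn2', 'dn3']
```

Because the table is explicit state, it can be inspected directly, and reassigning a partition to another node is just an update to the corresponding row.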

erimatnor avatar Jun 02 '22 12:06 erimatnor

During implementation, I realized there are a lot of events and cases in which the partition mappings table needs to be updated, e.g., attaching/detaching data nodes, changing the replication factor or the number of partitions, etc. When assigning new data nodes to partitions, or splitting partitions, there are many potential choices to make (e.g., which data node should be the "primary" and which ones should be replicas), which requires some thought.

In order not to make the initial PR too complicated, it makes sense to first merge something that is less "smart" about these cases.

Also, to avoid missing any of the partition-mapping updates these events require, the design doc has been updated with the actions to take for each event: https://docs.google.com/document/d/1kWemwjy9J15lOrmj0mjPZsmhrJ3YO6NzmOxz5qAFAL8/edit#

erimatnor avatar Jun 09 '22 12:06 erimatnor

Merged the first PR to implement better over-partitioning support: https://github.com/timescale/timescaledb/pull/4443

erimatnor avatar Aug 09 '22 12:08 erimatnor
