citus Shard splitting to the same node can make node run out of disk space due to lots of unconsumed

When doing a non-blocking split to the same node with a large shard group (e.g. 1TB), all of that data is copied to the local node, which generates just as much WAL. Postgres cannot remove this WAL, because the replication slot is not consumed during the COPY. I'm not sure there's much we can do here, but we should probably document this, and if/when we include shard splits in the rebalancer we should take this kind of thing into account.
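To watch how much WAL a slot is holding back during such a split, the slot's `restart_lsn` (from `pg_replication_slots`) can be compared with the current WAL position (from `pg_current_wal_lsn()`). The sketch below only does the arithmetic on two LSN strings; the function names are hypothetical and it is not part of Citus.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, slot_restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current position.

    This is roughly the amount of WAL the slot prevents Postgres from
    removing (ignoring rounding to whole WAL segments).
    """
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(slot_restart_lsn)
```

If this number keeps growing towards the size of the shard group while the COPY runs, the node is in the situation described above.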

Something similar was happening for this user when splitting to the coordinator. I'm not sure what the cause of that is. Maybe WAL could not be removed on the coordinator because the transaction that was doing the shard split was still open.

Reported on slack (long thread): https://citus-public.slack.com/archives/C0XRHT1KJ/p1691573609184889

Aug 16 '23 13:08 JelteF

In functions like EnsureEnoughDiskSpaceForShardMove() we do not check the WAL space used (or estimate future usage). This looks easy to fix there, or in a similar way in other places. Correct?

Oct 24 '23 10:10 c2main

I like that idea. But I think we would need a new EnsureEnoughDiskSpaceForShardSplit function and call that for shard splits, instead of reusing the EnsureEnoughDiskSpaceForShardMove function for this. This new function should then take into account that at least 2x the space of the original shard would be needed when splitting all shards to the current node (1x for the split shards + 1x for the temporary WAL that cannot be consumed until the copy is complete).
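As a rough illustration of that 2x estimate (not Citus code; the function names and the `wal_factor` parameter are hypothetical), the check could look like the sketch below. `wal_factor` is an assumed knob for cases where the WAL volume turns out to differ from the data volume; the default of 1.0 matches the "1x for WAL" estimate.

```python
def required_bytes_for_local_split(shard_group_bytes: int,
                                   wal_factor: float = 1.0) -> int:
    """Estimate the free space needed to split a shard group onto its own node.

    1x for the copied (split) shards, plus wal_factor x for the WAL that
    stays on disk until the replication slot is consumed after the copy.
    """
    split_copy = shard_group_bytes
    retained_wal = int(shard_group_bytes * wal_factor)
    return split_copy + retained_wal


def has_enough_space(free_bytes: int, shard_group_bytes: int,
                     wal_factor: float = 1.0) -> bool:
    """Return True if the node has room for the split under this estimate."""
    return free_bytes >= required_bytes_for_local_split(shard_group_bytes,
                                                        wal_factor)
```

With a 1TB shard group and the default factor, this asks for about 2TB free before starting the split.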

Oct 24 '23 11:10 JelteF

> I like that idea. But I think we would need a new EnsureEnoughDiskSpaceForShardSplit function and call that for shard splits, instead of reusing the EnsureEnoughDiskSpaceForShardMove function for this. This new function should then take into account that at least 2x the space of the original shard would be needed when splitting all shards to the current node (1x for the split shards + 1x for the temporary WAL that cannot be consumed until the copy is complete).

I second the idea of a new dedicated function. With wal_compression it might be less than twice the size (I suppose that such a split/move will produce a lot of full-page writes in the WAL).

Oct 24 '23 12:10 c2main

I guess it's worth testing that. But afaik full page writes don't apply here. Afaik full page writes only happen when an existing page is updated for the first time after a checkpoint, and in that case they are in addition to the actual updates in the WAL, not instead of them. Updates to existing pages shouldn't happen during the COPY phase though, since it should only be inserting.

Oct 24 '23 13:10 JelteF

f_p_w happens on the first modification of a page after a checkpoint, true. But the image contains the "change". Now I have a doubt about how this is handled with COPY...

Oct 24 '23 13:10 c2main
