
Unload big partitions: automatically tune schema splits and ops per second on retry for timeouts

Open · phact opened this issue 2 years ago · 3 comments

Users dumping entire tables often hit timeouts when they reach large partitions. The workaround is to manually tune splits and throughput until the unload succeeds, but this is very time-consuming and error-prone.

It would be great if dsbulk could handle this common scenario by itself.
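Today's manual workaround looks roughly like this, rerun with different values until the unload succeeds (`schema.splits` and `executor.maxPerSecond` are the settings being tuned; the keyspace, table, and numbers below are illustrative):

```bash
# First attempt failed with read timeouts on a large partition, so
# retry with more (hence smaller) splits and a lower throughput ceiling.
dsbulk unload -k myks -t mytable -url ./out \
  --schema.splits 64C \
  --executor.maxPerSecond 5000
```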


phact · Aug 10 '22 18:08

Related: #448.

adutra · Aug 10 '22 19:08

I don't think tuning splits would make a big difference, and by the way, that's near impossible: the splits determine how many token ranges are going to be read, so this happens at a very early phase.

But tuning throughput, yes, definitely. Probably based on latencies, and probably governed by a high/low watermark system.

adutra · Aug 10 '22 19:08
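A minimal sketch of what such a latency-driven, high/low watermark throttle could look like (illustration only, not dsbulk's actual executor API; every name and parameter here is made up):

```java
/**
 * Illustration only: a latency-driven throttle with high and low
 * watermarks. Every completed read reports its latency; when the
 * moving average rises above the high watermark the permitted rate
 * is cut, and when it drops below the low watermark the rate is
 * slowly raised again.
 */
final class AdaptiveRateLimiter {

    private final double highWatermarkNanos; // back off above this latency
    private final double lowWatermarkNanos;  // speed up below this latency
    private final double minRate;
    private final double maxRate;
    private double permitsPerSecond;
    private double avgLatencyNanos;

    AdaptiveRateLimiter(long highMillis, long lowMillis,
                        double initialRate, double minRate, double maxRate) {
        this.highWatermarkNanos = highMillis * 1_000_000.0;
        this.lowWatermarkNanos = lowMillis * 1_000_000.0;
        this.permitsPerSecond = initialRate;
        this.minRate = minRate;
        this.maxRate = maxRate;
    }

    /** To be called after each request completes (or times out). */
    synchronized void onResponse(long latencyNanos) {
        // Exponentially weighted moving average of observed latencies.
        avgLatencyNanos = avgLatencyNanos == 0
                ? latencyNanos
                : 0.9 * avgLatencyNanos + 0.1 * latencyNanos;
        if (avgLatencyNanos > highWatermarkNanos) {
            // Multiplicative decrease: shed load fast when nearing timeout.
            permitsPerSecond = Math.max(minRate, permitsPerSecond / 2);
        } else if (avgLatencyNanos < lowWatermarkNanos) {
            // Additive increase: probe back up gently once latencies recover.
            permitsPerSecond = Math.min(maxRate, permitsPerSecond + 100);
        }
    }

    /** Current target rate; the executor would enforce this ceiling. */
    synchronized double currentRate() {
        return permitsPerSecond;
    }
}
```

Multiplicative decrease reacts quickly when latencies approach the timeout, while additive increase probes back up slowly, so the rate settles just under what the cluster can sustain.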

> I don't think tuning splits would make a big difference

It does. This is how I've had to do things many times when a dsbulk unload fails.

The cause is usually a big partition; smaller splits can help the unload actually finish. Sometimes, when that doesn't do the trick, we end up having to bisect the token range around it and then throttle.

phact · Aug 11 '22 16:08
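Concretely, the bisect-and-throttle fallback described above might look like the following, assuming the hot partition hashes to token 42 (all names, token values, and rates are placeholders; a custom `-query` without `:start`/`:end` placeholders runs as a single unsplit query):

```bash
# Unload the ring on either side of the hot partition at full speed.
dsbulk unload -url ./out-left \
  -query "SELECT * FROM myks.mytable WHERE token(pk) <= 41"
dsbulk unload -url ./out-right \
  -query "SELECT * FROM myks.mytable WHERE token(pk) > 43"

# Then unload the narrow range containing the big partition, throttled.
dsbulk unload -url ./out-hot \
  -query "SELECT * FROM myks.mytable WHERE token(pk) > 41 AND token(pk) <= 43" \
  --executor.maxPerSecond 1000
```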