doris-spark-connector

[Improvement] no need to wait after a successful retry

Open fornaix opened this issue 1 year ago • 7 comments

Problem Summary:

When using Utils.retry, even if the operation succeeds, we still wait for doris.sink.batch.interval.ms milliseconds afterwards.

This PR fixes that.
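To illustrate the intended behavior, here is a minimal sketch of a retry helper that sleeps only between failed attempts and returns immediately on success. It is not the connector's Utils.retry; the class name, method signature, and the sleeps counter are hypothetical and exist only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class RetrySketch {
    // Hypothetical counter, present only so the number of waits can be observed.
    static final AtomicInteger sleeps = new AtomicInteger();

    // Runs the task up to maxRetries + 1 times. Waits intervalMs only after a
    // failed attempt that will be followed by another attempt.
    static <T> T retry(int maxRetries, long intervalMs, Callable<T> task) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.call(); // success: return at once, no trailing wait
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    sleeps.incrementAndGet();
                    Thread.sleep(intervalMs);
                }
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = retry(3, 10L, () -> {
            if (calls[0]++ < 2) {
                throw new IllegalStateException("simulated failure");
            }
            return "ok";
        });
        System.out.println(result + " after " + sleeps.get() + " waits");
    }
}
```

With this shape, a task that succeeds on its first attempt never sleeps at all.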

Checklist(Required)

  1. Does it affect the original behavior: (No)
  2. Has unit tests been added: (No)
  3. Has document been added or modified: (No Need)
  4. Does it need to update dependencies: (No)
  5. Are there any changes that cannot be rolled back: (No)

fornaix avatar Sep 08 '23 10:09 fornaix

This configuration is to prevent exceptions caused by too frequent imports. What problems will the pause between batches cause to your job?

gnehil avatar Sep 11 '23 10:09 gnehil

Got it, thx. @gnehil If we increase the retry interval, we will wait a long time after each insertion. Maybe it would be better to split it into two parameters?

fornaix avatar Sep 12 '23 03:09 fornaix

You can reduce the batch loading interval by setting the doris.sink.batch.interval.ms option. The default value of this option is 50 (ms). You can also set it to 0, so that there is no interval between batches. Also, can you briefly describe the idea of "split into two parameters"?
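As a rough sketch of the suggestion above, the option can be collected into a plain string map, which could then be handed to something like Spark's DataFrameWriter.options(Map). Only the option key comes from this thread; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SinkOptionsSketch {
    // Builds a write-option map containing only the batch interval.
    // A value of 0 means no pause between batches, as described above.
    static Map<String, String> sinkOptions(long batchIntervalMs) {
        if (batchIntervalMs < 0) {
            throw new IllegalArgumentException("interval must be >= 0");
        }
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("doris.sink.batch.interval.ms", Long.toString(batchIntervalMs));
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(sinkOptions(0));
    }
}
```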

gnehil avatar Sep 13 '23 02:09 gnehil

@gnehil I mean providing two parameters:

  1. doris.sink.batch.interval.ms: Control the batch flush interval
  2. doris.sink.retry.interval.ms: Control the retry interval

When the retry interval is increased, it will not affect the batch flush interval.
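A small sketch of why the split helps, assuming (my assumption, not stated in the thread) that the batch interval is applied between consecutive flushes and the retry interval after each failed attempt. doris.sink.retry.interval.ms is only the name proposed here, and the class below is hypothetical.

```java
public class IntervalSketch {
    final long batchIntervalMs; // stands in for doris.sink.batch.interval.ms
    final long retryIntervalMs; // stands in for the proposed doris.sink.retry.interval.ms

    IntervalSketch(long batchIntervalMs, long retryIntervalMs) {
        this.batchIntervalMs = batchIntervalMs;
        this.retryIntervalMs = retryIntervalMs;
    }

    // Total planned wait for `batches` flushes when every flush fails
    // `failuresPerBatch` times before it succeeds.
    long plannedWaitMs(int batches, int failuresPerBatch) {
        long retryWait = (long) batches * failuresPerBatch * retryIntervalMs;
        // Assumed: one pause between consecutive batches, none after the last.
        long batchWait = (long) Math.max(0, batches - 1) * batchIntervalMs;
        return retryWait + batchWait;
    }

    public static void main(String[] args) {
        // Raising the retry interval leaves the no-failure case unchanged.
        System.out.println(new IntervalSketch(50, 1_000).plannedWaitMs(3, 0));
        System.out.println(new IntervalSketch(50, 10_000).plannedWaitMs(3, 0));
    }
}
```

Under these assumptions, a larger retry interval only costs time when a flush actually fails.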

fornaix avatar Sep 13 '23 07:09 fornaix

Good idea, you can submit a PR for this.

gnehil avatar Sep 14 '23 09:09 gnehil

cc @gnehil

fornaix avatar Sep 15 '23 08:09 fornaix

Retrying over an iterator will lose data; refer to PR https://github.com/apache/doris-spark-connector/pull/145
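The general hazard can be sketched as follows: a Java Iterator cannot be rewound, so a retry that reuses it only sees the rows the failed attempt had not yet consumed. Buffering the rows before the first attempt is one common way around this. Whether the linked PR takes that approach is not stated here, and all names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorRetrySketch {
    static int attempts = 0;

    // Simulated load: reads rows from the iterator and fails on the first
    // attempt after two rows have been consumed.
    static List<String> flakyLoad(Iterator<String> rows) {
        attempts++;
        List<String> sent = new ArrayList<>();
        while (rows.hasNext()) {
            sent.add(rows.next());
            if (attempts == 1 && sent.size() == 2) {
                throw new IllegalStateException("simulated load failure");
            }
        }
        return sent;
    }

    // Retries with the same, partially consumed iterator.
    static List<String> retryOverIterator(List<String> source) {
        attempts = 0;
        Iterator<String> it = source.iterator();
        try {
            return flakyLoad(it);
        } catch (IllegalStateException e) {
            return flakyLoad(it); // rows read by the failed attempt are gone
        }
    }

    // Materializes the rows once and retries with a fresh iterator.
    static List<String> retryOverBuffer(List<String> source) {
        attempts = 0;
        List<String> buffer = new ArrayList<>(source);
        try {
            return flakyLoad(buffer.iterator());
        } catch (IllegalStateException e) {
            return flakyLoad(buffer.iterator()); // starts again from the first row
        }
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a", "b", "c", "d");
        System.out.println(retryOverIterator(rows)); // [c, d]
        System.out.println(retryOverBuffer(rows));   // [a, b, c, d]
    }
}
```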

JNSimba avatar Oct 10 '23 03:10 JNSimba