[FLINK-31192][connectors/dataGen] Fix dataGen takes too long to initi…
What is the purpose of the change
In the sequence mode, the dataGen field will preload all the sequence values when it is initialized, and it will take a long time during this period
Brief change log
- Adapt the way of preloading the value to generate a sequence value every time the next method is called
- The checkpoint saves not all the sequence values, but the position where the current task sends the sequence value
Verifying this change
DataGeneratorSourceTest and DataGenTableSourceFactoryTest
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving): no - The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no
@reswqa Can you review it for me, Thank you.
CI report:
- ec135c137777197b23ffa20dc488679adbe50020 UNKNOWN
- c911719877f16d656a473c68d0d03748e45dfc1d Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
Thanks @xuzhiwen1255 for creating this, It seems that the CI is failure, PTAL first.
@flinkbot run azure
@reswqa Re-running the cli is no problem.
@reswqa CI is fine now, can you come over and have a look?
@RyanSkraba @XComp Please check again, thanks.
@flinkbot run azure
On another note: We should rebase the branch to include FLINK-31962 which would make CI succeed again.
In addition to the comments, I feel like we're lacking some tests here. Is there a way to unit test the generator (maybe even in the context of a parallelism change)?
I think about it, the change of parallelism already exists in the existing unit test, I think we don't need to add the test, because it already includes the complete generated test case of the sequence.
In this DataGeneratorSourceTest#innerTestDataGenCheckpointRestore test method, the degree of parallelism is modified .
I think about it, the change of parallelism already exists in the existing unit test, I think we don't need to add the test, because it already includes the complete generated test case of the sequence.
Thanks for pointing that out. I missed DataGeneratorSourceTest. Do you still see the value in having a SequenceGeneratorTest testing the basic functionality of the SequenceGenerator, though? It might help to understand the contract of the SequenceGenerator class.
I think about it, the change of parallelism already exists in the existing unit test, I think we don't need to add the test, because it already includes the complete generated test case of the sequence.
Thanks for pointing that out. I missed
DataGeneratorSourceTest. Do you still see the value in having aSequenceGeneratorTesttesting the basic functionality of theSequenceGenerator, though? It might help to understand the contract of theSequenceGeneratorclass.
I don't think it is necessary to use SequenceGeneratorTest for testing, because the existing tests have covered its basic functions. Later, we may need to test how to avoid data skew and test sequence generation integrity when adjusting operator parallelism.
@XComp WDYT?
@flinkbot run azure
@XComp @RyanSkraba Any more questions?
Hi @xuzhiwen1255 , sorry for making you wait for so long. I was on vacation and have to focus on some other work right now. I try to get back to this PR soon'ish
Hi @xuzhiwen1255 , sorry for making you wait for so long. I was on vacation and have to focus on some other work right now. I try to get back to this PR soon'ish
@XComp It's all right, have a nice holiday.
I still think that it might make sense to introduce a
SequenceGeneratorTest. This would allow us to test the edge cases easily (e.g. reachingLong.MAX_VALUE) more specifically.
In #21971 we added SequenceGeneratorTest, I think we can add the corresponding test method in #21971
@XComp WDYT?
In https://github.com/apache/flink/pull/21971 we added SequenceGeneratorTest, I think we can add the corresponding test method in https://github.com/apache/flink/pull/21971
I'm not a big fan of adding tests that belong in a PR into a separate PR. You could just create the test class in this PR with the set of tests that fit to this PR. PR #21971 should then add additional tests related to its changes. The conflict resolution shouldn't be too complex because your adding independent test methods. :thinking:
I'm not a big fan of adding tests that belong in a PR into a separate PR. You could just create the test class in this PR with the set of tests that fit to this PR. PR #21971 should then add additional tests related to its changes. The conflict resolution shouldn't be too complex because your adding independent test methods.
OK let me add a test.
Any updates on that one @xuzhiwen1255 ? :slightly_smiling_face:
Any updates on that one @xuzhiwen1255 ? 🙂
@XComp Sorry, the company has been busy iterating recently. I will update it in the next few days.
No worries. Especially considering that I won't be that responsive in September in general. I just wanted to let you know to make prioritization of work easier for you.
@XComp I'm sorry that I haven't had time to deal with this problem recently due to some things. Now I plan to continue to deal with this issue. Do you have time to take a look?
I think so. I will modify it as soon as possible.
@flinkbot run azure
Hi @XComp I have solved the problem, please check again, thank you.