lightning: change the implementation of `ScannedPos`
What problem does this PR solve?
Issue Number: ref https://github.com/pingcap/tidb/issues/61088
Problem Summary:
What changed and how does it work?
In the encode step, we track the amount of data read and show it on the panel. This is done by calling Parser.ScannedPos to get the latest read position. However, the current implementation is wrong:
https://github.com/pingcap/tidb/blob/11276faa9dc05ed7047f766a4c1fb45cd2b83109/pkg/lightning/mydump/parquet_parser.go#L377-L379
Because we open one reader for each column, the Seek position of any single reader can't reflect the actual number of bytes we have read. Below is a simple illustration.
File layout (Dict page and meta are ignored)
--------------------------------------------------------------------------------------------------
| Column 0 Group 0 | Column 1 Group 0 | Column 0 Group 1 | Column 1 Group 1 |
--------------------------------------------------------------------------------------------------
^ ^
| |
Reader 0 Reader 1
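The illustration above can be reproduced with a small, self-contained sketch. It is not the parser's code: `columnReader`, `readChunk`, and `simulate` are hypothetical names, the "file" is an in-memory string, and each column chunk is assumed to be 100 bytes. It only shows that the offset of one reader differs from the total number of bytes read across all readers.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// columnReader is a hypothetical stand-in for one per-column reader.
// Each instance keeps its own, independent file offset.
type columnReader struct {
	r *strings.Reader
}

// offset returns the reader's current position, the same way a
// Seek(0, io.SeekCurrent) call would on a real file handle.
func (c *columnReader) offset() int64 {
	pos, _ := c.r.Seek(0, io.SeekCurrent)
	return pos
}

// readChunk seeks to start, consumes n bytes, and returns how many
// bytes were actually read.
func (c *columnReader) readChunk(start, n int64) int64 {
	if _, err := c.r.Seek(start, io.SeekStart); err != nil {
		return 0
	}
	m, _ := io.CopyN(io.Discard, c.r, n)
	return m
}

// simulate reads row group 0 of a two-column file laid out as
// C0G0 | C1G0 | C0G1 | C1G1 and returns reader 0's offset together
// with the total bytes read by both readers.
func simulate() (reader0Pos, totalRead int64) {
	const chunk = 100 // assumed size of every column chunk
	file := strings.Repeat("x", 4*chunk)

	r0 := &columnReader{r: strings.NewReader(file)}
	r1 := &columnReader{r: strings.NewReader(file)}

	totalRead += r0.readChunk(0, chunk)     // Column 0 Group 0
	totalRead += r1.readChunk(chunk, chunk) // Column 1 Group 0
	return r0.offset(), totalRead
}

func main() {
	pos, total := simulate()
	fmt.Printf("reader 0 offset: %d, bytes actually read: %d\n", pos, total)
}
```

Under these assumptions, reader 0 reports only its own position, while twice as many bytes have been consumed in total.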
Since it's hard to retrieve the total scanned bytes from the parser, we use the size of the rows read so far instead, which is a better estimate than the Seek position.
Additionally, we use ScannedPos to update metrics for all types of data.
Check List
Tests
- [ ] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [X] No need to test
- [ ] I checked and no code files have been changed.
As mentioned before, this interface is used to update the metrics, so changes in this PR won't affect the execution of import jobs.
Side effects
- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility
Documentation
- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
None
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign benjamin2037 for approval. For more information see the Code Review Process. Please ensure that each of them provides their approval before proceeding.
Hi @joechenrh. Thanks for your PR.
PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Codecov Report
:x: Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.4832%. Comparing base (4419a28) to head (e951863).
:warning: Report is 305 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #61670 +/- ##
================================================
+ Coverage 73.0715% 73.4832% +0.4116%
================================================
Files 1729 1760 +31
Lines 481120 492737 +11617
================================================
+ Hits 351562 362079 +10517
- Misses 108028 108373 +345
- Partials 21530 22285 +755
| Flag | Coverage Δ | |
|---|---|---|
| integration | 44.9431% <36.0000%> (?) | |
| unit | 72.2741% <72.2222%> (-0.0552%) | :arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Components | Coverage Δ | |
|---|---|---|
| dumpling | 52.7804% <ø> (ø) | |
| parser | ∅ <ø> (∅) | |
| br | 45.9046% <ø> (-1.0389%) | :arrow_down: |
/retest
@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-lightning-integration-test | e9518631417e287273b364bf9d6993420f8ac60a | link | true | /test pull-lightning-integration-test |
To some extent, the value returned by ScannedPos doesn't affect the progress we calculate very much.
For each parser, ScannedPos points to the start position of the meta part and never changes after opening (because we open a new reader for each column). As the meta is small, the delta we get is almost the same as the file size. 😂
So let's just keep the current logic and fix it after switching to the new parsing library.
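To put a rough number on "almost the same as the file size", here is a small worked example. The sizes are made up (a 1 GiB file with 64 KiB of metadata), `reportedFraction` is a hypothetical helper, and the metadata is assumed to sit at the end of the file.

```go
package main

import "fmt"

// reportedFraction returns the share of the file that is reported as
// scanned when the position sits at the start of the metadata, which
// is assumed to occupy the last metaSize bytes of the file.
func reportedFraction(fileSize, metaSize int64) float64 {
	return float64(fileSize-metaSize) / float64(fileSize)
}

func main() {
	const fileSize = int64(1) << 30  // hypothetical 1 GiB data file
	const metaSize = int64(64) << 10 // hypothetical 64 KiB of metadata
	fmt.Printf("reported fraction: %.6f\n", reportedFraction(fileSize, metaSize))
}
```

With these made-up sizes the reported delta is within a hundredth of a percent of the file size, which matches the observation that the overall progress is barely affected.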