tidb lightning: change the implementation of `ScannedPos`

What problem does this PR solve?

Issue Number: ref https://github.com/pingcap/tidb/issues/61088

Problem Summary:

What changed and how does it work?

In the encode step, we track the amount of data read and show it on the panel. This is done by calling Parser.ScannedPos to get the latest read position. However, the current implementation is wrong:

https://github.com/pingcap/tidb/blob/11276faa9dc05ed7047f766a4c1fb45cd2b83109/pkg/lightning/mydump/parquet_parser.go#L377-L379

Because we open one reader for each column, Seek can't reflect the actual number of bytes read. Below is a simple illustration.

File layout (Dict page and meta are ignored)
--------------------------------------------------------------------------------------------------
|   Column 0 Group 0    |    Column 1 Group 0    |   Column 0 Group 1    |   Column 1 Group 1    | 
--------------------------------------------------------------------------------------------------
                ^                       ^
                |                       |
            Reader 0                 Reader 1
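
The situation in the diagram can be reproduced with a small standalone sketch. The `columnReader` type, the 100-byte chunk size, and the offsets are all hypothetical and chosen only for illustration; this is not the parser code from the repository.

```go
package main

import "fmt"

// columnReader is a hypothetical stand-in for one per-column reader.
// Each reader keeps its own offset into the same underlying file.
type columnReader struct {
	name   string
	offset int64 // current absolute position in the file
	read   int64 // bytes this reader has actually consumed
}

// consume advances the reader by n bytes.
func (r *columnReader) consume(n int64) {
	r.offset += n
	r.read += n
}

// totalRead sums the bytes consumed by all readers.
func totalRead(readers []*columnReader) int64 {
	var sum int64
	for _, r := range readers {
		sum += r.read
	}
	return sum
}

func main() {
	// Assumed layout: every column chunk is 100 bytes, so column 0 group 0
	// starts at offset 0 and column 1 group 0 starts at offset 100.
	r0 := &columnReader{name: "col0", offset: 0}
	r1 := &columnReader{name: "col1", offset: 100}

	// Both readers are halfway through their first chunk.
	r0.consume(50)
	r1.consume(50)

	fmt.Println("reader 0 offset:", r0.offset)
	fmt.Println("reader 1 offset:", r1.offset)
	fmt.Println("bytes actually read:", totalRead([]*columnReader{r0, r1}))
}
```

In this sketch neither reader's offset matches the number of bytes consumed across both readers, which is why a single Seek position is a poor progress measure here.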

Since it's hard to retrieve the total scanned bytes for the parser, we use the size of the rows read instead, which is better than using Seek.

Additionally, we use ScannedPos to update metrics for all types of data.

Check List

Tests

[ ] Unit test
[ ] Integration test
[ ] Manual test (add detailed scripts or steps below)
[X] No need to test
- [ ] I checked and no code files have been changed.

As mentioned before, this interface is used to update the metrics, so changes in this PR won't affect the execution of import jobs.

Side effects

[ ] Performance regression: Consumes more CPU
[ ] Performance regression: Consumes more Memory
[ ] Breaking backward compatibility

Documentation

[ ] Affects user behaviors
[ ] Contains syntax changes
[ ] Contains variable changes
[ ] Contains experimental features
[ ] Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Jun 11 '25 10:06 joechenrh

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benjamin2037 for approval. For more information see the Code Review Process.

Please ensure that each of them provides their approval before proceeding.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Jun 11 '25 10:06 ti-chi-bot[bot]

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jun 11 '25 10:06 tiprow[bot]

Codecov Report

:x: Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.4832%. Comparing base (4419a28) to head (e951863).
:warning: Report is 305 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #61670        +/-   ##
================================================
+ Coverage   73.0715%   73.4832%   +0.4116%     
================================================
  Files          1729       1760        +31     
  Lines        481120     492737     +11617     
================================================
+ Hits         351562     362079     +10517     
- Misses       108028     108373       +345     
- Partials      21530      22285       +755

| Flag | Coverage Δ | |
| --- | --- | --- |
| integration | `44.9431% <36.0000%> (?)` | |
| unit | `72.2741% <72.2222%> (-0.0552%)` | :arrow_down: |

| Components | Coverage Δ | |
| --- | --- | --- |
| dumpling | `52.7804% <ø> (ø)` | |
| parser | `∅ <ø> (∅)` | |
| br | `45.9046% <ø> (-1.0389%)` | :arrow_down: |

Jun 11 '25 10:06 codecov[bot]

/retest

Jun 17 '25 03:06 joechenrh

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Jun 17 '25 03:06 tiprow[bot]

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-lightning-integration-test | e9518631417e287273b364bf9d6993420f8ac60a | link | true | `/test pull-lightning-integration-test` |

Jun 17 '25 03:06 ti-chi-bot[bot]

To some extent, the value we get from ScannedPos won't affect the progress we calculate too much.

For each parser, ScannedPos points to the start position of the meta part and never changes after opening (because we open a new reader for each column). As the meta is small, the delta we get is almost the same as the file size. 😂

So let's just keep the current logic and fix it after switching to the new parsing library.
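
A quick arithmetic check of the point about the meta size, relying on the fact that Parquet stores its file metadata in a footer at the end of the file. The function name and the example sizes (a 1 GiB file with 64 KiB of metadata) are assumptions made up for illustration.

```go
package main

import "fmt"

// reportedFraction returns the share of the file that a progress tracker
// would count if the reported position stays at the start of the metadata.
// It assumes fileSize > 0 and 0 <= metaSize <= fileSize.
func reportedFraction(fileSize, metaSize int64) float64 {
	metaStart := fileSize - metaSize
	return float64(metaStart) / float64(fileSize)
}

func main() {
	fileSize := int64(1) << 30  // hypothetical 1 GiB file
	metaSize := int64(64) << 10 // hypothetical 64 KiB footer
	fmt.Printf("reported fraction of file: %.6f\n", reportedFraction(fileSize, metaSize))
}
```

With sizes in this range the reported position accounts for well over 99.9% of the file, so the final progress total is close to correct even though the position does not advance while rows are being read.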

Jul 30 '25 04:07 joechenrh