[core] Fix scan metric report for extra file-index files

Open Askwang opened this issue 5 months ago • 5 comments

Purpose

Reading table files consists of a scan phase and a partition read phase. When we use bloom-filter or other file indexes, the scan metrics report the total number of data files in the table, which makes the file index appear ineffective.

After analysis: in the scan phase, file index evaluation only takes effect for the embedded file index, not for the extra file index. The extra file index is actually applied later, in the partition read phase, so the reported scan metrics are inaccurate.
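
To make the mismatch concrete, here is a toy model of the two phases. It is only a sketch: `ToyDataFile`, `TwoPhaseFilterDemo`, and their fields are hypothetical names made up for illustration and are not Paimon classes. It assumes a file with an extra index is never pruned during scan and is only pruned once the index is evaluated at read time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical types, not Paimon classes.
final class ToyDataFile {
    final String name;
    final boolean hasEmbeddedIndex;  // true: index stored inline in the manifest; false: extra index file
    final boolean indexSaysMayMatch; // what the file index would answer for the query predicate

    ToyDataFile(String name, boolean hasEmbeddedIndex, boolean indexSaysMayMatch) {
        this.name = name;
        this.hasEmbeddedIndex = hasEmbeddedIndex;
        this.indexSaysMayMatch = indexSaysMayMatch;
    }
}

final class TwoPhaseFilterDemo {
    // Scan phase: only embedded indexes are evaluated, so files with an extra index always pass.
    static List<ToyDataFile> scan(List<ToyDataFile> files) {
        List<ToyDataFile> kept = new ArrayList<>();
        for (ToyDataFile f : files) {
            if (!f.hasEmbeddedIndex || f.indexSaysMayMatch) {
                kept.add(f);
            }
        }
        return kept;
    }

    // Read phase: the index (embedded or extra) is evaluated before the file is actually read.
    static List<ToyDataFile> read(List<ToyDataFile> scanned) {
        List<ToyDataFile> kept = new ArrayList<>();
        for (ToyDataFile f : scanned) {
            if (f.indexSaysMayMatch) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ToyDataFile> files = new ArrayList<>();
        files.add(new ToyDataFile("f1", false, false));
        files.add(new ToyDataFile("f2", false, false));
        files.add(new ToyDataFile("f3", false, true));
        List<ToyDataFile> scanned = scan(files);
        System.out.println("files reported by scan: " + scanned.size());
        System.out.println("files actually read:    " + read(scanned).size());
    }
}
```

In this model, the scan-side count equals the table's total file count whenever every file uses an extra index, even though most files are skipped at read time.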

Tests

API and Format

Documentation

Askwang avatar Jul 22 '25 09:07 Askwang

@Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?

Tan-JiaLiang avatar Jul 23 '25 08:07 Tan-JiaLiang

Yes. The scan phase only evaluates the embedded file index, so the scan result is incorrect for the extra file index.

Askwang avatar Jul 23 '25 08:07 Askwang

I think this was done on purpose. During the scanning phase, we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, which may slow down the scanning process.

Tan-JiaLiang avatar Jul 23 '25 09:07 Tan-JiaLiang

If so, the reported scan metrics are the total number of table files, but some files are filtered out when the data is actually read, which can easily lead to the misunderstanding that the index is not effective. Is there a better solution?

Askwang avatar Jul 23 '25 09:07 Askwang

We can introduce some metrics in the file index evaluation phase, WDYT?
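
As a rough sketch of what such metrics could look like (the class `FileIndexEvalMetrics` and its method names are hypothetical, not existing Paimon API), the read path could record one outcome per file after evaluating its index:

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical metrics holder for the file-index evaluation phase; not Paimon API.
final class FileIndexEvalMetrics {
    private final LongAdder evaluatedFiles = new LongAdder();
    private final LongAdder skippedFiles = new LongAdder();

    // Record the outcome of evaluating one file's index against the query predicate.
    void record(boolean fileMayMatch) {
        evaluatedFiles.increment();
        if (!fileMayMatch) {
            skippedFiles.increment();
        }
    }

    // Number of files whose index was evaluated.
    long evaluatedFiles() {
        return evaluatedFiles.sum();
    }

    // Number of files the index ruled out, so they were never read.
    long skippedFiles() {
        return skippedFiles.sum();
    }

    // Number of files that still had to be read after index evaluation.
    long remainingFiles() {
        return evaluatedFiles.sum() - skippedFiles.sum();
    }
}
```

`LongAdder` is used here on the assumption that several reader threads may record outcomes concurrently. Reporting the skipped count next to the scan count would show that the index is effective without adding I/O to the scan phase.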

Tan-JiaLiang avatar Jul 23 '25 09:07 Tan-JiaLiang