[core] Fix scan metric report for extra file-index files
Purpose
Reading a table consists of a scan phase and a partition read phase. When we use a bloom filter or other file indexes, the reported scan metrics equal the total number of data files in the table, which makes the file index seem ineffective.
After analysis, we found that in the scan phase, file index evaluation only takes effect for the embedded file index, not for the extra file index. The extra file index does take effect, but only in the partition read phase, which makes the reported metrics inaccurate.
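The mismatch can be illustrated with a toy model. This is a minimal sketch, not Paimon's code: `ScanMetricSketch`, `DataFile`, `scanPhase`, and `readPhase` are hypothetical names, and the `matches` flag stands in for the result of evaluating a predicate against a real index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ScanMetricSketch {

    // Hypothetical stand-in for a data file entry.
    public static final class DataFile {
        public final String name;
        public final boolean extraIndex; // true: index stored separately; false: embedded
        public final boolean matches;    // assumed outcome of evaluating the index

        public DataFile(String name, boolean extraIndex, boolean matches) {
            this.name = name;
            this.extraIndex = extraIndex;
            this.matches = matches;
        }
    }

    // Scan phase: only the embedded index is evaluated, so every file
    // with an extra index is kept regardless of whether it matches.
    public static List<DataFile> scanPhase(List<DataFile> files) {
        return files.stream()
                .filter(f -> f.extraIndex || f.matches)
                .collect(Collectors.toList());
    }

    // Read phase: the extra index is evaluated too, so non-matching files are skipped.
    public static List<DataFile> readPhase(List<DataFile> scanned) {
        return scanned.stream()
                .filter(f -> f.matches)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
                new DataFile("e1", false, false), // embedded index, filtered during scan
                new DataFile("x1", true, true),
                new DataFile("x2", true, false),  // extra index, filtered only at read time
                new DataFile("x3", true, false));
        List<DataFile> scanned = scanPhase(files);
        List<DataFile> read = readPhase(scanned);
        System.out.println("scan metric: " + scanned.size()); // prints "scan metric: 3"
        System.out.println("files read: " + read.size());     // prints "files read: 1"
    }
}
```

In this model the scan metric reports three files although only one is actually read, which is the gap described above.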
Tests
API and Format
Documentation
@Akwangg Thank you for your contribution! Did you mean evaluating the extra file index during the scanning phase?
Yes. The scan phase only evaluates the embedded file index, so the scan result is incorrect for the extra file index.
I think this was done on purpose. During the scanning phase, we aim to evaluate as quickly as possible, so we only evaluate the embedded file index, which is stored in the manifest. The extra file index, however, is stored independently and requires additional I/O to load, so evaluating it may slow down the scanning process.
If so, the reported scan metrics count all of the table's files, even though some files are filtered out when the data is actually read. This can easily give the impression that the index is not effective. Is there a better solution?
We can introduce some metrics in the file index evaluation phase, WDYT?
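One way to read this suggestion, as a minimal sketch only: count how many files are evaluated and how many are skipped at the point where the extra file index is checked in the read phase. `FileIndexEvalMetrics` and its methods are hypothetical names, not existing Paimon classes, and a real change would still need to expose these counters through the project's metrics reporting.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

public class FileIndexEvalMetrics {

    private final AtomicLong evaluated = new AtomicLong();
    private final AtomicLong skipped = new AtomicLong();

    // Wraps an index check (true means "keep the file") so that every
    // evaluation and every skipped file is counted.
    public <T> Predicate<T> instrument(Predicate<T> indexCheck) {
        return file -> {
            evaluated.incrementAndGet();
            boolean keep = indexCheck.test(file);
            if (!keep) {
                skipped.incrementAndGet();
            }
            return keep;
        };
    }

    public long evaluated() {
        return evaluated.get();
    }

    public long skipped() {
        return skipped.get();
    }

    public static void main(String[] args) {
        FileIndexEvalMetrics metrics = new FileIndexEvalMetrics();
        // Assumption: file names starting with "hit" pass the index check.
        Predicate<String> check = metrics.instrument(name -> name.startsWith("hit"));
        check.test("hit-0");
        check.test("miss-0");
        check.test("miss-1");
        System.out.println("evaluated=" + metrics.evaluated()); // prints "evaluated=3"
        System.out.println("skipped=" + metrics.skipped());     // prints "skipped=2"
    }
}
```

Keeping the counters in the read phase leaves the scan phase free of the additional I/O mentioned earlier, while still making the effect of the extra file index visible.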