API,Core: Introduce metrics for data files by file format
There is an existing metric to hold the total number of non-skipped data files during a planFiles() call. This patch adds new metrics to break that number up by file format (Avro, ORC, Parquet). These new metrics could come handy for profiling queries on mixed file format tables.
@gaborkaszab, what is the use case for these metrics? We haven't seen very many people actually mixing file types in a table so I'd be interested in how this is intended to be used.
Hi @rdblue, We have recently implemented the support for mixed file format Iceberg tables in Impala: https://issues.apache.org/jira/browse/IMPALA-10610 One use case would be to gradually migrate the files of a table from one file format to another. (Motivation is that Impala is more performant with Parquet than ORC.)
@nastra As you are very familiar with this area, would you mind taking a quick look?
the code changes themselves LGTM but I'm still not sure that this is how we'd want to represent dimensions in metrics as this doesn't really scale to add a new metric field for each new dimension (where the file format is the dimension we use here).
I think it would be interesting to explore we we could actually represent such dimensions in a more natural and scalable way.
Let me give an update on this: At first I tried to introduce a Map<FileFormat, Counter> for this dimensional metric in ScanMetrics, but for me to make this work I ended up writing code to convert this into MAP<FileFormat, CounterResult> plus in ScanMetricsResultParser.fromJson() we still need to have the name of each filed (including the file format name) that isn't generic unfortunately. This approach seemed overcomplicated and still not 100% generic to handle automatically e.g. adding a new dimension. I'll give it a try to create something like a MultiDimensionCounter and let's see if it turns out better.
Quick update: I have a PoC for the new MultiDimensionCounter<T> type. There are some rough edges but apart from the Json conversions seems to work fine. The from-Json part is a bit tricky because the code has to be aware what the generic parameter type is for the multi-dimensional counter so I had to also write this info into Json and use reflection to recreate the type. Apart from this the ScanReportParser and ScanMetricsResultParser tests fail because the order of the Json elements seems to matter for the asserts but the order of the elements in the multi-dimensional counter is not guaranteed. Working on fixing these and I'll upload new patch. This is where I'm now: https://github.com/gaborkaszab/iceberg/commit/b268f65f3d588f30c637aa3db3f53be2a5f4c80d
Thanks for the review on https://github.com/gaborkaszab/iceberg/commit/b268f65f3d588f30c637aa3db3f53be2a5f4c80d @nastra! I think I addressed most of them.
The changes since the last patch:
- Made the MultiDimensionCounter non-generic.
- The MultiDimensionCounter receives a name in the constructor.
- The key for the internal counters is created from the MultiDimensionCounter name plus the name of the dimension.
- Made a MultiDimensionCounter interface plus a DefaultMultiDimensionCounter implementation for it.
- There is a NOOP implementation as well.
- The DefaultMultiDimensionCounter lives in ScanMetrics but is created through MetricsContext. I believe this is in line with how the other Counters and Timers are handled.
- The counters for the dimensions still live within the DefaultMultiDimensionCounter internally, but they are also created through MetricsContext. I figured this is the most natural way to represent the multi-dimensions.
- The MultiDimensionCounterResult has a function to aggregate the results for all dimensions.
- Updated the existing tests to cover the new counters, also to/from Json conversions are taken care. All tests should pass now.
- Wrote some new tests as well to have more coverage on the new code.
What is remaining: Currently there is only support for one dimension. I still have to figure out how this would look like having more than one dimensions. I think it has to satisfy the following requirements:
- The number of dimensions should be decided during instantiation of the multi-counter. Maybe constructor param?
- The increase() and value() functions should make sure that they receive as many "dimension key"s as many dimensions were decided in the constructor.
- The key for the internal counters could be constructed from the name of the multi-counter plus appending the name of the dimensions. I think this should be enough to have unique keys when creating through MetricsContext.
- The order of the dimensions matters.
Thanks again for the review, @nastra! I believe I have addressed all of your previous comments. Another round of review would be appreciated :)
thanks @gaborkaszab, will try and review it this week. You might also want to include @rdblue or @danielcweeks for a review.
Thanks for the comments, @nastra! I believe I've addressed all of them.
(plus did a rebase with master to resolve conflicts)
Hey @nastra, Would you mind taking another look?
Additionally, @danielcweeks @rdblue would you mind taking a look?
Hey @rdblue, @danielcweeks, @nastra, would you mind taking a look at this?
Oops, I did a rebase with master and accidentally closed this PR. Let me re-open
I did a rebase because there has been some conflicts. Could we somehow get this PR going? @nastra already took a look few months ago. @rdblue @danielcweeks Could you please take a look?
Just looping in more committers, hoping that someone might have some spare cycles, cc @pvary @szehon-ho @RussellSpitzer
Hey @nastra, First of all let me congratulate for your promotion to committer! Would you mind continue the review on this PR? I remember there was a point where you told me to get @danielcweeks and @rdblue involved as we would need them to merge it anyway but I guess this is no longer an issue :)
Thanks @gaborkaszab. I think it still makes sense to have @danielcweeks and/or @rdblue review this PR as they always can provide meaningful input which ultimately makes the outcome better.
Hey @rdblue , @danielcweeks could you please help me how we can make this PR moving again? I've put some efforts into it and also would be a nice addition into Impala's planner.
Hey @aokolnychyi, I saw you were pretty active on other PRs. I'm wondering if you have any spare cycles to take a quick look at this as well.
Hey @nastra , @rdblue , @danielcweeks , @jbonofre , It's been a while since I worked on this PR but it got to my radar again now. Would it be possible for any of you to take a look?
I'm not sure how to proceed here. Let me try to involve another set of committers with the hope that someone eventually take a look. cc @pvary @aokolnychyi @szehon-ho Would you mind taking a look? I can also give a heads-up on the history of this PR if needed.
Let me involve an even wider set of committers here since this has been open for a while now. Hopefully someone has some spare time to make this going again. Any reviews are appreciated!
@nastra @rdblue @danielcweeks @jbonofre @pvary @aokolnychyi @szehon-ho @flyrain @amogh-jahagirdar @Fokko @jackye1995 @liurenjie1024 @RussellSpitzer