Make ParquetFileReader extensible
Describe the enhancement requested
I wanted to add a cache for ParquetFileReader.getColumnIndexStore(int blockIndex) and noticed that I can't access the required blocks field from a subclass.
To accomplish this I had to fall back to reflection, as can be seen here:
https://github.com/linkedfactory/linkedfactory-pod/blob/c67b155e3fbf76082e2b24fd412efae5838badc6/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet/KvinParquet.java#L881
I would propose improving the extensibility of ParquetFileReader by changing certain core members (like blocks) to protected, or by exposing them through protected getters.
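A minimal sketch of what such an extension point could enable. The classes below are simplified stand-ins, not the real parquet-java types (ParquetFileReader has much more state); they only illustrate the "protected getter instead of reflection" pattern:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for parquet-java types, for illustration only.
class BlockMetaData { }
class ColumnIndexStore { }

class FileReader {
    private final List<BlockMetaData> blocks;

    FileReader(List<BlockMetaData> blocks) {
        this.blocks = blocks;
    }

    // Proposed extension point: a protected accessor so subclasses
    // can reach the block metadata without reflection.
    protected List<BlockMetaData> getBlocks() {
        return blocks;
    }

    public ColumnIndexStore getColumnIndexStore(int blockIndex) {
        // Stand-in for an expensive computation over getBlocks().
        return new ColumnIndexStore();
    }
}

// With the accessor in place, a subclass can add caching cleanly.
class CachingFileReader extends FileReader {
    private final Map<Integer, ColumnIndexStore> cache = new HashMap<>();

    CachingFileReader(List<BlockMetaData> blocks) {
        super(blocks);
    }

    @Override
    public ColumnIndexStore getColumnIndexStore(int blockIndex) {
        // Compute once per block index, then serve from the cache.
        return cache.computeIfAbsent(blockIndex, super::getColumnIndexStore);
    }
}
```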
Component(s)
No response
It doesn't make a big difference whether we change private members to protected or public, since we enforce backward compatibility for all public APIs anyway. BTW, is it possible to introduce an approach similar to https://github.com/apache/parquet-java/pull/1174?
@wgtmac Do you also see a way to cache uncompressed data pages somehow? If enough memory is available, this could reduce access times when data within the same range is requested repeatedly.
If you do need to cache (compressed or uncompressed) pages, I think it is unavoidable to implement some sort of page cache. Do you want to contribute this?
Maybe I can contribute. Do you think a good start would be to modify ParquetFileReader.readAllPages and cache either all pages of a column chunk or the individual pages?
In both cases I'm not yet sure what the correct cache key would be.
It would be good to make the cache strategy configurable, including compressed vs. uncompressed pages, the requested column ids, the eviction rule, etc. For the cache key, my idea is to allow a cache key prefix via configuration (defaulting to the file path) and concatenate it with row_group_id, column_id, and page_id. WDYT?
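That key scheme could be sketched as a small value type. Everything here is hypothetical (no such class exists in parquet-java); a record gives equals/hashCode for free, which is what a cache key needs:

```java
// Hypothetical cache key: a configured prefix (defaulting to the file
// path) plus row group, column, and page ids. Two keys built from the
// same components compare equal, so they hit the same cache entry.
record PageCacheKey(String prefix, int rowGroupId, int columnId, int pageId) { }
```

Usage would be something like `new PageCacheKey("/data/f.parquet", 0, 3, 7)` as the key into whatever map or cache library backs the page cache.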
@gszadovszky @Fokko may have good advice, too.
I'm not sure I completely understand the use case. Configurable, "under the hood" caching only makes sense if we think it improves the workflow when parquet-java is used more like a black box. Also, we would probably want to expose such configs to end users so they can fine-tune the engine based on their data, but in that case we would need some analysis to prove that caching really improves performance.
On the other hand, if caching is not generally useful but only helps in specific scenarios where custom code runs at the parquet-java level, I would vote for making ParquetFileReader extensible, or even for implementing a caching mechanism that can be used/adjusted directly from the Java API.
@gszadovszky My use case is repeated access to the same Parquet files where historical time-series data is retrieved. Our SPARQL-based query engine allows joins between time-series points, e.g., "Find the current state of machine X when event Y occurred." If the machine states are close to each other (maybe within the same data page), then caching would help a lot.
BTW, we store the same data in LevelDB, which is a lot faster than accessing the Parquet files. I think the main point here is that LevelDB caches decompressed blocks in memory.
Thanks @kenwenzel for the clarification. It seems your use case is more "black box"-like, meaning you're not using parquet-java via the Java API but only transitively via Spark. So you would benefit more from the configurable approach.
What I would like to avoid is implementing caching logic that might be configurable but is still useful only for one specific scenario. It sounds like we are talking about a local, in-memory cache, I guess. What about distributed caching? Maybe spilling?
I would push this request back a bit toward the original idea of making parquet-java extensible instead of getting into an actual caching implementation. We may still add some configuration where users can hook their own implementation into parquet-java.
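One way such a hook could look, sketched with hypothetical names (none of these types exist in parquet-java today): the reader consults a user-supplied cache interface, so eviction policy, spilling, or even distributed caching stays entirely in user code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical extension point the user could plug an implementation into.
interface PageCache {
    Optional<byte[]> get(String key);
    void put(String key, byte[] page);

    // Default: no caching, preserving today's behavior.
    PageCache NOOP = new PageCache() {
        public Optional<byte[]> get(String key) { return Optional.empty(); }
        public void put(String key, byte[] page) { }
    };
}

// Example user implementation: a simple LRU cache bounded by entry count.
class LruPageCache implements PageCache {
    private final Map<String, byte[]> map;

    LruPageCache(int maxEntries) {
        // An access-ordered LinkedHashMap evicts its eldest entry on overflow.
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > maxEntries;
            }
        };
    }

    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(map.get(key));
    }

    public void put(String key, byte[] page) {
        map.put(key, page);
    }
}
```

The point of the interface shape is that parquet-java would only ever call `get`/`put`; whether the bytes live on-heap, off-heap, on disk, or in a remote store is the user's decision.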
@wgtmac, WDYT?
@gszadovszky I use it directly via Java API: https://github.com/linkedfactory/linkedfactory-pod/blob/f47949823cb96e92e2754f02d411c6290cbef63e/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet/KvinParquet.java#L852
The challenge is that ParquetFileReader is not extensible. If I want to add caching, I have to duplicate its code and also the code of other related classes.
Thanks, @kenwenzel, I've misunderstood the situation then. So, I would keep this issue to its original title and description: making the parquet-java API more extensible. Feel free to put up a PR if you want to contribute. I'm happy to review.
After that, if we think that a specific caching approach would be good for other systems as well, feel free to put up another proposal for it.
I agree with @gszadovszky. Let's start simple.