
Make ParquetFileReader extensible

Open · kenwenzel opened this issue 1 year ago · 11 comments

Describe the enhancement requested

I wanted to add a cache for ParquetFileReader.getColumnIndexStore(int blockIndex) and noticed that I cannot access the required blocks field from a subclass. To accomplish this I had to fall back to reflection, as can be seen here: https://github.com/linkedfactory/linkedfactory-pod/blob/c67b155e3fbf76082e2b24fd412efae5838badc6/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet/KvinParquet.java#L881
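
For reference, the workaround amounts to something like the following (a minimal sketch of the reflection fallback, assuming blocks is a List<BlockMetaData> as in the current ParquetFileReader; this is not the exact linked code):

```java
import java.lang.reflect.Field;
import java.util.List;

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

// Illustrative only: reflective access to the private 'blocks' field,
// roughly the kind of workaround described above.
final class BlocksAccess {
    @SuppressWarnings("unchecked")
    static List<BlockMetaData> readBlocks(ParquetFileReader reader) throws ReflectiveOperationException {
        Field f = ParquetFileReader.class.getDeclaredField("blocks");
        f.setAccessible(true);
        return (List<BlockMetaData>) f.get(reader);
    }
}
```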

I propose improving the extensibility of ParquetFileReader by changing certain core members (such as blocks) to protected, or by exposing them through protected getters.

Component(s)

No response

kenwenzel · Aug 30 '24 11:08

It doesn't make a big difference whether private members are changed to protected or public, since we enforce backward compatibility for all public APIs either way. BTW, is it possible to introduce an approach similar to https://github.com/apache/parquet-java/pull/1174?

wgtmac · Aug 31 '24 02:08

@wgtmac Do you also see a way to cache uncompressed data pages somehow? If enough memory is available, this could help reduce access times when data within the same range is requested repeatedly.

kenwenzel · Mar 22 '25 09:03

If you do need to cache (compressed or uncompressed) pages, I think it is unavoidable to implement some sort of page cache. Do you want to contribute this?

wgtmac · Mar 22 '25 12:03

Maybe I can contribute. Do you think a good start would be to modify ParquetFileReader.readAllPages and cache either all pages of a column chunk or the individual pages? In both cases I'm not yet sure what the correct cache key would be.
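
As a rough idea of what a page-level cache could look like (purely illustrative; parquet-java has no such class, and the key format is just a placeholder):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Purely illustrative LRU cache for page bytes, keyed by some string that
// identifies the page; parquet-java provides no such class today.
final class LruPageCache extends LinkedHashMap<String, byte[]> {
    private final int maxEntries;

    LruPageCache(int maxEntries) {
        super(16, 0.75f, true); // access-order iteration gives LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries;
    }
}
```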

kenwenzel · Mar 23 '25 07:03

It would be good to make the cache strategy configurable: compressed vs. uncompressed pages, the requested column IDs, the eviction rule, etc. For the cache key, my idea is to take a cache key prefix from the configuration (by default the file path) and concatenate it with row_group_id, column_id, and page_id. WDYT?
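
A sketch of what such a key could look like (the class and field names are illustrative, not an existing parquet-java API):

```java
import java.util.Objects;

// Illustrative cache key: a configurable prefix (e.g. the file path)
// combined with row group, column, and page ids.
final class PageCacheKey {
    private final String prefix;
    private final int rowGroupId;
    private final int columnId;
    private final int pageId;

    PageCacheKey(String prefix, int rowGroupId, int columnId, int pageId) {
        this.prefix = prefix;
        this.rowGroupId = rowGroupId;
        this.columnId = columnId;
        this.pageId = pageId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PageCacheKey)) {
            return false;
        }
        PageCacheKey k = (PageCacheKey) o;
        return rowGroupId == k.rowGroupId
            && columnId == k.columnId
            && pageId == k.pageId
            && prefix.equals(k.prefix);
    }

    @Override
    public int hashCode() {
        return Objects.hash(prefix, rowGroupId, columnId, pageId);
    }

    @Override
    public String toString() {
        return prefix + ":" + rowGroupId + ":" + columnId + ":" + pageId;
    }
}
```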

@gszadovszky @Fokko may have good advice too.

wgtmac · Mar 23 '25 13:03

I'm not sure I completely understand the use case. Configurable, "under the hood" caching only makes sense if we think it improves the workflow when parquet-java is used more like a black box. We would probably also want to expose such configs to end users so they can fine-tune the engine based on their data, but in that case we would need some analysis to prove that caching really improves performance. On the other hand, if caching is not generally useful but only helps in specific scenarios where custom code runs at the parquet-java level, I would vote for making ParquetFileReader extensible, or even for implementing a caching mechanism that can be used/adjusted from the Java API directly.

gszadovszky · Mar 24 '25 06:03

@gszadovszky My use case is repeated access to the same Parquet files to retrieve historical time-series data. Our SPARQL-based query engine allows joins between time-series points, e.g., "Find the current state of machine X when event Y occurred." If the machine states are close to each other (perhaps within the same data page), caching would help a lot.

BTW, we store the same data in LevelDB which is a lot faster than accessing the Parquet files. I think the main point here is that LevelDB caches decompressed blocks in memory.

kenwenzel · Mar 24 '25 09:03

Thanks @kenwenzel for the clarification. It seems your use case is more "black box" like, meaning you're not using parquet-java from the Java API but only transitively via Spark. So you would benefit more from the configurable approach.

What I would like to avoid is implementing a caching logic that might be configurable but is still useful only in a specific scenario. We are talking about a local, in-memory cache here, I guess. What about distributed caching? Maybe spilling?

I think I would push this request back a bit toward the original idea of making parquet-java extensible, instead of getting into an actual caching implementation. We may still add some configuration where users can hook their own implementation into parquet-java.
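
Such a hook could look roughly like this (a sketch only; neither the interface nor the configuration key below exists in parquet-java):

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical user-pluggable page cache hook; the interface and the
// "parquet.page.cache.impl" property are illustrative only.
public interface PageCache {
    byte[] get(String key);

    void put(String key, byte[] page);

    // Illustrative factory: instantiate the implementation named in the configuration.
    static PageCache fromConf(Configuration conf) throws ReflectiveOperationException {
        String impl = conf.get("parquet.page.cache.impl");
        return impl == null
            ? null
            : (PageCache) Class.forName(impl).getDeclaredConstructor().newInstance();
    }
}
```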

@wgtmac, WDYT?

gszadovszky · Mar 24 '25 15:03

@gszadovszky I use it directly via the Java API: https://github.com/linkedfactory/linkedfactory-pod/blob/f47949823cb96e92e2754f02d411c6290cbef63e/bundles/io.github.linkedfactory.core/src/main/java/io/github/linkedfactory/core/kvin/parquet/KvinParquet.java#L852
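
For context, direct use of the reader typically looks roughly like this (a generic sketch using the public API, not the linked code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Generic sketch of driving ParquetFileReader directly from Java:
// open a file and iterate over its row groups.
public class DirectReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                System.out.println("row count: " + rowGroup.getRowCount());
            }
        }
    }
}
```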

The challenge is that ParquetFileReader is not extensible. If I want to add caching, I have to duplicate its code and also the code of other related classes.

kenwenzel · Mar 24 '25 15:03

Thanks, @kenwenzel, I misunderstood the situation then. So I would keep this issue to its original title and description: making the parquet-java API more extensible. Feel free to put up a PR if you want to contribute; I'm happy to review.

After that, if we think that a specific caching approach would be good for other systems as well, feel free to put up another proposal for it.

gszadovszky · Mar 24 '25 15:03

I agree with @gszadovszky. Let's start simple.

wgtmac · Mar 25 '25 07:03